Basic types and basic kmer counting
MerCounting has a few basic types and methods to allow you do easily do some basic kmer counting.
The first and perhaps most obvious of these is a type to represent a kmer and its frequency, for this, MerCounting provides the MerCount type.
If you have a MerCount variable you can get the kmer value or the frequency value using the mer and freq getter methods.
A vector of these MerCounts constitute one option for a simple data structure for representing kmer frequency data. This packages defines other more dedicated types, but we will get to those in later sections of this manual.
Ok, so let's do some very basic kmer counting for a sequence!
First we need a sequence:
julia> using KmerAnalysis
julia> using BioSequences
julia> s = randdnaseq(50)
50nt DNA Sequence:
GAAGTTGCCTATGACTAGGGACAAACAAGATGCCCTCTTGCAGCCTGTACOk, let's say we wanted to count the 7-mers, we can collect all the 7-mers using BioSequences' kmer iterator:
julia> sevenmers = collect(each(DNAMer{7}, s))
44-element Array{BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}},1}:
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(1, GAAGTTG, CAACTTC)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(2, AAGTTGC, GCAACTT)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(3, AGTTGCC, GGCAACT)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(4, GTTGCCT, AGGCAAC)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(5, TTGCCTA, TAGGCAA)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(6, TGCCTAT, ATAGGCA)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(7, GCCTATG, CATAGGC)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(8, CCTATGA, TCATAGG)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(9, CTATGAC, GTCATAG)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(10, TATGACT, AGTCATA)
⋮
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(36, TCTTGCA, TGCAAGA)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(37, CTTGCAG, CTGCAAG)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(38, TTGCAGC, GCTGCAA)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(39, TGCAGCC, GGCTGCA)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(40, GCAGCCT, AGGCTGC)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(41, CAGCCTG, CAGGCTG)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(42, AGCCTGT, ACAGGCT)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(43, GCCTGTA, TACAGGC)
BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(44, CCTGTAC, GTACAGG)
julia> sevenmers[1]
Mer iteration result:
Position: 1
Forward: GAAGTTG
Backward: CAACTTCOk BioSequences' kmer iterator on its own yields a MerIterResult, which contains a tuple of (Base position in sequence, kmer, reverse complement of the kmer).
Now let's say I was only interested in counting canonical kmers in s. For any kmer and its reverse complement, the canonical version is the one that is lexicographically less.
So let's revise our collect above to give us a vector of the canonical kmers. Thankfully, BioSequences comes with a method called canonical, for exactly this purpose.
julia> sevenmers = collect(canonical(x) for x in each(DNAMer{7}, s))
44-element Array{BioSequences.Mer{BioSequences.DNAAlphabet{2},7},1}:
CAACTTC
AAGTTGC
AGTTGCC
AGGCAAC
TAGGCAA
ATAGGCA
CATAGGC
CCTATGA
CTATGAC
AGTCATA
⋮
TCTTGCA
CTGCAAG
GCTGCAA
GGCTGCA
AGGCTGC
CAGCCTG
ACAGGCT
GCCTGTA
CCTGTACIf you wanted to collect all the kmers in s "as is" as it were, you can use the fwmer method provided with BioSequences for this purpose.
julia> collect(fwmer(x) for x in each(DNAMer{7}, s))
44-element Array{BioSequences.Mer{BioSequences.DNAAlphabet{2},7},1}:
GAAGTTG
AAGTTGC
AGTTGCC
GTTGCCT
TTGCCTA
TGCCTAT
GCCTATG
CCTATGA
CTATGAC
TATGACT
⋮
TCTTGCA
CTTGCAG
TTGCAGC
TGCAGCC
GCAGCCT
CAGCCTG
AGCCTGT
GCCTGTA
CCTGTACThis is quite nice and terse code, but we haven't used KmerAnalysis to help us at all yet. MerCounting allows us to achieve the above with the use of a single function called collect_mers, instead of collecting a generator:
julia> sevenmers = collect_mers(CANONICAL, each(DNAMer{7}, s))
ERROR: MethodError: no method matching collect_mers(::Canonical, ::BioSequences.EveryMerIterator{BioSequences.Mer{BioSequences.DNAAlphabet{2},7},BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}})There are a few different methods of collect_mers for different types of input, check out the API reference for more info.
MerCounting exports CANONICAL and NONCANONICAL, which work with collect_mers and other MerCounting functions to dictate how the kmers are produced.
Ok so we have our canonical kmers, how can we count them? Well one of the ways to count the number of distinct values in a vector is to sort it and traverse it in one go. The collapse_into_counts! methods do this for you.
collapse_into_counts! will sort the input sevenmers vector in place.
julia> mercounts = collapse_into_counts!(sevenmers)
44-element Array{MerCount{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}},1}:
AAACAAG occurs 1 times
AACAAGA occurs 1 times
AAGAGGG occurs 1 times
AAGATGC occurs 1 times
AAGTTGC occurs 1 times
ACAAACA occurs 1 times
ACAAGAT occurs 1 times
ACAGGCT occurs 1 times
ACTAGGG occurs 1 times
AGAGGGC occurs 1 times
⋮
GATGCCC occurs 1 times
GCCTGTA occurs 1 times
GCTGCAA occurs 1 times
GGACAAA occurs 1 times
GGCTGCA occurs 1 times
GGGACAA occurs 1 times
GTCCCTA occurs 1 times
TAGGCAA occurs 1 times
TCTTGCA occurs 1 timesLooks like every distinct kmer appeared once, with the possible odd exceptions. Well that's to be expected! We did just do this with a single 50bp long random sequence after all!
So for any sequence or input we want to count kmers for, we have to collect_mers and then collapse them into counts with collapse_into_counts!.
For convenience and more terse code, MerCounting provides a constructor for vectors of MerCount that does this for you.
julia> v = Vector{MerCount{DNAMer{7}}}(CANONICAL, s)
ERROR: MethodError: no method matching Array{MerCount{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}},1}(::Canonical, ::BioSequences.LongSequence{BioSequences.DNAAlphabet{4}})
Closest candidates are:
Array{MerCount{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}},1}(!Matched::Nothing, ::Any...) where {T, N} at baseext.jl:31
Array{MerCount{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}},1}(!Matched::Missing, ::Any...) where {T, N} at baseext.jl:32That's about all there is for doing basic kmer counting over individual biological sequences in julia.
Furthur on in this manual we will cover some dedicated kmer counting algorithms MerCounting provides for larger datasets like those contained in reads (using ReadDatastores.jl). We will also cover dedicated types such as kmer frequency spectra, and types useful for projecting kmer coverage and other kmer related metrics over a genome sequence.