Basic types and basic kmer counting

MerCounting has a few basic types and methods to allow you do easily do some basic kmer counting.

The first and perhaps most obvious of these is a type to represent a kmer and its frequency, for this, MerCounting provides the MerCount type.

If you have a MerCount variable you can get the kmer value or the frequency value using the mer and freq getter methods.

A vector of these MerCounts constitute one option for a simple data structure for representing kmer frequency data. This packages defines other more dedicated types, but we will get to those in later sections of this manual.

Ok, so let's do some very basic kmer counting for a sequence!

First we need a sequence:

julia> using KmerAnalysis

julia> using BioSequences

julia> s = randdnaseq(50)
50nt DNA Sequence:
GAAGTTGCCTATGACTAGGGACAAACAAGATGCCCTCTTGCAGCCTGTAC

Ok, let's say we wanted to count the 7-mers, we can collect all the 7-mers using BioSequences' kmer iterator:

julia> sevenmers = collect(each(DNAMer{7}, s))
44-element Array{BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}},1}:
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(1, GAAGTTG, CAACTTC)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(2, AAGTTGC, GCAACTT)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(3, AGTTGCC, GGCAACT)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(4, GTTGCCT, AGGCAAC)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(5, TTGCCTA, TAGGCAA)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(6, TGCCTAT, ATAGGCA)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(7, GCCTATG, CATAGGC)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(8, CCTATGA, TCATAGG)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(9, CTATGAC, GTCATAG)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(10, TATGACT, AGTCATA)
 ⋮
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(36, TCTTGCA, TGCAAGA)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(37, CTTGCAG, CTGCAAG)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(38, TTGCAGC, GCTGCAA)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(39, TGCAGCC, GGCTGCA)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(40, GCAGCCT, AGGCTGC)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(41, CAGCCTG, CAGGCTG)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(42, AGCCTGT, ACAGGCT)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(43, GCCTGTA, TACAGGC)
 BioSequences.MerIterResult{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}}(44, CCTGTAC, GTACAGG)

julia> sevenmers[1]
Mer iteration result:
Position: 1
Forward: GAAGTTG
Backward: CAACTTC

Ok BioSequences' kmer iterator on its own yields a MerIterResult, which contains a tuple of (Base position in sequence, kmer, reverse complement of the kmer).

Now let's say I was only interested in counting canonical kmers in s. For any kmer and its reverse complement, the canonical version is the one that is lexicographically less.

So let's revise our collect above to give us a vector of the canonical kmers. Thankfully, BioSequences comes with a method called canonical, for exactly this purpose.

julia> sevenmers = collect(canonical(x) for x in each(DNAMer{7}, s))
44-element Array{BioSequences.Mer{BioSequences.DNAAlphabet{2},7},1}:
 CAACTTC
 AAGTTGC
 AGTTGCC
 AGGCAAC
 TAGGCAA
 ATAGGCA
 CATAGGC
 CCTATGA
 CTATGAC
 AGTCATA
 ⋮
 TCTTGCA
 CTGCAAG
 GCTGCAA
 GGCTGCA
 AGGCTGC
 CAGCCTG
 ACAGGCT
 GCCTGTA
 CCTGTAC

If you wanted to collect all the kmers in s "as is" as it were, you can use the fwmer method provided with BioSequences for this purpose.

julia> collect(fwmer(x) for x in each(DNAMer{7}, s))
44-element Array{BioSequences.Mer{BioSequences.DNAAlphabet{2},7},1}:
 GAAGTTG
 AAGTTGC
 AGTTGCC
 GTTGCCT
 TTGCCTA
 TGCCTAT
 GCCTATG
 CCTATGA
 CTATGAC
 TATGACT
 ⋮
 TCTTGCA
 CTTGCAG
 TTGCAGC
 TGCAGCC
 GCAGCCT
 CAGCCTG
 AGCCTGT
 GCCTGTA
 CCTGTAC

This is quite nice and terse code, but we haven't used KmerAnalysis to help us at all yet. MerCounting allows us to achieve the above with the use of a single function called collect_mers, instead of collecting a generator:

julia> sevenmers = collect_mers(CANONICAL, each(DNAMer{7}, s))
ERROR: MethodError: no method matching collect_mers(::Canonical, ::BioSequences.EveryMerIterator{BioSequences.Mer{BioSequences.DNAAlphabet{2},7},BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}})

There are a few different methods of collect_mers for different types of input, check out the API reference for more info.

MerCounting exports CANONICAL and NONCANONICAL, which work with collect_mers and other MerCounting functions to dictate how the kmers are produced.

Ok so we have our canonical kmers, how can we count them? Well one of the ways to count the number of distinct values in a vector is to sort it and traverse it in one go. The collapse_into_counts! methods do this for you.

Note

collapse_into_counts! will sort the input sevenmers vector in place.

julia> mercounts = collapse_into_counts!(sevenmers)
44-element Array{MerCount{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}},1}:
 AAACAAG occurs 1 times
 AACAAGA occurs 1 times
 AAGAGGG occurs 1 times
 AAGATGC occurs 1 times
 AAGTTGC occurs 1 times
 ACAAACA occurs 1 times
 ACAAGAT occurs 1 times
 ACAGGCT occurs 1 times
 ACTAGGG occurs 1 times
 AGAGGGC occurs 1 times
 ⋮
 GATGCCC occurs 1 times
 GCCTGTA occurs 1 times
 GCTGCAA occurs 1 times
 GGACAAA occurs 1 times
 GGCTGCA occurs 1 times
 GGGACAA occurs 1 times
 GTCCCTA occurs 1 times
 TAGGCAA occurs 1 times
 TCTTGCA occurs 1 times

Looks like every distinct kmer appeared once, with the possible odd exceptions. Well that's to be expected! We did just do this with a single 50bp long random sequence after all!

So for any sequence or input we want to count kmers for, we have to collect_mers and then collapse them into counts with collapse_into_counts!.

For convenience and more terse code, MerCounting provides a constructor for vectors of MerCount that does this for you.

julia> v = Vector{MerCount{DNAMer{7}}}(CANONICAL, s)
ERROR: MethodError: no method matching Array{MerCount{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}},1}(::Canonical, ::BioSequences.LongSequence{BioSequences.DNAAlphabet{4}})
Closest candidates are:
  Array{MerCount{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}},1}(!Matched::Nothing, ::Any...) where {T, N} at baseext.jl:31
  Array{MerCount{BioSequences.Mer{BioSequences.DNAAlphabet{2},7}},1}(!Matched::Missing, ::Any...) where {T, N} at baseext.jl:32

That's about all there is for doing basic kmer counting over individual biological sequences in julia.

Furthur on in this manual we will cover some dedicated kmer counting algorithms MerCounting provides for larger datasets like those contained in reads (using ReadDatastores.jl). We will also cover dedicated types such as kmer frequency spectra, and types useful for projecting kmer coverage and other kmer related metrics over a genome sequence.