Kmer types

Bioinformatic analyses make extensive use of kmers. Kmers are contiguous sub-strings of k nucleotides of some reference sequence.

Kmers serve as an informational unit, a concept popularised by short read assemblers. Analyses within the kmer space benefit from a simple formulation of the sampling problem and direct in-hash comparisons.
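The definition of a kmer as a contiguous sub-string of k nucleotides can be sketched in a few lines. This is a plain Python illustration using strings, not the BioSequences API; the function name `kmers` is hypothetical.

```python
def kmers(seq: str, k: int) -> list:
    """Return every contiguous sub-string of length k, from left to right."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length L yields L - k + 1 kmers (none when L < k).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


print(kmers("ACGTAC", 3))  # ['ACG', 'CGT', 'GTA', 'TAC']
```

Consecutive kmers overlap by k - 1 bases, which is why a read produces almost as many kmers as it has nucleotides.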

BioSequences provides the following types to represent Kmers.

Kmers.Kmer (Type)
Kmer{A<:Alphabet,K,N} <: BioSequence{A}

A parametric, immutable bitstype for representing kmers (short sequences). Given the number of kmers generated from raw sequencing reads, avoiding repetitive memory allocation and the triggering of garbage collection is important, as is the ability to pack kmers efficiently into arrays and similar collections.

In practice that means we use an immutable bitstype as the internal representation of these sequences. Thankfully, this is not much of a limitation: kmers are rarely manipulated, so by and large they do not need to be mutable.

Excepting their immutability, they fulfill the rest of the API and behaviours expected from a concrete BioSequence type, and non-mutating transformations of the type are still defined.

Warning

Given their immutability, setindex! and mutating sequence transformations such as reverse_complement! are not implemented for Kmers.

Tip

Non-mutating sequence transformations such as reverse_complement are still available, since they return a new kmer value as their result.


The following aliases are also defined:

Skipmers

For some analyses, the contiguous nature of kmers imposes limitations. A single base difference, whether due to real biological variation or a sequencing error, affects every kmer crossing that position, which impedes direct analyses by identity. Also, given the strong interdependence of local sequence, contiguous sections capture less information about genome structure, and so they are more affected by sequence repetition.
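The effect of a single substitution on contiguous kmers can be checked with a small sketch. This is plain Python on strings, not library code; `kmers` and `count_changed_kmers` are hypothetical helper names, and the two sequences are made up for illustration.

```python
def kmers(seq: str, k: int) -> list:
    """Return every contiguous sub-string of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_changed_kmers(ref: str, alt: str, k: int) -> int:
    """Count kmer positions at which two equal-length sequences differ."""
    if len(ref) != len(alt):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(kmers(ref, k), kmers(alt, k)))


ref = "ACGTACGTAC"
alt = "ACGTTCGTAC"  # one substitution at index 4 (A -> T)

print(count_changed_kmers(ref, alt, 3))  # 3
print(count_changed_kmers(ref, alt, 5))  # 5
```

Away from the ends of the sequence, one substitution changes k kmers, so larger values of k lose more exact matches per difference.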

Skipmers are a generalisation of the concept of a kmer. They are created using a cyclic pattern of used-and-skipped positions which achieves increased entropy and tolerance to nucleotide substitution differences by following some simple rules.

Skipmers preserve many of the elegant properties of kmers such as reverse complementability and existence of a canonical representation. Also, using cycles of three greatly increases the power of direct intersection between the genomes of different organisms by grouping together the more conserved nucleotides of protein-coding regions.

BioSequences currently does not provide a separate type for skipmers. They are represented using Mer and BigMer, because their representation is the same: a short immutable sequence encoded in an unsigned integer. The distinction lies in how they are generated.
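To make "encoded in an unsigned integer" concrete, here is a minimal Python sketch of packing a short DNA sequence at two bits per base. The bit assignment (A=00, C=01, G=10, T=11) and the names `pack` and `unpack` are assumptions made for illustration; they are not taken from the BioSequences source.

```python
# Assumed 2-bit code for illustration only.
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODING = "ACGT"


def pack(mer: str) -> int:
    """Pack a DNA string into an integer, first base in the highest bits."""
    value = 0
    for base in mer:
        value = (value << 2) | ENCODING[base]
    return value


def unpack(value: int, k: int) -> str:
    """Recover the k bases stored in value by pack."""
    shifts = range(2 * (k - 1), -1, -2)
    return "".join(DECODING[(value >> shift) & 0b11] for shift in shifts)


packed = pack("ACGT")
print(packed)             # 27, i.e. 0b00011011
print(unpack(packed, 4))  # ACGT
```

At two bits per base, a 64-bit word holds up to 32 bases and a 128-bit word up to 64. A skipmer of k bases packs in exactly the same way as a kmer of k bases, which is why one representation can serve both.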

Skipmer generation

A skipmer is a simple cyclic q-gram that includes m out of every n bases until a total of k bases is reached.

This is illustrated in the figure below (from this paper):

skipmer-fig

To maintain cyclic properties and the existence of the reverse-complement as a skipmer defined by the same function, k should be a multiple of m.
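The generation rule can be sketched as follows. This is plain Python on strings rather than the BioSequences iterator; the function name `skipmer` and its argument order are hypothetical, and it assumes that each cycle of n bases keeps its first m bases and skips the rest.

```python
def skipmer(seq: str, start: int, m: int, n: int, k: int) -> str:
    """Build one skipmer from seq, beginning at 0-based index start."""
    if not 0 < m <= n:
        raise ValueError("need 0 < m <= n")
    if k <= 0 or k % m != 0:
        raise ValueError("k must be a positive multiple of m")
    cycles = k // m
    # One past the last used base: (cycles - 1) full cycles, then m used bases.
    end = start + n * (cycles - 1) + m
    if start < 0 or end > len(seq):
        raise ValueError("skipmer does not fit in the sequence")
    # Assumption: each cycle keeps its first m bases and skips the other n - m.
    return "".join(seq[start + c * n:start + c * n + m] for c in range(cycles))


seq = "ACGTACGTACGT"
print(skipmer(seq, 0, 2, 3, 4))  # ACTA   (uses AC, skips G, uses TA)
print(skipmer(seq, 1, 2, 3, 6))  # CGACTA
print(skipmer(seq, 0, 1, 1, 3))  # ACG    (m = n = 1 gives an ordinary kmer)
```

The check on k reflects the requirement that k be a multiple of m, so that every skipmer ends on a complete block of m used bases.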

This also enables the existence of a canonical representation for each skipmer, defined as the lexicographically smaller of the forward and reverse-complement representations.
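A minimal sketch of the canonical form, written in plain Python on strings: the names `reverse_complement` and `canonical` here are illustrative helpers, not the library functions, and lexicographic order is taken as A < C < G < T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(mer: str) -> str:
    """Return a new string: complement each base, then reverse the order."""
    return mer.translate(COMPLEMENT)[::-1]


def canonical(mer: str) -> str:
    """Return the lexicographically smaller of mer and its reverse complement."""
    return min(mer, reverse_complement(mer))


print(reverse_complement("TTAC"))  # GTAA
print(canonical("TTAC"))           # GTAA
print(canonical("ACGG"))           # ACGG (already smaller than CCGT)
```

Both strands of the same locus map to the same canonical value, so counts and set intersections do not depend on which strand was read.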

Defining m, n and k fixes a value for S, the total span of the skipmer, given by:

\[S = n \left(\frac{k}{m} - 1\right) + m\]
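As a worked check of the span formula, the short Python sketch below evaluates S for a few parameter choices. The function name `skipmer_span` is hypothetical, and the parameter values are chosen only for illustration.

```python
def skipmer_span(m: int, n: int, k: int) -> int:
    """Total span S covered by a skipmer: S = n * (k / m - 1) + m."""
    if not 0 < m <= n:
        raise ValueError("need 0 < m <= n")
    if k <= 0 or k % m != 0:
        raise ValueError("k must be a positive multiple of m")
    return n * (k // m - 1) + m


print(skipmer_span(1, 1, 31))  # 31: with m = n = 1 the span equals k
print(skipmer_span(2, 3, 14))  # 20: 3 * (14 / 2 - 1) + 2
print(skipmer_span(1, 3, 10))  # 28: 3 * (10 / 1 - 1) + 1
```

The span is never smaller than k, and for a fixed k it grows as the proportion of skipped bases increases.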

To see how iterating over skipmers compares with iterating over kmers, see the Iteration section of the manual.