Kmer types
Kmers are contiguous substrings of k nucleotides taken from some reference sequence.
They are used extensively in bioinformatic analyses as an informational unit, a concept popularised by short-read assemblers. Analyses within the kmer space benefit from a simple formulation of the sampling problem and direct in-hash comparisons.
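To make the definition concrete, here is a minimal sketch in plain Python (not the Kmers API) that lists every kmer of a sequence. The function name `kmers` is chosen for illustration only.

```python
def kmers(seq: str, k: int) -> list[str]:
    """Return every contiguous substring of length k, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length L has L - k + 1 kmers (none if L < k).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


print(kmers("GATTACA", 3))  # ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
```

A sequence of length L yields L - k + 1 overlapping kmers, which is why read sets produce so many of them.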
BioSequences provides the following types to represent Kmers.
Kmers.Kmer (type)
Kmer{A<:Alphabet,K,N} <: BioSequence{A}
A parametric, immutable bits type for representing kmers, that is, short sequences. Given the number of kmers generated from raw sequencing reads, avoiding repetitive memory allocation and the triggering of garbage collection is important, as is the ability to pack kmers efficiently into arrays and similar collections.
In practice that means we use an immutable bits type as the internal representation of these sequences. Thankfully, this is not much of a limitation: kmers are rarely manipulated, so by and large they don't have to be mutable.
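The idea of packing a short sequence into a plain integer can be sketched in a few lines of Python. This is only an illustration of the concept, not the package's implementation: the two-bit codes (A=0, C=1, G=2, T=3) and the helper names `pack` and `unpack` are assumptions made for the example.

```python
# Assumed two-bit encoding for illustration: A=00, C=01, G=10, T=11.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def pack(seq: str) -> int:
    """Pack a DNA string into an integer, two bits per base, first base highest."""
    value = 0
    for base in seq:
        value = (value << 2) | ENCODE[base]
    return value


def unpack(value: int, k: int) -> str:
    """Recover the k-base string from its packed integer form."""
    return "".join(DECODE[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))


packed = pack("ACGT")
print(packed)             # 27, i.e. 0b00011011
print(unpack(packed, 4))  # ACGT
```

Because the packed value is a plain number, storing many of them in an array needs no per-element heap allocation, which is the property the paragraph above is after.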
Excepting their immutability, they fulfill the rest of the API and behaviours expected from a concrete BioSequence type, and non-mutating transformations of the type are still defined.
Given their immutability, setindex! and mutating sequence transformations such as reverse_complement! are not implemented for Kmers.
Note that sequence transformations that do not mutate, such as reverse_complement, are available, since they can return a new kmer value as a result.
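The difference between a mutating and a non-mutating transformation can be shown with Python strings, which are also immutable. This is a conceptual sketch only; the function below is a plain-Python stand-in and not the package's reverse_complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return a new string; the input value is left untouched."""
    return kmer.translate(COMPLEMENT)[::-1]


original = "GATTC"
result = reverse_complement(original)
print(original, result)  # GATTC GAATC

# In-place modification of an immutable value is rejected.
try:
    original[0] = "A"
except TypeError:
    print("in-place assignment is not supported")
```

The non-mutating form works because it builds and returns a fresh value, which is the pattern that remains available for an immutable kmer type.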
The following aliases are also defined:
Kmers.DNAKmer (type): shortcut for the type Kmer{DNAAlphabet{2},K,N}
Kmers.DNA27mer (type): shortcut for the type DNAKmer{27,1}
Kmers.DNA31mer (type): shortcut for the type DNAKmer{31,1}
Kmers.DNA63mer (type): shortcut for the type DNAKmer{63,2}
Kmers.RNAKmer (type): shortcut for the type Kmer{RNAAlphabet{2},K,N}
Kmers.RNA27mer (type): shortcut for the type RNAKmer{27,1}
Kmers.RNA31mer (type): shortcut for the type RNAKmer{31,1}
Kmers.RNA63mer (type): shortcut for the type RNAKmer{63,2}
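The aliases pair each K with an N of 1 or 2. One reading that fits these values, offered here as an inference and not something this section states, is that N counts the 64-bit words needed to hold K bases at two bits each. The Python sketch below checks that reading against the aliases; `words_needed` is a hypothetical helper.

```python
def words_needed(k: int, bits_per_base: int = 2, word_bits: int = 64) -> int:
    """Smallest number of words that can hold k bases (ceiling division)."""
    total_bits = k * bits_per_base
    return -(-total_bits // word_bits)


for k in (27, 31, 63):
    print(k, words_needed(k))
# 27 1
# 31 1
# 63 2
```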
Skipmers
For some analyses, the contiguous nature of kmers imposes limitations. A single base difference, due to real biological variation or a sequencing error, affects every kmer that crosses that position, impeding direct analyses by identity. Also, given the strong interdependence of local sequence, contiguous sections capture less information about genome structure, and so they are more affected by sequence repetition.
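The effect of a single substitution is easy to quantify with a small Python sketch (illustrative only; the sequences and helper names are made up for the example). One changed base away from the sequence ends alters k consecutive kmers.

```python
def kmers(seq: str, k: int) -> list[str]:
    """Return every contiguous substring of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def changed_kmers(a: str, b: str, k: int) -> int:
    """Count kmer positions that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(kmers(a, k), kmers(b, k)))


reference = "ACGTTGCAAGCT"
variant = "ACGTTACAAGCT"  # one substitution at index 5 (G -> A)

print(changed_kmers(reference, variant, 4))  # 4
```

Here 4 of the 9 kmers no longer match by identity, even though only one base in twelve differs.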
Skipmers are a generalisation of the concept of a kmer. They are created using a cyclic pattern of used-and-skipped positions which achieves increased entropy and tolerance to nucleotide substitution differences by following some simple rules.
Skipmers preserve many of the elegant properties of kmers such as reverse complementability and existence of a canonical representation. Also, using cycles of three greatly increases the power of direct intersection between the genomes of different organisms by grouping together the more conserved nucleotides of protein-coding regions.
BioSequences currently does not provide a separate type for skipmers. They are represented using Mer and BigMer, as their representation as a short immutable sequence encoded in an unsigned integer is the same. The distinction lies in how they are generated.
Skipmer generation
A skipmer is a simple cyclic q-gram that includes m out of every n bases until a total of k bases is reached.
This is illustrated in the figure below (from this paper):
To maintain cyclic properties and the existence of the reverse-complement as a skipmer defined by the same function, k should be a multiple of m.
This also enables the existence of a canonical representation for each skipmer, defined as the lexicographically smaller of the forward and reverse-complement representations.
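The generation rule and the canonical form can be sketched in plain Python. This is a conceptual illustration, not the package's iterator: it assumes the used bases are the first m of each n-base cycle, and the names `skipmer_at` and `canonical` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def skipmer_at(seq: str, start: int, m: int, n: int, k: int) -> str:
    """Collect the first m bases of each n-base cycle until k bases are taken."""
    if not 0 < m <= n:
        raise ValueError("need 0 < m <= n")
    if k % m != 0:
        raise ValueError("k should be a multiple of m")
    cycles = k // m
    span = n * (cycles - 1) + m  # positions covered, from first to last used base
    if start < 0 or start + span > len(seq):
        raise ValueError("skipmer does not fit in the sequence")
    return "".join(seq[start + c * n:start + c * n + m] for c in range(cycles))


def canonical(skipmer: str) -> str:
    """Lexicographically smaller of the skipmer and its reverse complement."""
    return min(skipmer, reverse_complement(skipmer))


seq = "GATTACAGATTACA"
print(skipmer_at(seq, 0, 2, 3, 4))  # GATA (uses GA, skips T, uses TA)
print(skipmer_at(seq, 2, 2, 3, 4))  # TTCA
print(canonical("TTCA"))            # TGAA
```

With k a multiple of m, the reverse complement of a skipmer is the skipmer read from the reverse-complemented window, so both strands give the same canonical value.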
Defining m, n and k fixes a value for S, the total span of the skipmer, given by:
$$S = n \left( \frac{k}{m} - 1 \right) + m$$
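The span can be checked by counting cycles: k/m cycles are needed, every cycle except the last advances n positions, and the last cycle contributes only its m used bases. A minimal Python sketch of that arithmetic, with `skipmer_span` as a hypothetical helper name:

```python
def skipmer_span(m: int, n: int, k: int) -> int:
    """Span covered by a skipmer that uses m of every n bases, k bases in total."""
    if k % m != 0:
        raise ValueError("k should be a multiple of m")
    return n * (k // m - 1) + m


print(skipmer_span(1, 1, 31))  # 31, an ordinary kmer spans exactly k bases
print(skipmer_span(2, 3, 20))  # 29
print(skipmer_span(1, 3, 10))  # 28
```

When m equals n nothing is skipped and the span reduces to k, which recovers the ordinary kmer as a special case.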
To see how iterating over skipmers compares with iterating over kmers, see the Iteration section of the manual.