Nucleic acid k-mers
A common strategy to simplify the analysis of sequence data is to operate on short k-mers of a fixed size k. These can be packed into machine integers, allowing extremely efficient code. The BioSequences module has built-in support for representing short sequences in 64-bit integers. Besides being fixed length, Kmer types, unlike other sequence types, cannot contain ambiguous symbols like 'N'.
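To make the packing idea concrete, here is a minimal sketch in plain Julia that does not use BioSequences. The function name `pack_kmer` and the two-bit encoding (A=00, C=01, G=10, T=11) are assumptions chosen for illustration; the library's internal layout may differ.

```julia
# Hypothetical helper, not part of BioSequences: pack an unambiguous DNA
# string into a UInt64 using two bits per base (A=00, C=01, G=10, T=11).
function pack_kmer(seq::String)
    length(seq) <= 32 || throw(ArgumentError("at most 32 bases fit in 64 bits"))
    bits = UInt64(0)
    for base in seq
        idx = findfirst(isequal(base), "ACGT")
        # Ambiguous symbols such as 'N' have no two-bit code, so reject them.
        idx === nothing && throw(ArgumentError("ambiguous or invalid symbol: $base"))
        bits = (bits << 2) | UInt64(idx - 1)
    end
    return bits
end

pack_kmer("ACGT")  # 0b00011011, i.e. 0x1b
```

With two bits per base, a 64-bit integer holds at most 32 bases, and a symbol like 'N' has no code at all, which is one way to see why a packed k-mer cannot hold ambiguous symbols.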
The Kmer{T,k} type is parameterized on symbol type (T, either DNA or RNA) and size k. For ease of writing code, two type aliases, one for each nucleotide type, are defined: DNAKmer{k} and RNAKmer{k}:
julia> DNAKmer("ACGT") # create a DNA 4-mer from a string
DNA 4-mer:
ACGT
julia> RNAKmer("ACGU") # create an RNA 4-mer from a string
RNA 4-mer:
ACGU
julia> kmer"ACGT" # DNA k-mers may also be written as literals
DNA 4-mer:
ACGT
julia> typeof(DNAKmer("ACGT"))
Kmer{DNA,4}
BioSequences.each — Function.
each(::Type{Kmer{T,k}}, seq::Sequence[, step=1])
Initialize an iterator over all k-mers in a sequence seq, skipping ambiguous nucleotides without changing the reading frame.
Arguments
Kmer{T,k}: k-mer type to enumerate.
seq: a nucleotide sequence.
step=1: the number of positions between iterated k-mers.
Examples
# iterate over DNA codons
for (pos, codon) in each(DNAKmer{3}, dna"ATCCTANAGNTACT", 3)
    @show pos, codon
end
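The skipping behavior described above can be illustrated with a small sketch on plain strings. This is not the BioSequences implementation: `each_kmer` is a hypothetical name, and treating any symbol other than A, C, G, or T as ambiguous is an assumption made for the example.

```julia
# Hypothetical sketch, not the BioSequences implementation: collect
# (position, k-mer) pairs from a plain string, skipping any window that
# contains a symbol other than A, C, G, or T.
function each_kmer(seq::String, k::Integer, step::Integer=1)
    result = Tuple{Int,String}[]
    for pos in 1:step:(length(seq) - k + 1)
        window = seq[pos:pos+k-1]
        # Skip ambiguous windows but keep advancing by `step`,
        # so later k-mers stay in the same reading frame.
        all(c -> c in "ACGT", window) || continue
        push!(result, (pos, window))
    end
    return result
end

each_kmer("ATCCTANAGNTACT", 3, 3)  # [(1, "ATC"), (4, "CTA")]
```

In this sketch the windows starting at positions 7 and 10 each contain an 'N' and are dropped, while the start positions themselves still advance in steps of three.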
BioSequences.canonical — Function.
canonical(kmer::Kmer)
Return the canonical k-mer of kmer.
A canonical k-mer is the numerically lesser of a k-mer and its reverse complement. This is useful when hashing or counting k-mers in data that is not strand specific, where observing a k-mer is equivalent to observing its reverse complement.
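The definition can be sketched on two-bit packed integers as follows. The helper names `revcomp_bits` and `canonical_bits` are hypothetical, and the encoding (A=00, C=01, G=10, T=11) is an assumption for illustration rather than a statement about how the library stores k-mers.

```julia
# Hypothetical helpers on 2-bit packed k-mers (A=00, C=01, G=10, T=11),
# where the first base occupies the most significant bit pair.
function revcomp_bits(x::UInt64, k::Integer)
    rc = UInt64(0)
    for _ in 1:k
        # Under this encoding, complementing a base maps code c to 3 - c
        # (A<->T, C<->G). Reading bases from the low end reverses the order.
        rc = (rc << 2) | (UInt64(3) - (x & 0x03))
        x >>= 2
    end
    return rc
end

# The canonical form is the numerically smaller of the two encodings.
canonical_bits(x::UInt64, k::Integer) = min(x, revcomp_bits(x, k))

canonical_bits(UInt64(0x2f), 3)  # GTT (0x2f) and AAC (0x01) both map to 0x01
```

Because a k-mer and its reverse complement yield the same canonical value, counts keyed on that value merge observations from both strands.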
BioSequences.neighbors — Function.
neighbors(kmer::Kmer)
Return an iterator through k-mers neighboring kmer on a de Bruijn graph.
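As a rough illustration of what neighboring means on a de Bruijn graph, the sketch below lists the successors of a k-mer written as a plain string. It assumes that neighbors are the k-mers reached by dropping the first base and appending one new base; `successors` is a hypothetical name, and whether the library's iterator yields exactly this set, in this order, is not confirmed here.

```julia
# Hypothetical sketch: successors of a k-mer string on a de Bruijn graph.
# Each successor drops the first base and appends one of A, C, G, T,
# so it overlaps the original k-mer by k-1 bases.
successors(kmer::String) = [kmer[2:end] * base for base in "ACGT"]

successors("ACG")  # ["CGA", "CGC", "CGG", "CGT"]
```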