Nucleic acid k-mers

A common strategy to simplify the analysis of sequence data is to operate on short k-mers, for some fixed size k. These can be packed into machine integers, allowing extremely efficient code. The BioSequences module has built-in support for representing short sequences in 64-bit integers. Besides being fixed length, Kmer types, unlike other sequence types, cannot contain ambiguous symbols like 'N'.
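
To make the packing idea concrete, here is a minimal sketch in plain Julia that does not use BioSequences at all. The 2-bit code (A=0, C=1, G=2, T=3) and the helper names `pack_kmer` and `unpack_kmer` are assumptions chosen for illustration; they are not necessarily the library's internal layout or API. With 2 bits per base, a 64-bit integer holds at most 32 bases, and there is no bit pattern left over for an ambiguous symbol such as 'N'.

```julia
# Hypothetical 2-bit code for illustration only: A=0, C=1, G=2, T=3.
const NUC_CODE = Dict('A' => UInt64(0), 'C' => UInt64(1),
                      'G' => UInt64(2), 'T' => UInt64(3))
const NUC_CHAR = ('A', 'C', 'G', 'T')

# Pack a DNA string of length k <= 32 into a UInt64,
# with the first base ending up in the highest used bits.
function pack_kmer(s::AbstractString)
    length(s) <= 32 || throw(ArgumentError("k must be at most 32 to fit in 64 bits"))
    x = UInt64(0)
    for c in s
        haskey(NUC_CODE, c) || throw(ArgumentError("ambiguous or invalid symbol: $c"))
        x = (x << 2) | NUC_CODE[c]
    end
    return x
end

# Recover the string from the packed integer. k must be supplied because
# leading A's are indistinguishable from unused zero bits.
function unpack_kmer(x::UInt64, k::Integer)
    return String([NUC_CHAR[Int((x >> (2 * (k - i))) & 0x03) + 1] for i in 1:k])
end

packed = pack_kmer("ACGT")          # bits 00 01 10 11
restored = unpack_kmer(packed, 4)   # "ACGT"
```

Because k is not stored in the integer itself, it has to travel separately, which is one reason it makes sense to carry k as a type parameter.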

The Kmer{T,k} type is parameterized on symbol type (T, either DNA or RNA) and size k. For ease of writing code, a type alias is defined for each nucleotide type, named DNAKmer{k} and RNAKmer{k}:

julia> DNAKmer("ACGT")  # create a DNA 4-mer from a string
DNA 4-mer:
ACGT

julia> RNAKmer("ACGU")  # create an RNA 4-mer from a string
RNA 4-mer:
ACGU

julia> kmer"ACGT" # DNA k-mers may also be written as literals
DNA 4-mer:
ACGT

julia> typeof(DNAKmer("ACGT"))
Kmer{DNA,4}

BioSequences.each — Function.

each(::Type{Kmer{T,k}}, seq::Sequence[, step=1])

Initialize an iterator over all k-mers in a sequence seq, skipping ambiguous nucleotides without changing the reading frame.

Arguments

Kmer{T,k}: k-mer type to enumerate.
seq: a nucleotide sequence.
step=1: the number of positions between iterated k-mers.

Examples

# iterate over DNA codons
for (pos, codon) in each(DNAKmer{3}, dna"ATCCTANAGNTACT", 3)
    @show pos, codon
end
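
The phrase "skipping ambiguous nucleotides without changing the reading frame" can be illustrated with a plain-Julia sketch over an ordinary string. This is one reading of the description, not the library's implementation, and `each_kmer_sketch` is a hypothetical helper: windows start at positions 1, 1+step, 1+2*step, and so on, and any window containing a symbol other than A, C, G, or T is dropped without shifting the windows that follow.

```julia
# Collect (position, k-mer) pairs from a plain string, advancing by `step`
# and dropping any window that contains a symbol other than A, C, G, T.
function each_kmer_sketch(seq::String, k::Integer, step::Integer=1)
    out = Tuple{Int,String}[]
    for pos in 1:step:(length(seq) - k + 1)
        window = seq[pos:pos + k - 1]
        # Skipping never shifts later windows: positions stay on the same grid.
        all(c -> c in "ACGT", window) || continue
        push!(out, (pos, window))
    end
    return out
end

codons = each_kmer_sketch("ATCCTANAGNTACT", 3, 3)
# Windows at 7 ("NAG") and 10 ("NTA") are skipped, leaving
# (1, "ATC") and (4, "CTA"); the frame is never re-synchronised.
```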

BioSequences.canonical — Function.

canonical(kmer::Kmer)

Return the canonical k-mer of kmer.

A canonical k-mer is the numerically lesser of a k-mer and its reverse complement. This is useful when hashing or counting k-mers in data that is not strand specific, where observing a k-mer is equivalent to observing its reverse complement.
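
As a rough illustration of "numerically lesser", the sketch below works on k-mers packed into a UInt64 using a hypothetical 2-bit code (A=0, C=1, G=2, T=3), under which the complement of a base code is simply 3 minus that code. The names `pack`, `revcomp_packed`, and `canonical_packed` are made up for this example and are not part of BioSequences.

```julia
# Hypothetical 2-bit code: A=0, C=1, G=2, T=3, so complement(code) == 3 - code.
# pack assumes its input contains only A, C, G, T.
pack(s::AbstractString) =
    foldl((x, c) -> (x << 2) | UInt64(findfirst(==(c), "ACGT") - 1), s; init=UInt64(0))

# Reverse complement of a packed k-mer: peel bases off the low end of x,
# complement each one, and append it to the low end of the result.
function revcomp_packed(x::UInt64, k::Integer)
    rc = UInt64(0)
    for _ in 1:k
        rc = (rc << 2) | (UInt64(3) - (x & UInt64(3)))
        x >>= 2
    end
    return rc
end

# The canonical form is whichever of the two integers is smaller.
canonical_packed(x::UInt64, k::Integer) = min(x, revcomp_packed(x, k))

gat = pack("GAT")                 # reverse complement is ATC
canon = canonical_packed(gat, 3)  # equals pack("ATC"), the smaller value
```

A k-mer and its reverse complement map to the same canonical value, so a counter keyed on that value merges observations from both strands.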

BioSequences.neighbors — Function.

neighbors(kmer::Kmer)

Return an iterator through k-mers neighboring kmer on a de Bruijn graph.
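
For intuition, in a de Bruijn graph a k-mer is linked to the k-mers that overlap it by k-1 bases. The sketch below computes the four successors of a packed k-mer (drop the first base, append each of A, C, G, T). It assumes the same hypothetical 2-bit code as elsewhere (A=0, C=1, G=2, T=3), and whether the library's neighbors yields exactly this set is an assumption here; `successors_packed` is an illustrative name only.

```julia
# Successors of a packed k-mer: drop the first base by shifting left and
# masking to 2k bits, then append each of the four bases in the low 2 bits.
function successors_packed(x::UInt64, k::Integer)
    mask = (UInt64(1) << (2k)) - UInt64(1)   # keeps the low 2k bits (all ones when k == 32)
    shifted = (x << 2) & mask
    return [shifted | UInt64(nt) for nt in 0:3]
end

acgt = UInt64(0b00011011)          # "ACGT" under the assumed code
succ = successors_packed(acgt, 4)  # CGTA, CGTC, CGTG, CGTT
```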
