Iteration

Most applications of kmers extract multiple kmers from an underlying sequence. To facilitate this, Kmers.jl implements a few basic kmer iterators, most of which are subtypes of AbstractKmerIterator.

The underlying sequence can be a BioSequence, AbstractString, or AbstractVector{UInt8}. In the latter case, if the alphabet of the element type implements BioSequences.AsciiAlphabet, the vector will be treated as a vector of ASCII characters.

As when constructing kmers directly, DNA and RNA are treated interchangeably when the underlying sequence is a BioSequence. When the underlying sequence is a string or byte vector, however, U and T are considered different; for example, an RNA kmer cannot be constructed from a string containing T, nor a DNA kmer from one containing U:

julia> only(FwDNAMers{3}(rna"UGU"))
DNA 3-mer:
TGT

julia> only(FwDNAMers{3}("UGU"))
ERROR:
[...]
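To illustrate the three accepted source types, the sketch below builds the same 3-mers from a BioSequence, a String and a Vector{UInt8}. It assumes both BioSequences and Kmers are loaded; the variable names are illustrative only.

```julia
using BioSequences, Kmers

# The same five nucleotides in each of the three supported source types.
from_bioseq = collect(FwDNAMers{3}(dna"AGCGT"))
from_string = collect(FwDNAMers{3}("AGCGT"))
from_bytes  = collect(FwDNAMers{3}(Vector{UInt8}("AGCGT")))

# All three sources are expected to yield the same kmers: AGC, GCG, CGT.
println(from_bioseq == from_string == from_bytes)
```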

The following kmer iterators are implemented:

FwKmers

The most basic kmer iterator is FwKmers, which simply iterates every kmer, in order:

Kmers.FwKmers (Type)
FwKmers{A <: Alphabet, K, S} <: AbstractKmerIterator{A, K}

Iterator of forward kmers. S signifies the type of the underlying sequence, and the eltype of the iterator is Kmer{A, K, N} with the appropriate N. The elements in a FwKmers{A, K, S}(s::S) correspond to all the Kmer{A, K} in s, in order.

Can be constructed more conveniently with the constructor FwDNAMers{K}(s), and similarly with FwRNAMers and FwAAMers.

Examples:

julia> s = "AGCGTATA";

julia> v = collect(FwDNAMers{3}(s));

julia> v == [DNAKmer{3}(s[i:i+2]) for i in 1:length(s)-2]
true

julia> eltype(v), length(v)
(Kmer{DNAAlphabet{2}, 3, 1}, 6)

julia> collect(FwRNAMers{3}(rna"UGCDUGAVC"))
ERROR: cannot encode D in RNAAlphabet{2}
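A common use of FwKmers is counting kmer occurrences. The following is a minimal sketch of that pattern; count_kmers is a hypothetical helper written for this example, not part of Kmers.jl.

```julia
using Kmers

# Hypothetical helper: tally how often each element of a kmer iterator occurs.
function count_kmers(it)
    counts = Dict{eltype(it), Int}()
    for kmer in it
        counts[kmer] = get(counts, kmer, 0) + 1
    end
    return counts
end

counts = count_kmers(FwDNAMers{2}("AGAGTC"))

# The 2-mers of AGAGTC are AG, GA, AG, GT, TC, so AG occurs twice.
println(counts[DNAKmer{2}("AG")])
```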

FwRvIterator

This iterates over a nucleic acid sequence. For every kmer it encounters, it outputs the kmer and its reverse complement.

Kmers.FwRvIterator (Type)
FwRvIterator{A <: NucleicAcidAlphabet, K, S}

Iterates 2-tuples of (forward, reverse_complement) of every kmer of type Kmer{A, K} from the underlying sequence, in order. S signifies the type of the underlying sequence. This is typically more efficient than iterating over a FwKmers and computing reverse_complement on every element.

See also: FwKmers, CanonicalKmers

Examples:

julia> collect(FwRvIterator{DNAAlphabet{4}, 3}("AGCGT"))
3-element Vector{Tuple{Mer{3, DNAAlphabet{4}, 1}, Mer{3, DNAAlphabet{4}, 1}}}:
 (AGC, GCT)
 (GCG, CGC)
 (CGT, ACG)

julia> collect(FwRvIterator{DNAAlphabet{2}, 3}("AGNGT"))
ERROR: cannot encode 0x4e (Char 'N') in DNAAlphabet{2}
[...]
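The relationship to FwKmers described above can be checked directly. This sketch compares the tuples from FwRvIterator with those built by calling reverse_complement on each forward kmer; it is only a consistency check, not a benchmark of the claimed efficiency difference.

```julia
using BioSequences, Kmers

s = "AGCGT"

# Tuples of (forward, reverse complement) produced by the iterator.
fw_rv = collect(FwRvIterator{DNAAlphabet{2}, 3}(s))

# The same tuples, built by hand from FwKmers.
manual = [(kmer, reverse_complement(kmer)) for kmer in FwDNAMers{3}(s)]

println(fw_rv == manual)
```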

CanonicalKmers

This iterator is similar to FwKmers, but for each kmer encountered, it returns the canonical kmer.

The canonical kmer is defined as the lexicographically smaller of a kmer and its reverse complement. That is, if FwKmers would iterate TCAC, then CanonicalKmers would return GTGA, as this is the reverse complement of TCAC and sorts before TCAC.

CanonicalKmers is useful for summarizing the kmer composition of sequences whose strandedness is unknown.
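The strand independence can be illustrated with a short sketch: a sequence and its reverse complement should produce the same collection of canonical kmers, up to order. The example sequence is arbitrary.

```julia
using BioSequences, Kmers

seq = "TCACGGATT"
# The opposite strand, computed with BioSequences and converted back to a String.
rc_seq = String(reverse_complement(LongDNA{4}(seq)))

# Canonical 4-mers of each strand, sorted so that order does not matter.
fw_canonical = sort!(collect(CanonicalDNAMers{4}(seq)))
rv_canonical = sort!(collect(CanonicalDNAMers{4}(rc_seq)))

println(fw_canonical == rv_canonical)
```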

Kmers.CanonicalKmers (Type)
CanonicalKmers{A <: NucleicAcidAlphabet, K, S} <: AbstractKmerIterator{A, K}

Iterator of canonical nucleic acid kmers. The result of this iterator is equivalent to calling canonical on each value of a FwKmers iterator, but may be more efficient.

Note

When counting small kmers, it may be more efficient to count FwKmers, then call canonical only once per unique kmer.

Can be constructed more conveniently with the constructors CanonicalDNAMers{K}(s) and CanonicalRNAMers{K}(s).

Examples:

julia> collect(CanonicalRNAMers{3}("AGCGA"))
3-element Vector{Kmer{RNAAlphabet{2}, 3, 1}}:
 AGC
 CGC
 CGA
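The note about counting small kmers can be sketched as follows: count forward kmers first, then fold the counts onto their canonical form, calling canonical once per unique kmer. The helpers count_kmers and fold_canonical are hypothetical names for this example, and whether this is actually faster depends on K and the input.

```julia
using BioSequences, Kmers

# Hypothetical helper: tally the elements of any kmer iterator.
function count_kmers(it)
    counts = Dict{eltype(it), Int}()
    for kmer in it
        counts[kmer] = get(counts, kmer, 0) + 1
    end
    return counts
end

# Hypothetical helper: merge forward counts onto canonical kmers,
# calling `canonical` once per unique forward kmer.
function fold_canonical(fw_counts)
    folded = empty(fw_counts)
    for (kmer, n) in fw_counts
        c = canonical(kmer)
        folded[c] = get(folded, c, 0) + n
    end
    return folded
end

s = "AGCTAGCTTGCA"
folded = fold_canonical(count_kmers(FwDNAMers{2}(s)))
direct = count_kmers(CanonicalDNAMers{2}(s))

println(folded == direct)
```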

UnambiguousKmers

UnambiguousKmers iterates kmers of unambiguous nucleotides (that is, kmers of the alphabets DNAAlphabet{2} or RNAAlphabet{2}). Any kmers containing ambiguous nucleotides such as W or N are skipped.

Kmers.UnambiguousKmers (Type)
UnambiguousKmers{A <: TwoBit, K, S}

Iterator of (kmer, index), where kmer is a 2-bit nucleic acid kmer in the underlying sequence, and index::Int is the starting position of the kmer in the sequence. The extracted kmers differ from those of FwKmers in that any kmers containing ambiguous nucleotides are skipped, whereas with FwKmers, encountering an ambiguous nucleotide results in an error.

This iterator can be constructed more conveniently with the constructors UnambiguousDNAMers{K}(s) and UnambiguousRNAMers{K}(s).

Note

To obtain canonical unambiguous kmers, simply call canonical on each kmer output by UnambiguousKmers.

Examples:

julia> it = UnambiguousRNAMers{4}(dna"TGAGCWKCATC");

julia> collect(it)
3-element Vector{Tuple{Kmer{RNAAlphabet{2}, 4, 1}, Int64}}:
 (UGAG, 1)
 (GAGC, 2)
 (CAUC, 8)
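As the note suggests, canonical unambiguous kmers can be obtained by mapping canonical over the iterator. The sketch below does this for the DNA variant while keeping the start positions; the variable names are illustrative.

```julia
using BioSequences, Kmers

it = UnambiguousDNAMers{4}(dna"TGAGCWKCATC")

# Replace each kmer by its canonical form, keeping the start position.
canonical_hits = [(canonical(kmer), index) for (kmer, index) in it]

# TGAG has reverse complement CTCA, which sorts first; GAGC and CATC
# already sort before their reverse complements (GCTC and GATG).
for (kmer, index) in canonical_hits
    println(index, '\t', kmer)
end
```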

SpacedKmers

The SpacedKmers iterator iterates kmers with a fixed step size between kmers. For example, for a K of 4 and a step size of 3, consecutive output kmers overlap by a single nucleotide, like so:

seq: TGATGCGTAGTG
     TGAT
        TGCG
           GTAG

Hence, if FwKmers is analogous to UnitRange, SpacedKmers is analogous to StepRange.
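The range analogy can be made concrete with a small sketch: for the sequence above, the kmers from SpacedDNAMers{4, 3} should match every third kmer from FwDNAMers{4}, just as 1:3:9 selects every third element of 1:9.

```julia
using Kmers

s = "TGATGCGTAGTG"

spaced  = collect(SpacedDNAMers{4, 3}(s))  # K = 4, step size 3
forward = collect(FwDNAMers{4}(s))         # K = 4, implicit step size 1

# Subsampling the forward kmers with a StepRange gives the spaced kmers.
println(spaced == forward[1:3:end])
```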

Kmers.SpacedKmers (Type)
SpacedKmers{A <: Alphabet, K, J, S} <: AbstractKmerIterator{A, K}

Iterator of kmers with step size. J signifies the step size, S the type of the underlying sequence, and the eltype of the iterator is Kmer{A, K, N} with the appropriate N.

For example, a SpacedKmers{AminoAcidAlphabet, 3, 5, Vector{UInt8}} sampling over seq::Vector{UInt8} will sample all kmers corresponding to seq[1:3], seq[6:8], seq[11:13] etc.

See also: each_codon, FwKmers

Examples:

julia> collect(SpacedDNAMers{3, 2}("AGCGTATA"))
3-element Vector{Kmer{DNAAlphabet{2}, 3, 1}}:
 AGC
 CGT
 TAT

The convenience function each_codon returns a SpacedKmers iterator with a K value of 3 and a step size of 3:

Kmers.each_codon (Function)
each_codon(s::BioSequence{<:Union{DNAAlphabet, RNAAlphabet}})
each_codon(::Type{<:Union{DNA, RNA}}, s)

Construct an iterator of nucleotide 3-mers with step size 3 from s. The sequence s may be an RNA or DNA biosequence, in which case the element type is inferred, or the element type may be specified explicitly, in which case s may be a byte-like sequence such as a String or Vector{UInt8}.

This function returns a SpacedKmers iterator.

See also: SpacedKmers

Examples:

julia> collect(each_codon(DNA, "TGACGATCGAC"))
3-element Vector{Kmer{DNAAlphabet{2}, 3, 1}}:
 TGA
 CGA
 TCG
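Since each_codon is described as SpacedKmers with K = 3 and step size 3, the two should agree on the same input. A minimal check, using a string source with the element type given explicitly:

```julia
using BioSequences, Kmers

s = "TGACGATCGAC"

codons = collect(each_codon(DNA, s))
spaced = collect(SpacedDNAMers{3, 3}(s))

# The trailing two nucleotides (AC) do not form a full codon and are dropped.
println(codons == spaced)
```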

The AbstractKmerIterator interface

Users of Kmers.jl will likely need to implement their own custom kmer iterators, in which case they should subtype AbstractKmerIterator.

Kmers.AbstractKmerIterator (Type)
AbstractKmerIterator{A <: Alphabet, K}

Abstract type for kmer iterators. The element type is Kmer{A, K, N}, with the appropriately derived N.

Functions to implement:

  • Base.iterate
  • Base.length or Base.IteratorSize if not HasLength

At the moment, there is no real interface implemented for this abstract type, other than that AbstractKmerIterator{A, K} needs to iterate Kmer{A, K}.
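As a rough sketch of what a custom iterator might look like, the example below wraps FwKmers and yields only DNA kmers that equal their own reverse complement. PalindromicDNAMers is a hypothetical type invented for this illustration, not part of Kmers.jl. Because the number of matches is not known up front, it declares Base.SizeUnknown() instead of implementing Base.length. The sketch assumes that the iteration state of FwKmers can simply be passed through unchanged.

```julia
using BioSequences, Kmers

# Hypothetical custom iterator: forward DNA kmers that are their own
# reverse complement. `I` is the type of the wrapped FwKmers iterator.
struct PalindromicDNAMers{K, I <: FwKmers{DNAAlphabet{2}, K}} <:
       Kmers.AbstractKmerIterator{DNAAlphabet{2}, K}
    inner::I
end

function PalindromicDNAMers{K}(s) where {K}
    inner = FwDNAMers{K}(s)
    return PalindromicDNAMers{K, typeof(inner)}(inner)
end

# The number of palindromic kmers is not known without iterating.
Base.IteratorSize(::Type{<:PalindromicDNAMers}) = Base.SizeUnknown()

# Advance the wrapped iterator until a palindromic kmer is found.
function next_palindrome(it::PalindromicDNAMers, next)
    while next !== nothing
        kmer, state = next
        kmer == reverse_complement(kmer) && return (kmer, state)
        next = iterate(it.inner, state)
    end
    return nothing
end

Base.iterate(it::PalindromicDNAMers) = next_palindrome(it, iterate(it.inner))
Base.iterate(it::PalindromicDNAMers, state) =
    next_palindrome(it, iterate(it.inner, state))

# Of the 4-mers ACGT, CGTT, GTTA, TTAA, TAAC, only ACGT and TTAA qualify.
palindromes = collect(PalindromicDNAMers{4}("ACGTTAAC"))
foreach(println, palindromes)
```

Wrapping an existing iterator keeps the sequence decoding inside Kmers.jl; an iterator that decodes the source itself would need considerably more code.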