Iteration
Most applications of kmers extract multiple kmers from an underlying sequence. To facilitate this, Kmers.jl implements a few basic kmer iterators, most of which are subtypes of AbstractKmerIterator.
The underlying sequence can be a BioSequence, AbstractString, or AbstractVector{UInt8}. In the latter case, if the alphabet of the element type implements BioSequences.AsciiAlphabet, the vector will be treated as a vector of ASCII characters.
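As a small illustration of the byte vector case, the sketch below builds the same iterator from a String and from a Vector{UInt8} holding the same ASCII bytes. It assumes that both BioSequences and Kmers are loaded; the variable names are illustrative only.

```julia
using BioSequences, Kmers

s = "AGCGTATA"
bytes = Vector{UInt8}(s)  # the ASCII bytes of s

# Build the same 3-mer iterator from two different source types
from_string = collect(FwDNAMers{3}(s))
from_bytes = collect(FwDNAMers{3}(bytes))

# Both sources are expected to yield the same kmers, in the same order
same_kmers = from_string == from_bytes
```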
As when constructing kmers directly, DNA and RNA are treated interchangeably when the underlying sequence is a BioSequence. When the underlying sequence is a string or byte vector, however, U and T are considered different, so e.g. a DNA kmer cannot be constructed from a string containing U:
julia> only(FwDNAMers{3}(rna"UGU"))
DNA 3-mer:
TGT
julia> only(FwDNAMers{3}("UGU"))
ERROR:
[...]
The following kmer iterators are implemented:
FwKmers
The most basic kmer iterator is FwKmers, which simply iterates every kmer, in order:
Kmers.FwKmers — Type
FwKmers{A <: Alphabet, K, S} <: AbstractKmerIterator{A, K}
Iterator of forward kmers. S signifies the type of the underlying sequence, and the eltype of the iterator is Kmer{A, K, N} with the appropriate N. The elements in a FwKmers{A, K, S}(s::S) correspond to all the Kmer{A, K} in s, in order.
Can be constructed more conveniently with the constructor FwDNAMers{K}(s), and similarly with FwRNAMers and FwAAMers.
Examples:
julia> s = "AGCGTATA";
julia> v = collect(FwDNAMers{3}(s));
julia> v == [DNAKmer{3}(s[i:i+2]) for i in 1:length(s)-2]
true
julia> eltype(v), length(v)
(Kmer{DNAAlphabet{2}, 3, 1}, 6)
julia> collect(FwRNAMers{3}(rna"UGCDUGAVC"))
ERROR: cannot encode D in RNAAlphabet{2}
Kmers.FwDNAMers — Type
FwDNAMers{K, S}: Alias for FwKmers{DNAAlphabet{2}, K, S}
Kmers.FwRNAMers — Type
FwRNAMers{K, S}: Alias for FwKmers{RNAAlphabet{2}, K, S}
Kmers.FwAAMers — Type
FwAAMers{K, S}: Alias for FwKmers{AminoAcidAlphabet, K, S}
FwRvIterator
FwRvIterator iterates over a nucleic acid sequence. For every kmer it encounters, it yields the kmer and its reverse complement.
Kmers.FwRvIterator — Type
FwRvIterator{A <: NucleicAcidAlphabet, K, S}
Iterates 2-tuples of (forward, reverse_complement) of every kmer of type Kmer{A, K} from the underlying sequence, in order. S signifies the type of the underlying sequence. This is typically more efficient than iterating over a FwKmers and computing reverse_complement on every element.
See also: FwKmers, CanonicalKmers
Examples:
julia> collect(FwRvIterator{DNAAlphabet{4}, 3}("AGCGT"))
3-element Vector{Tuple{Kmer{DNAAlphabet{4}, 3, 1}, Kmer{DNAAlphabet{4}, 3, 1}}}:
(AGC, GCT)
(GCG, CGC)
(CGT, ACG)
julia> collect(FwRvIterator{DNAAlphabet{2}, 3}("AGNGT"))
ERROR: cannot encode 0x4e (Char 'N') in DNAAlphabet{2}
[...]
CanonicalKmers
This iterator is similar to FwKmers. However, for each kmer encountered, it returns the canonical kmer.
The canonical kmer is defined as the lexicographically smaller of a kmer and its reverse complement. That is, if FwKmers would iterate TCAC, then CanonicalKmers would return GTGA, as this is the reverse complement of TCAC and sorts before TCAC alphabetically.
CanonicalKmers is useful for summarizing the kmer composition of sequences whose strandedness is unknown.
Kmers.CanonicalKmers — Type
CanonicalKmers{A <: NucleicAcidAlphabet, K, S} <: AbstractKmerIterator{A, K}
Iterator of canonical nucleic acid kmers. The result of this iterator is equivalent to calling canonical on each value of a FwKmers iterator, but may be more efficient.
When counting small kmers, it may be more efficient to count FwKmers, then call canonical only once per unique kmer.
Can be constructed more conveniently with the constructors CanonicalDNAMers{K}(s) and CanonicalRNAMers{K}(s).
Examples:
julia> collect(CanonicalRNAMers{3}("AGCGA"))
3-element Vector{Kmer{RNAAlphabet{2}, 3, 1}}:
AGC
CGC
CGA
Kmers.CanonicalDNAMers — Type
CanonicalDNAMers{K, S}: Alias for CanonicalKmers{DNAAlphabet{2}, K, S}
Kmers.CanonicalRNAMers — Type
CanonicalRNAMers{K, S}: Alias for CanonicalKmers{RNAAlphabet{2}, K, S}
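The remark on counting small kmers can be sketched as follows. This is a minimal illustration and not a benchmarked recipe: it counts forward kmers in a Dict, then calls canonical once per unique kmer to fold the counts. It assumes that both BioSequences and Kmers are loaded; the input string and the variable names are illustrative.

```julia
using BioSequences, Kmers

s = "AGCGTATAGCTAGCT"
it = FwDNAMers{3}(s)

# Step 1: count every forward 3-mer
fw_counts = Dict{eltype(it), Int}()
for kmer in it
    fw_counts[kmer] = get(fw_counts, kmer, 0) + 1
end

# Step 2: fold the counts, calling canonical once per unique kmer
canon_counts = Dict{eltype(it), Int}()
for (kmer, n) in fw_counts
    c = canonical(kmer)
    canon_counts[c] = get(canon_counts, c, 0) + n
end
```

The folded counts should match what is obtained by counting the output of CanonicalDNAMers{3}(s) directly. Whether the two-step approach is faster depends on K and on the length of the sequence.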
UnambiguousKmers
UnambiguousKmers iterates kmers of unambiguous nucleotides (that is, kmers of the alphabets DNAAlphabet{2} or RNAAlphabet{2}). Any kmers containing ambiguous nucleotides such as W or N are skipped.
Kmers.UnambiguousKmers — Type
UnambiguousKmers{A <: TwoBit, K, S}
Iterator of (kmer, index), where kmer is a 2-bit nucleic acid kmer in the underlying sequence, and index::Int is the starting position of the kmer in the sequence. The extracted kmers differ from those of FwKmers in that any kmers containing ambiguous nucleotides are skipped, whereas with FwKmers, encountering an ambiguous nucleotide results in an error.
This iterator can be constructed more conveniently with the constructors UnambiguousDNAMers{K}(s) and UnambiguousRNAMers{K}(s).
To obtain canonical unambiguous kmers, simply call canonical on each kmer output by UnambiguousKmers.
Examples:
julia> it = UnambiguousRNAMers{4}(dna"TGAGCWKCATC");
julia> collect(it)
3-element Vector{Tuple{Kmer{RNAAlphabet{2}, 4, 1}, Int64}}:
(UGAG, 1)
(GAGC, 2)
(CAUC, 8)
Kmers.UnambiguousDNAMers — Type
UnambiguousDNAMers{K, S}: Alias for UnambiguousKmers{DNAAlphabet{2}, K, S}
Kmers.UnambiguousRNAMers — Type
UnambiguousRNAMers{K, S}: Alias for UnambiguousKmers{RNAAlphabet{2}, K, S}
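Obtaining canonical unambiguous kmers, as described above, might look like the following sketch. It reuses the sequence from the example and keeps the starting position next to each canonical kmer. It assumes that both BioSequences and Kmers are loaded; the variable names are illustrative.

```julia
using BioSequences, Kmers

it = UnambiguousDNAMers{4}(dna"TGAGCWKCATC")

# Each element is a (kmer, position) tuple. Canonicalize the kmer
# and keep the starting position unchanged.
canon = [(canonical(kmer), pos) for (kmer, pos) in it]

# Starting positions of the kmers that were not skipped
positions = [pos for (_, pos) in canon]
```

Kmers overlapping the ambiguous W and K are skipped before canonical is ever called, so the positions are the same as in the example above.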
SpacedKmers
The SpacedKmers iterator iterates kmers with a fixed step size between kmers. For example, for a K of 4 and a step size of 3, consecutive output kmers overlap by a single nucleotide, like so:
seq: TGATGCGTAGTG
     TGAT
        TGCG
           GTAG
Hence, if FwKmers is analogous to UnitRange, SpacedKmers is analogous to StepRange.
Kmers.SpacedKmers — Type
SpacedKmers{A <: Alphabet, K, J, S} <: AbstractKmerIterator{A, K}
Iterator of kmers with step size. J signifies the step size, S the type of the underlying sequence, and the eltype of the iterator is Kmer{A, K, N} with the appropriate N.
For example, a SpacedKmers{AminoAcidAlphabet, 3, 5, Vector{UInt8}} sampling over seq::Vector{UInt8} will sample all kmers corresponding to seq[1:3], seq[6:8], seq[11:13] etc.
See also: each_codon, FwKmers
Examples:
julia> collect(SpacedDNAMers{3, 2}("AGCGTATA"))
3-element Vector{Kmer{DNAAlphabet{2}, 3, 1}}:
AGC
CGT
TAT
Kmers.SpacedDNAMers — Type
SpacedDNAMers{K, J, S}: Alias for SpacedKmers{DNAAlphabet{2}, K, J, S}
Kmers.SpacedRNAMers — Type
SpacedRNAMers{K, J, S}: Alias for SpacedKmers{RNAAlphabet{2}, K, J, S}
Kmers.SpacedAAMers — Type
SpacedAAMers{K, J, S}: Alias for SpacedKmers{AminoAcidAlphabet, K, J, S}
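The StepRange analogy can be made concrete with a short sketch: with K = 4 and a step size of 3, the kmers start at the indices 1:3:length(s)-3. The comparison below assumes that both BioSequences and Kmers are loaded; the variable names are illustrative.

```julia
using BioSequences, Kmers

s = "TGATGCGTAGTG"

# Kmers of length 4, stepping 3 positions at a time
spaced = collect(SpacedDNAMers{4, 3}(s))

# The same kmers built by hand from a StepRange of starting indices
manual = [DNAKmer{4}(s[i:i+3]) for i in 1:3:length(s)-3]

same_kmers = spaced == manual
```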
The convenience function each_codon returns a SpacedKmers iterator with a K value of 3 and a step size of 3:
Kmers.each_codon — Function
each_codon(s::BioSequence{<:Union{DNAAlphabet, RNAAlphabet}})
each_codon(::Type{<:Union{DNA, RNA}}, s)
Construct an iterator of nucleotide 3-mers with step size 3 from s. The sequence s may be an RNA or DNA biosequence, in which case the element type is inferred, or the element type may be specified explicitly, in which case s may be a byte-like sequence such as a String or Vector{UInt8}.
This function returns a SpacedKmers iterator.
See also: SpacedKmers
Examples:
julia> collect(each_codon(DNA, "TGACGATCGAC"))
3-element Vector{Kmer{DNAAlphabet{2}, 3, 1}}:
TGA
CGA
TCG
The AbstractKmerIterator interface
It's very likely that users of Kmers.jl need to implement their own custom kmer iterators, in which case they should subtype AbstractKmerIterator.
Kmers.AbstractKmerIterator — Type
AbstractKmerIterator{A <: Alphabet, K}
Abstract type for kmer iterators. The element type is Kmer{A, K, N}, with the appropriately derived N.
Functions to implement:
Base.iterate
Base.length or Base.IteratorSize if not HasLength
At the moment, there is no real interface implemented for this abstract type, other than that AbstractKmerIterator{A, K} needs to iterate Kmer{A, K}.
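A custom iterator following the requirements above might look like the sketch below. RevOrderDNAMers is a hypothetical type that is not part of Kmers.jl: it yields the DNA 3-mers (or K-mers) of an ASCII string from the last position to the first. The sketch assumes that both BioSequences and Kmers are loaded, that the input string is pure ASCII, and that the element type is supplied by the abstract supertype as described in the docstring above.

```julia
using BioSequences, Kmers

# Hypothetical example type, not part of Kmers.jl.
# Yields the kmers of an ASCII DNA string, last kmer first.
struct RevOrderDNAMers{K} <: Kmers.AbstractKmerIterator{DNAAlphabet{2}, K}
    seq::String
end

# Number of kmers: one per valid starting position (ASCII input assumed)
Base.length(it::RevOrderDNAMers{K}) where {K} = max(0, ncodeunits(it.seq) - K + 1)

# The iteration state is the starting index of the next kmer to yield
function Base.iterate(it::RevOrderDNAMers{K}, i::Int=length(it)) where {K}
    i < 1 && return nothing
    kmer = DNAKmer{K}(it.seq[i:i+K-1])
    return (kmer, i - 1)
end

rev = collect(RevOrderDNAMers{3}("AGCGT"))
fwd = collect(FwDNAMers{3}("AGCGT"))
```

Since Base.length is defined, the default HasLength iterator size applies and no Base.IteratorSize method is needed. Here rev should contain the same kmers as fwd in the opposite order.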