Iteration
Most applications of kmers extract multiple kmers from an underlying sequence. To facilitate this, Kmers.jl implements a few basic kmer iterators, most of which are subtypes of AbstractKmerIterator.
The underlying sequence can be a BioSequence, AbstractString, or AbstractVector{UInt8}. In the latter case, if the alphabet of the element type implements BioSequences.AsciiAlphabet, the vector will be treated as a vector of ASCII characters.
Similarly to the rules when constructing kmers directly, DNA and RNA are treated interchangeably when the underlying sequence is a BioSequence. When the underlying sequence is a string or byte vector, however, U and T are considered different: for example, an RNA kmer cannot be constructed from a string containing T, nor a DNA kmer from a string containing U:
julia> only(FwDNAMers{3}(rna"UGU"))
DNA 3-mer:
TGT
julia> only(FwDNAMers{3}("UGU"))
ERROR:
[...]
The following kmer iterators are implemented:
FwKmers
The most basic kmer iterator is FwKmers, which simply iterates every kmer, in order:
Kmers.FwKmers - Type
FwKmers{A <: Alphabet, K, S} <: AbstractKmerIterator{A, K}
Iterator of forward kmers. S signifies the type of the underlying sequence, and the eltype of the iterator is Kmer{A, K, N} with the appropriate N. The elements in a FwKmers{A, K, S}(s::S) correspond to all the Kmer{A, K} in s, in order.
Can be constructed more conveniently with the constructor FwDNAMers{K}(s), and similarly with FwRNAMers and FwAAMers.
Examples:
julia> s = "AGCGTATA";
julia> v = collect(FwDNAMers{3}(s));
julia> v == [DNAKmer{3}(s[i:i+2]) for i in 1:length(s)-2]
true
julia> eltype(v), length(v)
(Kmer{DNAAlphabet{2}, 3, 1}, 6)
julia> collect(FwRNAMers{3}(rna"UGCDUGAVC"))
ERROR: cannot encode D in RNAAlphabet{2}
Kmers.FwDNAMers - Type
FwDNAMers{K, S}: Alias for FwKmers{DNAAlphabet{2}, K, S}
Kmers.FwRNAMers - Type
FwRNAMers{K, S}: Alias for FwKmers{RNAAlphabet{2}, K, S}
Kmers.FwAAMers - Type
FwAAMers{K, S}: Alias for FwKmers{AminoAcidAlphabet, K, S}
FwRvIterator
This iterates over a nucleic acid sequence. For every kmer it encounters, it outputs the kmer and its reverse complement.
Kmers.FwRvIterator - Type
FwRvIterator{A <: NucleicAcidAlphabet, K, S}
Iterates 2-tuples of (forward, reverse_complement) of every kmer of type Kmer{A, K} from the underlying sequence, in order. S signifies the type of the underlying sequence. This is typically more efficient than iterating over a FwKmers and computing reverse_complement on every element.
See also: FwKmers, CanonicalKmers
Examples:
julia> collect(FwRvIterator{DNAAlphabet{4}, 3}("AGCGT"))
3-element Vector{Tuple{Kmer{DNAAlphabet{4}, 3, 1}, Kmer{DNAAlphabet{4}, 3, 1}}}:
(AGC, GCT)
(GCG, CGC)
(CGT, ACG)
julia> collect(FwRvIterator{DNAAlphabet{2}, 3}("AGNGT"))
ERROR: cannot encode 0x4e (Char 'N') in DNAAlphabet{2}
[...]
CanonicalKmers
This iterator is similar to FwKmers, but for each Kmer encountered, it returns the canonical kmer.
The canonical kmer is defined as the lexicographically smaller of a kmer and its reverse complement. That is, if FwKmers would iterate TCAC, then CanonicalKmers would return GTGA, as this is the reverse complement of TCAC and sorts before TCAC.
CanonicalKmers is useful for summarizing the kmer composition of sequences whose strandedness is unknown.
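The definition of a canonical kmer can be illustrated on plain strings, without Kmers.jl. The following is a minimal sketch and not how the package computes canonical kmers: the helper names revcomp_str and canonical_str are hypothetical, and the code assumes an uppercase DNA string with no ambiguous symbols.

```julia
# Hypothetical helpers for illustration only; Kmers.jl operates on packed
# kmers, not on strings.
const COMPLEMENT = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')

# Reverse complement of an uppercase, unambiguous DNA string.
revcomp_str(s::AbstractString) = join(COMPLEMENT[c] for c in reverse(s))

# The canonical form is the lexicographically smaller of the two strands.
canonical_str(s::AbstractString) = min(s, revcomp_str(s))

canonical_str("TCAC")  # "GTGA"
```

A kmer that is already the smaller of the two strands, such as AGC (reverse complement GCT), is returned unchanged.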
Kmers.CanonicalKmers - Type
CanonicalKmers{A <: NucleicAcidAlphabet, K, S} <: AbstractKmerIterator{A, K}
Iterator of canonical nucleic acid kmers. The result of this iterator is equivalent to calling canonical on each value of a FwKmers iterator, but may be more efficient.
When counting small kmers, it may be more efficient to count FwKmers, then call canonical only once per unique kmer.
Can be constructed more conveniently with the constructors CanonicalDNAMers{K}(s) and CanonicalRNAMers{K}(s).
Examples:
julia> collect(CanonicalRNAMers{3}("AGCGA"))
3-element Vector{Kmer{RNAAlphabet{2}, 3, 1}}:
AGC
CGC
CGA
Kmers.CanonicalDNAMers - Type
CanonicalDNAMers{K, S}: Alias for CanonicalKmers{DNAAlphabet{2}, K, S}
Kmers.CanonicalRNAMers - Type
CanonicalRNAMers{K, S}: Alias for CanonicalKmers{RNAAlphabet{2}, K, S}
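The counting advice in the CanonicalKmers docstring (count forward kmers first, then canonicalize once per unique kmer) might look roughly like the sketch below. It is an illustration, not part of Kmers.jl: the function name canonical_counts and the use of Val to pass K are assumptions made for this example, and whether it is actually faster depends on K and the input.

```julia
using Kmers, BioSequences

# Hypothetical helper: canonical kmer counts for a DNA sequence.
function canonical_counts(seq, ::Val{K}) where {K}
    it = FwDNAMers{K}(seq)

    # Pass 1: count forward kmers; no reverse complement is computed here.
    fw = Dict{eltype(it), Int}()
    for mer in it
        fw[mer] = get(fw, mer, 0) + 1
    end

    # Pass 2: call `canonical` once per distinct kmer and merge the counts
    # of a kmer with those of its reverse complement.
    canon = Dict{eltype(it), Int}()
    for (mer, n) in fw
        c = canonical(mer)
        canon[c] = get(canon, c, 0) + n
    end
    return canon
end

counts = canonical_counts("AGCGTACGCTAGC", Val(3))
```

The result should match what counting the output of CanonicalDNAMers{3} directly would give, since the two approaches are described as equivalent.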
UnambiguousKmers
UnambiguousKmers iterates kmers of unambiguous nucleotides (that is, kmers of the alphabets DNAAlphabet{2} or RNAAlphabet{2}). Any kmers containing ambiguous nucleotides such as W or N are skipped.
Kmers.UnambiguousKmers - Type
UnambiguousKmers{A <: TwoBit, K, S}
Iterator of (kmer, index), where kmer is a 2-bit nucleic acid kmer in the underlying sequence, and index::Int is the starting position of that kmer in the sequence. The extracted kmers differ from those of FwKmers in that any kmers containing ambiguous nucleotides are skipped, whereas with FwKmers, encountering an ambiguous nucleotide results in an error.
This iterator can be constructed more conveniently with the constructors UnambiguousDNAMers{K}(s) and UnambiguousRNAMers{K}(s).
To obtain canonical unambiguous kmers, simply call canonical on each kmer output by UnambiguousKmers.
Examples:
julia> it = UnambiguousRNAMers{4}(dna"TGAGCWKCATC");
julia> collect(it)
3-element Vector{Tuple{Kmer{RNAAlphabet{2}, 4, 1}, Int64}}:
(UGAG, 1)
(GAGC, 2)
(CAUC, 8)
Kmers.UnambiguousDNAMers - Type
UnambiguousDNAMers{K, S}: Alias for UnambiguousKmers{DNAAlphabet{2}, K, S}
Kmers.UnambiguousRNAMers - Type
UnambiguousRNAMers{K, S}: Alias for UnambiguousKmers{RNAAlphabet{2}, K, S}
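Because UnambiguousKmers yields (kmer, index) tuples, obtaining canonical unambiguous kmers means canonicalizing the first element of each tuple. A short sketch of that step, reusing the sequence from the docstring example (the variable names are illustrative only):

```julia
using Kmers, BioSequences

seq = dna"TGAGCWKCATC"

# Destructure each (kmer, index) tuple, canonicalize the kmer, and keep the
# original starting position alongside it.
canon = [(canonical(mer), i) for (mer, i) in UnambiguousDNAMers{4}(seq)]
```

The positions are unchanged by canonicalization, so kmers overlapping the ambiguous W and K are still skipped.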
SpacedKmers
The SpacedKmers iterator iterates kmers with a fixed step size between k-mers. For example, for a K of 4 and a step size of 3, consecutive output kmers overlap by a single nucleotide, like so:
seq: TGATGCGTAGTG
     TGAT
        TGCG
           GTAG
Hence, if FwKmers are analogous to UnitRange, SpacedKmers is analogous to StepRange.
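The StepRange analogy can be made concrete on a plain string, without Kmers.jl. This is a sketch of the indexing only: spaced_windows is a hypothetical helper, and it assumes an ASCII string so that string indices and positions coincide.

```julia
# Hypothetical helper illustrating which windows a step size selects.
function spaced_windows(seq::AbstractString, k::Int, step::Int)
    # Window start positions form a StepRange. The last start must leave
    # room for a full window of length k, so trailing partial windows
    # are dropped.
    starts = 1:step:(length(seq) - k + 1)
    return [seq[i:(i + k - 1)] for i in starts]
end

spaced_windows("TGATGCGTAGTG", 4, 3)  # ["TGAT", "TGCG", "GTAG"]
spaced_windows("AGCGT", 3, 1)         # ["AGC", "GCG", "CGT"]
```

With a step of 1 the start positions reduce to a unit-step range and every window is visited, which corresponds to what FwKmers does.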
Kmers.SpacedKmers - Type
SpacedKmers{A <: Alphabet, K, J, S} <: AbstractKmerIterator{A, K}
Iterator of kmers with step size. J signifies the step size, S the type of the underlying sequence, and the eltype of the iterator is Kmer{A, K, N} with the appropriate N.
For example, a SpacedKmers{AminoAcidAlphabet, 3, 5, Vector{UInt8}} sampling over seq::Vector{UInt8} will sample all kmers corresponding to seq[1:3], seq[6:8], seq[11:13] etc.
See also: each_codon, FwKmers
Examples:
julia> collect(SpacedDNAMers{3, 2}("AGCGTATA"))
3-element Vector{Kmer{DNAAlphabet{2}, 3, 1}}:
AGC
CGT
TAT
Kmers.SpacedDNAMers - Type
SpacedDNAMers{K, J, S}: Alias for SpacedKmers{DNAAlphabet{2}, K, J, S}
Kmers.SpacedRNAMers - Type
SpacedRNAMers{K, J, S}: Alias for SpacedKmers{RNAAlphabet{2}, K, J, S}
Kmers.SpacedAAMers - Type
SpacedAAMers{K, J, S}: Alias for SpacedKmers{AminoAcidAlphabet, K, J, S}
The convenience function each_codon returns a SpacedKmers with a K value of 3 and a step size of 3:
Kmers.each_codon - Function
each_codon(s::BioSequence{<:Union{DNAAlphabet, RNAAlphabet}})
each_codon(::Type{<:Union{DNA, RNA}}, s)
Construct an iterator of nucleotide 3-mers with step size 3 from s. The sequence s may be an RNA or DNA biosequence, in which case the element type is inferred, or the element type may be specified explicitly, in which case s may be a byte-like sequence such as a String or Vector{UInt8}.
This function returns a SpacedKmers iterator.
See also: SpacedKmers
Examples:
julia> collect(each_codon(DNA, "TGACGATCGAC"))
3-element Vector{Kmer{DNAAlphabet{2}, 3, 1}}:
TGA
CGA
TCG
The AbstractKmerIterator interface
It's very likely that users of Kmers.jl will need to implement their own custom kmer iterators, in which case they should subtype AbstractKmerIterator.
Kmers.AbstractKmerIterator - Type
AbstractKmerIterator{A <: Alphabet, K}
Abstract type for kmer iterators. The element type is Kmer{A, K, N}, with the appropriately derived N.
Functions to implement:
Base.iterate
Base.length, or Base.IteratorSize if not HasLength
At the moment, there is no real interface implemented for this abstract type, other than that AbstractKmerIterator{A, K} needs to iterate Kmer{A, K}.
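As a rough illustration of the two functions listed above, the sketch below wraps an existing kmer iterator and yields only kmers equal to their own reverse complement. It is not part of Kmers.jl: the type name PalindromicMers is hypothetical, and the code assumes that the wrapped iterator follows the standard Julia iteration protocol. Because the number of matches is not known in advance, it declares Base.SizeUnknown rather than implementing Base.length.

```julia
using Kmers, BioSequences

# Hypothetical custom iterator: kmers from `inner` that are their own
# reverse complement.
struct PalindromicMers{A <: NucleicAcidAlphabet, K, I <: AbstractKmerIterator{A, K}} <: AbstractKmerIterator{A, K}
    inner::I
end

# Convenience constructor that reads A and K off the wrapped iterator.
function PalindromicMers(inner::AbstractKmerIterator{A, K}) where {A <: NucleicAcidAlphabet, K}
    return PalindromicMers{A, K, typeof(inner)}(inner)
end

# The number of palindromic kmers is unknown up front, so no length.
Base.IteratorSize(::Type{<:PalindromicMers}) = Base.SizeUnknown()

# Advance the wrapped iterator until a palindromic kmer is found,
# reusing the wrapped iterator's state as our own state.
function Base.iterate(it::PalindromicMers, state...)
    next = iterate(it.inner, state...)
    while next !== nothing
        mer, st = next
        mer == reverse_complement(mer) && return (mer, st)
        next = iterate(it.inner, st)
    end
    return nothing
end

palindromes = collect(PalindromicMers(FwDNAMers{4}("AACGTTAGCT")))
```

Here the forward 4-mers ACGT and AGCT are the only ones equal to their reverse complements, so they are the only elements collected.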