Iteration

Most applications of kmers extract multiple kmers from an underlying sequence. To facilitate this, Kmers.jl implements a few basic kmer iterators, most of which are subtypes of AbstractKmerIterator.

The underlying sequence can be a BioSequence, AbstractString, or AbstractVector{UInt8}. In the last case, if the alphabet of the element type implements BioSequences.AsciiAlphabet, the vector will be treated as a vector of ASCII characters.
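As a minimal sketch of the three source types, the same 3-mers should come out whether the source is a BioSequence, a string, or a byte vector. The sequence is purely illustrative, and the byte vector is built here with Base's codeunits.

```julia
using BioSequences, Kmers

s = "AGCGTATA"

# BioSequence source
from_bioseq = collect(FwDNAMers{3}(LongDNA{4}(s)))
# AbstractString source
from_string = collect(FwDNAMers{3}(s))
# Vector{UInt8} source, read as ASCII because DNAAlphabet{2} supports ASCII encoding
from_bytes = collect(FwDNAMers{3}(collect(codeunits(s))))

# All three sources are expected to yield identical kmers
same_kmers = from_bioseq == from_string == from_bytes
```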

As when constructing kmers directly, DNA and RNA are treated interchangeably when the underlying sequence is a BioSequence. When the underlying sequence is a string or byte vector, however, U and T are considered different: for example, an RNA kmer cannot be constructed from a string containing T, nor a DNA kmer from a string containing U:

julia> only(FwDNAMers{3}(rna"UGU"))
DNA 3-mer:
TGT

julia> only(FwDNAMers{3}("UGU"))
ERROR:
[...]

The following kmer iterators are implemented:

`FwKmers`

The most basic kmer iterator is FwKmers, which simply iterates every kmer, in order:

Kmers.FwKmers — Type

FwKmers{A <: Alphabet, K, S} <: AbstractKmerIterator{A, K}

Iterator of forward kmers. S signifies the type of the underlying sequence, and the eltype of the iterator is Kmer{A, K, N} with the appropriate N. The elements in a FwKmers{A, K, S}(s::S) correspond to all the Kmer{A, K} in s, in order.

Can be constructed more conveniently with the constructor FwDNAMers{K}(s), and similarly with FwRNAMers and FwAAMers.

Examples:

julia> s = "AGCGTATA";

julia> v = collect(FwDNAMers{3}(s));

julia> v == [DNAKmer{3}(s[i:i+2]) for i in 1:length(s)-2]
true

julia> eltype(v), length(v)
(Kmer{DNAAlphabet{2}, 3, 1}, 6)

julia> collect(FwRNAMers{3}(rna"UGCDUGAVC"))
ERROR: cannot encode D in RNAAlphabet{2}


Kmers.FwDNAMers — Type

FwDNAMers{K, S}: Alias for FwKmers{DNAAlphabet{2}, K, S}


Kmers.FwRNAMers — Type

FwRNAMers{K, S}: Alias for FwKmers{RNAAlphabet{2}, K, S}


Kmers.FwAAMers — Type

FwAAMers{K, S}: Alias for FwKmers{AminoAcidAlphabet, K, S}


`FwRvIterator`

This iterates over a nucleic acid sequence. For every kmer it encounters, it outputs the kmer and its reverse complement.

Kmers.FwRvIterator — Type

FwRvIterator{A <: NucleicAcidAlphabet, K, S}

Iterates 2-tuples of (forward, reverse_complement) of every kmer of type Kmer{A, K} from the underlying sequence, in order. S signifies the type of the underlying sequence. This is typically more efficient than iterating over a FwKmers and computing reverse_complement on every element.
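The equivalence described above can be sketched as follows. The two-parameter constructor form FwRvIterator{A, K}(s), with S inferred from the argument, is an assumption made by analogy with the other iterators on this page; the sequence is illustrative.

```julia
using BioSequences, Kmers

seq = dna"AGCGT"

# Assumed constructor form: alphabet and K given explicitly, S inferred from the argument
pairs_fwrv = collect(FwRvIterator{DNAAlphabet{2}, 3}(seq))

# The same pairs computed the slower way, with one reverse_complement call per kmer
pairs_manual = [(kmer, reverse_complement(kmer)) for kmer in FwDNAMers{3}(seq)]

# Both approaches are expected to give the same (forward, reverse_complement) tuples
same_pairs = pairs_fwrv == pairs_manual
```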

`CanonicalKmers`

This iterator is similar to FwKmers. However, for each kmer encountered, it returns the canonical kmer.

The canonical kmer is defined as the lexicographically smaller of a kmer and its reverse complement. That is, if FwKmers would iterate TCAC, then CanonicalKmers would return GTGA, as this is the reverse complement of TCAC and sorts before TCAC.

CanonicalKmers is useful for summarizing the kmer composition of sequences whose strandedness is unknown.
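A small sketch of this definition, using the TCAC example from the text. The six-nucleotide sequence in the second half is made up for illustration.

```julia
using BioSequences, Kmers

kmer = DNAKmer{4}("TCAC")
rc = reverse_complement(kmer)   # GTGA
canon = canonical(kmer)         # the smaller of kmer and rc

# CanonicalDNAMers should agree with canonicalizing each forward kmer by hand
seq = "TCACGA"
by_iterator = collect(CanonicalDNAMers{4}(seq))
by_hand = [canonical(m) for m in FwDNAMers{4}(seq)]
```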

Kmers.CanonicalKmers — Type

CanonicalKmers{A <: NucleicAcidAlphabet, K, S} <: AbstractKmerIterator{A, K}

Iterator of canonical nucleic acid kmers. The result of this iterator is equivalent to calling canonical on each value of a FwKmers iterator, but may be more efficient.

Note

When counting small kmers, it may be more efficient to count FwKmers, then call canonical only once per unique kmer.

Can be constructed more conveniently with the constructors CanonicalDNAMers{K}(s) and CanonicalRNAMers{K}(s).

Examples:

julia> collect(CanonicalRNAMers{3}("AGCGA"))
3-element Vector{Kmer{RNAAlphabet{2}, 3, 1}}:
 AGC
 CGC
 CGA
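The note about counting small kmers can be sketched as below: forward kmers are counted first, and canonical is then called once per unique kmer when merging the counts. The sequence and variable names are illustrative, and whether this is actually faster depends on K and on the data.

```julia
using BioSequences, Kmers

seq = dna"TCACGTGATCAC"
it = FwDNAMers{4}(seq)

# Step 1: count every forward kmer
fw_counts = Dict{eltype(it), Int}()
for kmer in it
    fw_counts[kmer] = get(fw_counts, kmer, 0) + 1
end

# Step 2: call canonical once per unique kmer and merge the counts
canon_counts = Dict{eltype(it), Int}()
for (kmer, n) in fw_counts
    c = canonical(kmer)
    canon_counts[c] = get(canon_counts, c, 0) + n
end

# For comparison: counting the output of CanonicalDNAMers directly
direct = Dict{eltype(it), Int}()
for kmer in CanonicalDNAMers{4}(seq)
    direct[kmer] = get(direct, kmer, 0) + 1
end
```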


Kmers.CanonicalDNAMers — Type

CanonicalDNAMers{K, S}: Alias for CanonicalKmers{DNAAlphabet{2}, K, S}


Kmers.CanonicalRNAMers — Type

CanonicalRNAMers{K, S}: Alias for CanonicalKmers{RNAAlphabet{2}, K, S}


`UnambiguousKmers`

UnambiguousKmers iterates kmers of unambiguous nucleotides (that is, kmers of the alphabets DNAAlphabet{2} or RNAAlphabet{2}). Any kmers containing ambiguous nucleotides such as W or N are skipped.

Kmers.UnambiguousKmers — Type

UnambiguousKmers{A <: TwoBit, K, S}

Iterator of (kmer, index), where kmer is a 2-bit nucleic acid kmer in the underlying sequence, and index::Int is the starting position of the kmer in the sequence. The extracted kmers differ from those of FwKmers in that any kmers containing ambiguous nucleotides are skipped, whereas with FwKmers, encountering an ambiguous nucleotide results in an error.

This iterator can be constructed more conveniently with the constructors UnambiguousDNAMers{K}(s) and UnambiguousRNAMers{K}(s).

Note

To obtain canonical unambiguous kmers, simply call canonical on each kmer output by UnambiguousKmers.

Examples:

julia> it = UnambiguousRNAMers{4}(dna"TGAGCWKCATC");

julia> collect(it)
3-element Vector{Tuple{Kmer{RNAAlphabet{2}, 4, 1}, Int64}}:
 (UGAG, 1)
 (GAGC, 2)
 (CAUC, 8)
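Following the note on canonical unambiguous kmers, a minimal sketch that keeps each start index and replaces the kmer with its canonical form. It uses the DNA variant of the iterator on the same input sequence.

```julia
using BioSequences, Kmers

it = UnambiguousDNAMers{4}(dna"TGAGCWKCATC")

# Keep the start index, replace each kmer with its canonical form
canon = [(canonical(kmer), index) for (kmer, index) in it]

# Start positions of the kmers that survived the ambiguity filter
positions = [index for (_, index) in canon]
```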


Kmers.UnambiguousDNAMers — Type

UnambiguousDNAMers{K, S}: Alias for UnambiguousKmers{DNAAlphabet{2}, K, S}


Kmers.UnambiguousRNAMers — Type

UnambiguousRNAMers{K, S}: Alias for UnambiguousKmers{RNAAlphabet{2}, K, S}


`SpacedKmers`

The SpacedKmers iterator iterates kmers with a fixed step size between kmers. For example, for a K of 4 and a step size of 3, consecutive output kmers overlap by a single nucleotide, like so:

seq: TGATGCGTAGTG
     TGAT
        TGCG
           GTAG

Hence, if FwKmers is analogous to UnitRange, SpacedKmers is analogous to StepRange.

Kmers.SpacedKmers — Type

SpacedKmers{A <: Alphabet, K, J, S} <: AbstractKmerIterator{A, K}

Iterator of kmers with step size. J signifies the step size, S the type of the underlying sequence, and the eltype of the iterator is Kmer{A, K, N} with the appropriate N.

For example, a SpacedKmers{AminoAcidAlphabet, 3, 5, Vector{UInt8}} sampling over seq::Vector{UInt8} will sample all kmers corresponding to seq[1:3], seq[6:8], seq[11:13] etc.
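The K = 4, step 3 case from the diagram above can be sketched as follows. The three-parameter constructor form SpacedKmers{A, K, J}(s), with S inferred from the argument, is an assumption made by analogy with the other iterators; it is not shown in the docstring itself.

```julia
using BioSequences, Kmers

seq = "TGATGCGTAGTG"

# Assumed constructor form: alphabet, K = 4 and step J = 3 given explicitly
spaced = collect(SpacedKmers{DNAAlphabet{2}, 4, 3}(seq))

# Equivalent selection from the forward kmers: every third one, starting at the first
every_third = collect(FwDNAMers{4}(seq))[1:3:end]
```

The comparison with every_third mirrors the UnitRange versus StepRange analogy: stepping through the forward kmers with 1:3:end selects the same start positions 1, 4 and 7.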

The `AbstractKmerIterator` interface

Users of Kmers.jl will likely need to implement their own custom kmer iterators, in which case they should subtype AbstractKmerIterator.

Kmers.AbstractKmerIterator — Type

AbstractKmerIterator{A <: Alphabet, K}

Abstract type for kmer iterators. The element type is Kmer{A, K, N}, with the appropriately derived N.

Functions to implement:

Base.iterate
Base.length or Base.IteratorSize if not HasLength


At the moment, there is no real interface implemented for this abstract type, other than that AbstractKmerIterator{A, K} needs to iterate Kmer{A, K}.
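As a hedged sketch of what a custom iterator might look like, the hypothetical type below yields the forward DNA kmers of a sequence in reverse positional order. ReversedDNAMers is not part of Kmers.jl; it only implements the two functions listed above, Base.iterate and Base.length, and assumes the abstract type can be subtyped directly under its qualified name.

```julia
using BioSequences, Kmers

# Hypothetical iterator, not part of Kmers.jl:
# yields the forward 2-bit DNA kmers of `seq`, last kmer first
struct ReversedDNAMers{K} <: Kmers.AbstractKmerIterator{DNAAlphabet{2}, K}
    seq::LongDNA{2}
end

# Number of kmers: one per valid start position
Base.length(it::ReversedDNAMers{K}) where {K} = max(0, length(it.seq) - K + 1)

# The state is the start index of the next kmer, counting down to 1
function Base.iterate(it::ReversedDNAMers{K}, i::Int = length(it)) where {K}
    i < 1 && return nothing
    kmer = DNAKmer{K}(it.seq[i:(i + K - 1)])
    return (kmer, i - 1)
end

rev = collect(ReversedDNAMers{3}(LongDNA{2}("AGCGT")))
fwd = collect(FwDNAMers{3}("AGCGT"))
```

Slicing the sequence for every kmer is simple but allocates; it is meant to show the interface, not an efficient implementation.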
