Iterating over kmers

When introducing the Kmer type, we described kmers as contiguous substrings of k nucleotides from some reference sequence.

This package therefore provides iterators over all the valid Kmer{A,K,N} values in a longer BioSequence.

Kmers.EveryKmer (Type)

An iterator over every valid overlapping T<:Kmer in a given longer BioSequence between a start and stop position.

Note

Typically, the alphabet of the Kmer type matches the alphabet of the input BioSequence. In these cases, the iterator will have Base.IteratorSize of Base.HasLength, and successive kmers produced by the iterator will overlap by K - 1 bases.

However, when iterating over kmers in a DNA or RNA sequence, you may use a Kmer type whose alphabet is a NucleicAcidAlphabet{2} while the input BioSequence has a NucleicAcidAlphabet{4}.

In this case, the iterator will skip over positions in the BioSequence with characters that are not supported by the Kmer type's NucleicAcidAlphabet{2}.

As a result, the overlap between successive kmers may not reliably be K - 1, and the iterator will have Base.IteratorSize of Base.SizeUnknown.
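The behaviour described in this note can be illustrated with a small conceptual sketch. It is written in plain Python rather than against the Kmers.jl API. The function name `every_kmer`, the `allowed` parameter, and the choice to yield `(position, kmer)` pairs are all assumptions made for illustration. It models one plausible reading of "skipping": any window that contains an unsupported symbol is dropped.

```python
def every_kmer(seq, k, allowed="ACGT"):
    """Yield (position, kmer) for every window of length k made only of allowed symbols.

    Positions are 1-based, mirroring Julia's indexing convention.
    This is a hypothetical model, not the package's implementation.
    """
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if all(base in allowed for base in window):
            yield i + 1, window


# Alphabets match: every window is valid, so there are len(seq) - k + 1 kmers
# and neighbouring kmers overlap by k - 1 bases.
clean = list(every_kmer("ACGTAC", 3))

# The N has no two-bit representation, so every window touching it is dropped
# and the number of kmers cannot be known without scanning the sequence.
gappy = list(every_kmer("ACGNTACG", 3))
```

In the second call the yielded positions jump from 1 to 5, which is why the overlap between successive kmers is no longer guaranteed to be K - 1 and why the length cannot be computed up front.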

Kmers.SpacedKmers (Type)

An iterator over every valid T<:Kmer separated by a step parameter, in a given longer BioSequence, between a start and stop position.

Note

Typically, the alphabet of the Kmer type matches the alphabet of the input BioSequence. In these cases, the iterator will have Base.IteratorSize of Base.HasLength, and successive kmers produced by the iterator will overlap by max(0, K - step) bases.

However, when iterating over kmers in a DNA or RNA sequence, you may use a Kmer type whose alphabet is a NucleicAcidAlphabet{2} while the input BioSequence has a NucleicAcidAlphabet{4}.

In this case, the iterator will skip over positions in the BioSequence with characters that are not supported by the Kmer type's NucleicAcidAlphabet{2}.

As a result, the overlap between successive kmers may not be consistent, but the reading frame will be preserved. In addition, the iterator will have Base.IteratorSize of Base.SizeUnknown.
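To make the step parameter and the preserved reading frame concrete, here is a conceptual sketch in plain Python. It does not use the Kmers.jl API. The names `spaced_kmers` and `overlap` are hypothetical, and the sketch assumes that "preserving the reading frame" means start positions stay on the grid start, start + step, start + 2*step, and so on, even when a window is rejected.

```python
def spaced_kmers(seq, k, step, allowed="ACGT"):
    """Yield (position, kmer) for windows starting every `step` bases.

    Positions are 1-based. A rejected window is dropped, but the next
    candidate still starts `step` bases later, so the frame never shifts.
    This is a hypothetical model, not the package's implementation.
    """
    for i in range(0, len(seq) - k + 1, step):
        window = seq[i:i + k]
        if all(base in allowed for base in window):
            yield i + 1, window


def overlap(k, step):
    """Bases shared by two neighbouring kmers when nothing is skipped."""
    return max(0, k - step)


# k = 3 and step = 3 reads the sequence codon by codon. The middle codon
# contains an N and is dropped, but the third codon is still read in frame.
codons = list(spaced_kmers("ATGNCGTTA", 3, 3))
```

Here the surviving kmers start at positions 1 and 7, both on the same step-3 grid, even though the kmer between them was skipped.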

Kmers.EveryCanonicalKmer (Type)

An iterator over every canonical valid overlapping T<:Kmer in a given longer BioSequence, between a start and stop position.

Note

Typically, the alphabet of the Kmer type matches the alphabet of the input BioSequence. In these cases, the iterator will have Base.IteratorSize of Base.HasLength, and successive kmers produced by the iterator will overlap by K - 1 bases.

However, when iterating over kmers in a DNA or RNA sequence, you may use a Kmer type whose alphabet is a NucleicAcidAlphabet{2} while the input BioSequence has a NucleicAcidAlphabet{4}.

In this case, the iterator will skip over positions in the BioSequence with characters that are not supported by the Kmer type's NucleicAcidAlphabet{2}.

As a result, the overlap between successive kmers may not reliably be K - 1, and the iterator will have Base.IteratorSize of Base.SizeUnknown.
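The text does not spell out what "canonical" means. The sketch below assumes the usual convention, in which the canonical form of a kmer is the lesser of the kmer and its reverse complement. It is plain Python, not the Kmers.jl API, and the names `reverse_complement`, `canonical`, and `every_canonical_kmer` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement each base, then reverse the result."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the lexicographically smaller of a kmer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def every_canonical_kmer(seq, k, allowed="ACGT"):
    """Yield (position, canonical kmer) for every valid overlapping window.

    Positions are 1-based. This is a hypothetical model, not the package's
    implementation.
    """
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if all(base in allowed for base in window):
            yield i + 1, canonical(window)


result = list(every_canonical_kmer("TTGCA", 3))
```

Under this convention a kmer and its reverse complement map to the same canonical form, so the windows TGC and GCA in the example both yield GCA.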

Kmers.SpacedCanonicalKmers (Type)

An iterator over every canonical valid T<:Kmer separated by a step parameter, in a given longer BioSequence, between a start and stop position.

Note

Typically, the alphabet of the Kmer type matches the alphabet of the input BioSequence. In these cases, the iterator will have Base.IteratorSize of Base.HasLength, and successive kmers produced by the iterator will overlap by max(0, K - step) bases.

However, when iterating over kmers in a DNA or RNA sequence, you may use a Kmer type whose alphabet is a NucleicAcidAlphabet{2} while the input BioSequence has a NucleicAcidAlphabet{4}.

In this case, the iterator will skip over positions in the BioSequence with characters that are not supported by the Kmer type's NucleicAcidAlphabet{2}.

As a result, the overlap between successive kmers may not be consistent, but the reading frame will be preserved. In addition, the iterator will have Base.IteratorSize of Base.SizeUnknown.
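A stepped, canonical iteration combines both ideas. The sketch below is plain Python and does not call the Kmers.jl API. The name `spaced_canonical_kmers` is hypothetical. It assumes that canonical means the lesser of a kmer and its reverse complement, and that skipped windows do not shift the step grid.

```python
def spaced_canonical_kmers(seq, k, step, allowed="ACGT"):
    """Yield (position, canonical kmer) for windows starting every `step` bases.

    Positions are 1-based. This is a hypothetical model, not the package's
    implementation.
    """
    complement = str.maketrans("ACGT", "TGCA")
    for i in range(0, len(seq) - k + 1, step):
        window = seq[i:i + k]
        if all(base in allowed for base in window):
            revcomp = window.translate(complement)[::-1]
            yield i + 1, min(window, revcomp)


# k = 3 with step = 2: neighbouring kmers normally share max(0, 3 - 2) = 1 base.
# The window starting at position 3 contains an N and is dropped.
hits = list(spaced_canonical_kmers("TTGNACGTA", 3, 2))
```

The surviving start positions (1, 5, 7) all remain on the step-2 grid, while the gap between positions 1 and 5 shows why the overlap is no longer uniform.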
