Iterating over kmers
When introducing the Kmer type we described kmers as contiguous sub-strings of k nucleotides of some reference sequence.
This package therefore contains functionality for iterating over all the valid Kmers{A,K,N} in a longer BioSequence.
Kmers.EveryKmer — Type
An iterator over every valid overlapping T<:Kmer in a given longer BioSequence, between a start and stop position.
Typically, the alphabet of the Kmer type matches the alphabet of the input BioSequence. In these cases, the iterator will have Base.IteratorSize of Base.HasLength, and successive kmers produced by the iterator will overlap by K - 1 bases.
However, in the specific case of iterating over kmers in a DNA or RNA sequence, you may iterate over a Kmer type whose alphabet is a NucleicAcidAlphabet{2} while the input BioSequence has a NucleicAcidAlphabet{4}.
In this case the iterator will skip over positions in the BioSequence with characters that are not supported by the Kmer type's NucleicAcidAlphabet{2}.
As a result, the overlap between successive kmers may not reliably be K - 1, and the iterator will have Base.IteratorSize of Base.SizeUnknown.
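The skipping behaviour described above can be illustrated with a minimal sketch in plain Julia. This is not the package's implementation: every_kmer_sketch is a hypothetical helper that works on ordinary strings, and pairing each kmer with its start position is an assumption made here so that the gaps are visible.

```julia
# Hypothetical helper, not part of Kmers.jl: collect (position, kmer) pairs
# for every window of length k that contains only A, C, G or T.
function every_kmer_sketch(seq::AbstractString, k::Integer)
    out = Tuple{Int,String}[]
    for i in 1:(length(seq) - k + 1)
        window = seq[i:i+k-1]
        # A window holding an ambiguous symbol (e.g. N) has no 2-bit
        # representation, so it is skipped rather than emitted.
        all(c -> c in "ACGT", window) && push!(out, (i, window))
    end
    return out
end

# No ambiguous symbols: length(seq) - k + 1 kmers, each overlapping the
# previous one by k - 1 bases.
every_kmer_sketch("ACGTAC", 3)

# One N in the input: three windows are skipped, so the number of kmers
# cannot be known without scanning the sequence.
every_kmer_sketch("ACGNTACG", 3)
```

In the second call the kmers start at positions 1, 5 and 6, so the first two do not overlap at all. This is why the length cannot be reported up front in the mixed-alphabet case.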
Kmers.SpacedKmers — Type
An iterator over every valid T<:Kmer separated by a step parameter, in a given longer BioSequence, between a start and stop position.
Typically, the alphabet of the Kmer type matches the alphabet of the input BioSequence. In these cases, the iterator will have Base.IteratorSize of Base.HasLength, and successive kmers produced by the iterator will overlap by max(0, K - step) bases.
However, in the specific case of iterating over kmers in a DNA or RNA sequence, you may iterate over a Kmer type whose alphabet is a NucleicAcidAlphabet{2} while the input BioSequence has a NucleicAcidAlphabet{4}.
In this case the iterator will skip over positions in the BioSequence with characters that are not supported by the Kmer type's NucleicAcidAlphabet{2}.
As a result, the overlap between successive kmers may not be consistent, but the reading frame will be preserved. In addition, the iterator will have Base.IteratorSize of Base.SizeUnknown.
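One way to read "the reading frame will be preserved" is that every emitted kmer still starts at start + n * step, even when intermediate windows are skipped. The sketch below illustrates that reading in plain Julia. The helper spaced_kmers_sketch is hypothetical and works on ordinary strings; it is not the package's implementation.

```julia
# Hypothetical helper, not part of Kmers.jl: visit windows of length k at
# positions 1, 1 + step, 1 + 2step, ... and keep those made only of A, C, G, T.
function spaced_kmers_sketch(seq::AbstractString, k::Integer, step::Integer)
    out = Tuple{Int,String}[]
    for i in 1:step:(length(seq) - k + 1)
        window = seq[i:i+k-1]
        # An invalid window is dropped, but the next candidate is still
        # `step` positions later, so the frame does not shift.
        all(c -> c in "ACGT", window) && push!(out, (i, window))
    end
    return out
end

# step < k: successive kmers overlap by max(0, k - step) = 1 base.
spaced_kmers_sketch("ACGTACGT", 3, 2)

# step == k: codon-like, non-overlapping windows. The window at position 4
# contains N and is skipped; the next kmer starts at 7, still in frame.
spaced_kmers_sketch("ATGNCGTAA", 3, 3)
```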
Kmers.EveryCanonicalKmer — Type
An iterator over every canonical valid overlapping T<:Kmer in a given longer BioSequence, between a start and stop position.
Typically, the alphabet of the Kmer type matches the alphabet of the input BioSequence. In these cases, the iterator will have Base.IteratorSize of Base.HasLength, and successive kmers produced by the iterator will overlap by K - 1 bases.
However, in the specific case of iterating over kmers in a DNA or RNA sequence, you may iterate over a Kmer type whose alphabet is a NucleicAcidAlphabet{2} while the input BioSequence has a NucleicAcidAlphabet{4}.
In this case the iterator will skip over positions in the BioSequence with characters that are not supported by the Kmer type's NucleicAcidAlphabet{2}.
As a result, the overlap between successive kmers may not reliably be K - 1, and the iterator will have Base.IteratorSize of Base.SizeUnknown.
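The text above does not spell out what "canonical" means. Under the usual convention, the canonical form of a kmer is the lexicographically smaller of the kmer and its reverse complement. The following sketch assumes that convention and uses plain strings; revcomp_sketch and canonical_sketch are hypothetical names, not package functions.

```julia
# Assumed convention: canonical kmer = min(kmer, reverse complement),
# with A < C < G < T. Plain-string sketch, not the Kmers.jl implementation.
const COMPLEMENT = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')

# Reverse the kmer, then complement each base.
revcomp_sketch(kmer::AbstractString) = join(COMPLEMENT[c] for c in reverse(kmer))

# Pick whichever strand's reading sorts first.
canonical_sketch(kmer::AbstractString) = min(kmer, revcomp_sketch(kmer))

canonical_sketch("ACG")  # "ACG" sorts before its reverse complement "CGT"
canonical_sketch("CGT")  # the opposite strand of "ACG" maps to the same form
canonical_sketch("TTA")  # reverse complement "TAA" sorts first
```

Because a kmer and its reverse complement share one canonical form, a canonical iterator yields the same kmer regardless of which strand was sequenced.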
Kmers.SpacedCanonicalKmers — Type
An iterator over every valid canonical T<:Kmer separated by a step parameter, in a given longer BioSequence, between a start and stop position.
Typically, the alphabet of the Kmer type matches the alphabet of the input BioSequence. In these cases, the iterator will have Base.IteratorSize of Base.HasLength, and successive kmers produced by the iterator will overlap by max(0, K - step) bases.
However, in the specific case of iterating over kmers in a DNA or RNA sequence, you may iterate over a Kmer type whose alphabet is a NucleicAcidAlphabet{2} while the input BioSequence has a NucleicAcidAlphabet{4}.
In this case the iterator will skip over positions in the BioSequence with characters that are not supported by the Kmer type's NucleicAcidAlphabet{2}.
As a result, the overlap between successive kmers may not be consistent, but the reading frame will be preserved. In addition, the iterator will have Base.IteratorSize of Base.SizeUnknown.