The Kmer type
The central type of Kmers.jl is the Kmer. A Kmer is an immutable, bitstype BioSequence, with a length known at compile time. Compared to LongSequence in BioSequences.jl, this gives to one advantage, and comes with two disadvantages:
- Kmers are much faster than
LongSequence, as they can be stored in registers. - As kmers gets longer, the code gets increasingly inefficient, as the unrolling and inlining of the immutable operations breaks down.
- Since their length is part of their type, any operation that results in a kmer whose length cannot be determined at compile time will be type unstable. This includes slicing a kmer, pushing and popping it, and other operations.
The Kmer type is (roughly) defined as
struct Kmer{A <: Alphabet, K, N} <: BioSequence{A}
x::NTuple{N, UInt}
endWhere:
Ais theAlphabetas defined in BioSequences.jl.Kis the length.Nis an extra type parameter derived from the first two used for the length of the tuple, which exists only because Julia does not allow computed type parameters.
Given a Kmer{A, K}, the N parameter may be computed using the function derive_type:
julia> derive_type(Kmer{AminoAcidAlphabet, 6})
Kmer{AminoAcidAlphabet, 6, 1}Construction
Kmers can be constructed from a BioSequence or AbstractString by explicitly specifying the length of the sequence:
julia> Kmer{DNAAlphabet{2}, 5, 1}("TAGCT")
DNA 5-mer:
TAGCTThe final type parameter N can be elided, in which case it will be inferred:
julia> Kmer{DNAAlphabet{2}, 5}("TAGCT")
DNA 5-mer:
TAGCTKmers with alphabets DNAAlphabet{2}, RNAAlphabet{2} and AminoAcidAlphabet can be created with the type aliases DNAKmer, RNAKmer and AAKmer:
julia> DNAKmer{3}("tag")
DNA 3-mer:
TAG
julia> AAKmer{5}("PWYSK")
AminoAcid 5-mer:
PWYSKFor kmers with an Alphabet that implement the trait BioSequences.AsciiAlphabet, they can also be constructed from AbstractVector{UInt8}, in which case the vector is interpreted as being bytes of ASCII text:
julia> AAKmer{3}([0x65, 0x67, 0x7a])
AminoAcid 3-mer:
EGZWhen constructing from an AbstractString (or byte vector), uracil (U) and thymine T are treated differently - a U cannot be read as thymine:
julia> DNAKmer{3}("UAG")
ERROR: cannot encode 0x55 (Char 'U') in DNAAlphabet{2}
[...]However, when constructing from a BioSequence, these nucleotides are considered interchangeable:
julia> RNAKmer{4}(dna"TATC")
RNA 4-mer:
UAUCFinally, kmers can be constructed with a string literal @mer_str, where the string must be appended with d for DNA, r for RNA, or a for amino acid:
julia> mer"UGCUGA"r
RNA 6-mer:
UGCUGA
julia> mer"EDEHL"a
AminoAcid 5-mer:
EDEHLSince the literals produce the kmer at parse time and inserts the kmer directly into the abstract syntax tree, this will always be type stable, and the overhead related to parsing the string will not be paid.
In the following example, each iteration takes less than 1 nanosecond, which implies that parsing the string literal mer"AAA"d is not done in the loop at runtime:
julia> function count_aaas(dna)
x = 0
for kmer in FwDNAMers{3}(dna)
# The parsing happens once here, when the
# code is parsed, and is fine to have in the loop
x += kmer == mer"AAA"d
end
x
end;
julia> seq = randseq(DNAAlphabet{2}(), 100_000_000);
julia> count_aaas(seq); # compile
julia> @time count_aaas(seq) # about 1ns per iteration
0.088556 seconds (1 allocation: 16 bytes)
1561361Indexing
Kmers support most normal indexing, such as scalar indexing:
julia> mer"CAGCU"r[3]
RNA_GSlicing
julia> mer"AGGCTA"d[2:5]
DNA 4-mer:
GGCTAnd indexing with boolean vectors, and vectors of indices:
julia> m = mer"MDGKRY"a;
julia> m[[true, false, true, true, false, true]]
AminoAcid 4-mer:
MGKY
julia> m[[4,2]]
AminoAcid 2-mer:
KDA note on type stability
Except scalar indexing which always returns a single symbol, all the operations above are type unstable, since the length (and thus type) of the resulting kmer depends on the input value, not its type.
However, type unstable functions may be type-stable, if the indexing value is known at compile time, and the Julia compiler uses constant folding:
julia> f(x) = x[2:5]; # 2:5 is a compile time constant
julia> Test.@inferred f(mer"UCGUAGC"r)
RNA 4-mer:
CGUAReference
Kmers.Kmer — TypeKmer{A<:Alphabet,K,N} <: BioSequence{A}An immutable bitstype for representing k-mers - short BioSequences of a fixed length K. Since they can be stored directly in registers, Kmers are generally the most efficient type of BioSequence, when K is small and known at compile time.
The N parameter is derived from A and K and is not a free parameter.
See also: DNAKmer, RNAKmer, AAKmer, AbstractKmerIterator
Examples
julia> RNAKmer{5}("ACGUC")
RNA 5-mer:
ACGUC
julia> Kmer{DNAAlphabet{4}, 6}(dna"TGCTTA")
DNA 6-mer:
TGCTTA
julia> AAKmer{5}((lowercase(i) for i in "KLWYR"))
AminoAcid 5-mer:
KLWYR
julia> RNAKmer{3}("UAUC") # wrong length
ERROR:
[...]Kmers.derive_type — Functionderive_type(::Type{Kmer{A, K}}) -> Type{Kmer{A, K, N}}Compute the fully parameterized kmer type from only the parameters A and K.
Kmers.Mer — TypeMer{K}Alias for Kmer{<:Alphabet, K}. Useful to dispatch on K-mers without regard for the alphabat
Example
julia> mer"DEKR"a isa Mer{4}
true
julia> DNAKmer{6}("TGATCA") isa Mer{6}
true
julia> RNACodon <: Mer{3}
trueKmers.@mer_str — Macro@mer_str -> KmerConstruct a Kmer from the given string. The macro must be used with a flag after the input string, e.g. d in mer"TAG"d or a in mer"PCW"a, signifying the alphabet of the kmer. The flags d = DNAAlphabet{2}, r = RNAAlphabet{2} and a = AminoAcidAlphabet are recognized.
Because the macro is resolved and the kmer is created at parse time, the macro is type stable, and may be used in high performance code.
Examples
julia> mer"UGCUA"r
RNA 5-mer:
UGCUA
julia> mer"YKVSTEDLLKKR"a
AminoAcid 12-mer:
YKVSTEDLLKKR
julia> mer"TATTAGCA"d
DNA 8-mer:
TATTAGCAKmers.DNAKmer — TypeAlias for Kmer{DNAAlphabet{2},K,N}
Kmers.RNAKmer — TypeAlias for Kmer{RNAAlphabet{2},K,N}
Kmers.AAKmer — TypeAlias for Kmer{AminoAcidAlphabet,K,N}
Kmers.DNACodon — TypeAlias for DNAKmer{3,1}
Kmers.RNACodon — TypeAlias for RNAKmer{3,1}
Kmers.as_integer — Functionas_integer(x::Kmer)::UnsignedGet the encoded integer representation of kmer x. The returned value is an unsigned integer between UInt8 and UInt128, with the smallest type chosen which can fit the coding bits. Throws an ArgumentError if passed kmers with more than 128 bits.
The value of the encoded representation is an implementation detail, and may change in minor versions of Kmers.jl. However, the value has the following guarantees:
- For a given kmer and version of Kmers.jl, the value is deterministic.
- If the alphabet has unique encodings for each symbol, then two kmers of the same length are guaranteed to have distinct encodings.
- The integer has no more bits than the total number of bits in the encoding of the kmer, and the smallest possible unsigned bitstype integer type is used.
Examples
julia> as_integer(mer"AACT"d)
0x07
julia> as_integer(mer"CT"d) # different K, same encoding
0x07
julia> as_integer(mer"KWPQHVY"a)
0x000b110e05081312
julia> as_integer(mer"VEEKEGVLIKLRK"a)
0x0000001306060b0607130a090b0a010b
julia> Kmers.as_integer(AAKmer{17}("A"^17))
ERROR: ArgumentError: Must have at most 128 bits in encoding
[...]Kmers.from_integer — Functionfrom_integer(T::Type{<:Kmer}, u)::TConstruct a kmer of type T from the unsigned bit-integer u. The value of the returned kmer cannot be relied on, but it is guaranteed that the roundtripping a value produced by as_integer will work, and will result in the same kmer.
T must have at most 128 bits coding bits, and u must have no more than 128 bits. Since not all bits of u may be used when constructing the kmer, different integers may return the same kmer.
Examples
julia> kmer = mer"TGATCGTAGAGTGTA"d;
julia> u = as_integer(kmer); typeof(u)
UInt32
julia> from_integer(typeof(kmer), u) === kmer
trueKmers.pop — Functionpop(kmer::Kmer{A, K})::Kmer{A, K-1}Returns a new kmer with the last symbol of the input kmer removed. Throws an ArgumentError if kmer is empty.
Since the output of this function is a K-1-mer, use of this function in a loop may result in type-instability.
See also: pop_first, push, shift
Examples
julia> pop(mer"TCTGTA"d)
DNA 5-mer:
TCTGT
julia> pop(mer"QPSY"a)
AminoAcid 3-mer:
QPS
julia> pop(mer""a)
ERROR: ArgumentError:
[...]Kmers.pop_first — Functionpop_first(kmer::Kmer{A, K})::Kmer{A, K-1}Returns a new kmer with the first symbol of the input kmer removed. Throws an ArgumentError if kmer is empty.
Since the output of this function is a K-1-mer, use of this function in a loop may result in type-instability.
Examples
julia> pop_first(mer"TCTGTA"d)
DNA 5-mer:
CTGTA
julia> pop_first(mer"QPSY"a)
AminoAcid 3-mer:
PSY
julia> pop_first(mer""a)
ERROR: ArgumentError:
[...]Kmers.push — Functionpush(kmer::Kmer{A, K}, s)::Kmer{A, K+1}Create a new kmer which is the concatenation of kmer and s. Returns a K+1-mer.
Since the output of this function is a K+1-mer, use of this function in a loop may result in type-instability.
See also: push_first, pop, shift
Examples
julia> push(mer"UGCUGA"r, RNA_G)
RNA 7-mer:
UGCUGAG
julia> push(mer"W"a, 'E')
AminoAcid 2-mer:
WEKmers.push_first — Functionpush_first(kmer::Kmer{A, K}, s)::Kmer{A, K+1}Create a new kmer which is the concatenation of s and kmer. Returns a K+1-mer. Similar to push, but places the new symbol s at the front.
Since the output of this function is a K+1-mer, use of this function in a loop may result in type-instability.
Examples
julia> push_first(mer"GCU"r, RNA_G)
RNA 4-mer:
GGCU
julia> push_first(mer"W"a, 'E')
AminoAcid 2-mer:
EWKmers.shift — Functionshift(kmer::Kmer{A, K}, s)::Kmer{A, K}Push symbol onto the end of kmer, and pop the first symbol in kmer. Unlike push, this preserves the input type, and is less likely to result in type instability.
See also: shift_first, push
Examples
julia> shift(mer"TACC"d, DNA_A)
DNA 4-mer:
ACCA
julia> shift(mer"WKYMLPIIRS"aa, 'F')
AminoAcid 10-mer:
KYMLPIIRSFKmers.shift_first — Functionshift_first(kmer::kmer, symbol)::typeof(kmer)Push symbol onto the start of kmer, and pop the last symbol in kmer.
Examples
julia> shift_first(mer"TACC"d, DNA_A)
DNA 4-mer:
ATAC
julia> shift_first(mer"WKYMLPIIRS"aa, 'F')
AminoAcid 10-mer:
FWKYMLPIIR