The Kmer
type
The central type of Kmers.jl is the Kmer
. A Kmer
is an immutable, bitstype BioSequence
, with a length known at compile time. Compared to LongSequence
in BioSequences.jl, this gives to one advantage, and comes with two disadvantages:
- Kmers are much faster than
LongSequence
, as they can be stored in registers. - As kmers gets longer, the code gets increasingly inefficient, as the unrolling and inlining of the immutable operations breaks down.
- Since their length is part of their type, any operation that results in a kmer whose length cannot be determined at compile time will be type unstable. This includes slicing a kmer, pushing and popping it, and other operations.
The Kmer
type is (roughly) defined as
struct Kmer{A <: Alphabet, K, N} <: BioSequence{A}
x::NTuple{N, UInt}
end
Where:
A
is theAlphabet
as defined in BioSequences.jl.K
is the length.N
is an extra type parameter derived from the first two used for the length of the tuple, which exists only because Julia does not allow computed type parameters.
Given a Kmer{A, K}
, the N
parameter may be computed using the function derive_type
:
julia> derive_type(Kmer{AminoAcidAlphabet, 6})
Kmer{AminoAcidAlphabet, 6, 1}
Construction
Kmers can be constructed from a BioSequence
or AbstractString
by explicitly specifying the length of the sequence:
julia> Kmer{DNAAlphabet{2}, 5, 1}("TAGCT")
DNA 5-mer:
TAGCT
The final type parameter N
can be elided, in which case it will be inferred:
julia> Kmer{DNAAlphabet{2}, 5}("TAGCT")
DNA 5-mer:
TAGCT
Kmers with alphabets DNAAlphabet{2}
, RNAAlphabet{2}
and AminoAcidAlphabet
can be created with the type aliases DNAKmer
, RNAKmer
and AAKmer
:
julia> DNAKmer{3}("tag")
DNA 3-mer:
TAG
julia> AAKmer{5}("PWYSK")
AminoAcid 5-mer:
PWYSK
For kmers with an Alphabet
that implement the trait BioSequences.AsciiAlphabet
, they can also be constructed from AbstractVector{UInt8}
, in which case the vector is interpreted as being bytes of ASCII text:
julia> AAKmer{3}([0x65, 0x67, 0x7a])
AminoAcid 3-mer:
EGZ
When constructing from an AbstractString
(or byte vector), uracil (U
) and thymine T
are treated differently - a U
cannot be read as thymine:
julia> DNAKmer{3}("UAG")
ERROR: cannot encode 0x55 (Char 'U') in DNAAlphabet{2}
[...]
However, when constructing from a BioSequence
, these nucleotides are considered interchangeable:
julia> RNAKmer{4}(dna"TATC")
RNA 4-mer:
UAUC
Finally, kmers can be constructed with a string literal @mer_str
, where the string must be appended with d
for DNA, r
for RNA, or a
for amino acid:
julia> mer"UGCUGA"r
RNA 6-mer:
UGCUGA
julia> mer"EDEHL"a
AminoAcid 5-mer:
EDEHL
Since the literals produce the kmer at parse time and inserts the kmer directly into the abstract syntax tree, this will always be type stable, and the overhead related to parsing the string will not be paid.
In the following example, each iteration takes less than 1 nanosecond, which implies that parsing the string literal mer"AAA"d
is not done in the loop at runtime:
julia> function count_aaas(dna)
x = 0
for kmer in FwDNAMers{3}(dna)
# The parsing happens once here, when the
# code is parsed, and is fine to have in the loop
x += kmer == mer"AAA"d
end
x
end;
julia> seq = randseq(DNAAlphabet{2}(), 100_000_000);
julia> count_aaas(seq); # compile
julia> @time count_aaas(seq) # about 1ns per iteration
0.088556 seconds (1 allocation: 16 bytes)
1561361
Indexing
Kmers support most normal indexing, such as scalar indexing:
julia> mer"CAGCU"r[3]
RNA_G
Slicing
julia> mer"AGGCTA"d[2:5]
DNA 4-mer:
GGCT
And indexing with boolean vectors, and vectors of indices:
julia> m = mer"MDGKRY"a;
julia> m[[true, false, true, true, false, true]]
AminoAcid 4-mer:
MGKY
julia> m[[4,2]]
AminoAcid 2-mer:
KD
A note on type stability
Except scalar indexing which always returns a single symbol, all the operations above are type unstable, since the length (and thus type) of the resulting kmer depends on the input value, not its type.
However, type unstable functions may be type-stable, if the indexing value is known at compile time, and the Julia compiler uses constant folding:
julia> f(x) = x[2:5]; # 2:5 is a compile time constant
julia> Test.@inferred f(mer"UCGUAGC"r)
RNA 4-mer:
CGUA
Reference
Kmers.Kmer
— TypeKmer{A<:Alphabet,K,N} <: BioSequence{A}
An immutable bitstype for representing k-mers - short BioSequences
of a fixed length K
. Since they can be stored directly in registers, Kmer
s are generally the most efficient type of BioSequence
, when K
is small and known at compile time.
The N
parameter is derived from A
and K
and is not a free parameter.
See also: DNAKmer
, RNAKmer
, AAKmer
, AbstractKmerIterator
Examples
julia> RNAKmer{5}("ACGUC")
RNA 5-mer:
ACGUC
julia> Kmer{DNAAlphabet{4}, 6}(dna"TGCTTA")
DNA 6-mer:
TGCTTA
julia> AAKmer{5}((lowercase(i) for i in "KLWYR"))
AminoAcid 5-mer:
KLWYR
julia> RNAKmer{3}("UAUC") # wrong length
ERROR:
[...]
Kmers.derive_type
— Functionderive_type(::Type{Kmer{A, K}}) -> Type{Kmer{A, K, N}}
Compute the fully parameterized kmer type from only the parameters A
and K
.
Kmers.Mer
— TypeMer{K}
Alias for Kmer{<:Alphabet, K}
. Useful to dispatch on K-mers
without regard for the alphabat
Example
julia> mer"DEKR"a isa Mer{4}
true
julia> DNAKmer{6}("TGATCA") isa Mer{6}
true
julia> RNACodon <: Mer{3}
true
Kmers.@mer_str
— Macro@mer_str -> Kmer
Construct a Kmer
from the given string. The macro must be used with a flag after the input string, e.g. d
in mer"TAG"d
or a
in mer"PCW"a
, signifying the alphabet of the kmer. The flags d = DNAAlphabet{2}
, r = RNAAlphabet{2}
and a = AminoAcidAlphabet
are recognized.
Because the macro is resolved and the kmer is created at parse time, the macro is type stable, and may be used in high performance code.
Examples
julia> mer"UGCUA"r
RNA 5-mer:
UGCUA
julia> mer"YKVSTEDLLKKR"a
AminoAcid 12-mer:
YKVSTEDLLKKR
julia> mer"TATTAGCA"d
DNA 8-mer:
TATTAGCA
Kmers.DNAKmer
— TypeAlias for Kmer{DNAAlphabet{2},K,N}
Kmers.RNAKmer
— TypeAlias for Kmer{RNAAlphabet{2},K,N}
Kmers.AAKmer
— TypeAlias for Kmer{AminoAcidAlphabet,K,N}
Kmers.DNACodon
— TypeAlias for DNAKmer{3,1}
Kmers.RNACodon
— TypeAlias for RNAKmer{3,1}
Kmers.pop
— Functionpop(kmer::Kmer{A, K})::Kmer{A, K-1}
Returns a new kmer with the last symbol of the input kmer
removed. Throws an ArgumentError
if kmer
is empty.
Since the output of this function is a K-1
-mer, use of this function in a loop may result in type-instability.
See also: pop_first
, push
, shift
Examples
julia> pop(mer"TCTGTA"d)
DNA 5-mer:
TCTGT
julia> pop(mer"QPSY"a)
AminoAcid 3-mer:
QPS
julia> pop(mer""a)
ERROR: ArgumentError:
[...]
Kmers.pop_first
— Functionpop_first(kmer::Kmer{A, K})::Kmer{A, K-1}
Returns a new kmer with the first symbol of the input kmer
removed. Throws an ArgumentError
if kmer
is empty.
Since the output of this function is a K-1
-mer, use of this function in a loop may result in type-instability.
Examples
julia> pop_first(mer"TCTGTA"d)
DNA 5-mer:
CTGTA
julia> pop_first(mer"QPSY"a)
AminoAcid 3-mer:
PSY
julia> pop_first(mer""a)
ERROR: ArgumentError:
[...]
Kmers.push
— Functionpush(kmer::Kmer{A, K}, s)::Kmer{A, K+1}
Create a new kmer which is the concatenation of kmer
and s
. Returns a K+1
-mer.
Since the output of this function is a K+1
-mer, use of this function in a loop may result in type-instability.
See also: push_first
, pop
, shift
Examples
julia> push(mer"UGCUGA"r, RNA_G)
RNA 7-mer:
UGCUGAG
julia> push(mer"W"a, 'E')
AminoAcid 2-mer:
WE
Kmers.push_first
— Functionpush_first(kmer::Kmer{A, K}, s)::Kmer{A, K+1}
Create a new kmer which is the concatenation of s
and kmer
. Returns a K+1
-mer. Similar to push
, but places the new symbol s
at the front.
Since the output of this function is a K+1
-mer, use of this function in a loop may result in type-instability.
Examples
julia> push_first(mer"GCU"r, RNA_G)
RNA 4-mer:
GGCU
julia> push_first(mer"W"a, 'E')
AminoAcid 2-mer:
EW
Kmers.shift
— Functionshift(kmer::Kmer{A, K}, s)::Kmer{A, K}
Push symbol
onto the end of kmer
, and pop the first symbol in kmer
. Unlike push
, this preserves the input type, and is less likely to result in type instability.
See also: shift_first
, push
Examples
julia> shift(mer"TACC"d, DNA_A)
DNA 4-mer:
ACCA
julia> shift(mer"WKYMLPIIRS"aa, 'F')
AminoAcid 10-mer:
KYMLPIIRSF
Kmers.shift_first
— Functionshift_first(kmer::kmer, symbol)::typeof(kmer)
Push symbol
onto the start of kmer
, and pop the last symbol in kmer
.
Examples
julia> shift_first(mer"TACC"d, DNA_A)
DNA 4-mer:
ATAC
julia> shift_first(mer"WKYMLPIIRS"aa, 'F')
AminoAcid 10-mer:
FWKYMLPIIR