The Kmer type

The central type of Kmers.jl is the Kmer. A Kmer is an immutable, bitstype BioSequence, with a length known at compile time. Compared to LongSequence in BioSequences.jl, this gives to one advantage, and comes with two disadvantages:

  • Kmers are much faster than LongSequence, as they can be stored in registers.
  • As kmers gets longer, the code gets increasingly inefficient, as the unrolling and inlining of the immutable operations breaks down.
  • Since their length is part of their type, any operation that results in a kmer whose length cannot be determined at compile time will be type unstable. This includes slicing a kmer, pushing and popping it, and other operations.

The Kmer type is (roughly) defined as

struct Kmer{A <: Alphabet, K, N} <: BioSequence{A}
    x::NTuple{N, UInt}
end

Where:

  • A is the Alphabet as defined in BioSequences.jl.
  • K is the length.
  • N is an extra type parameter derived from the first two used for the length of the tuple, which exists only because Julia does not allow computed type parameters.

Given a Kmer{A, K}, the N parameter may be computed using the function derive_type:

julia> derive_type(Kmer{AminoAcidAlphabet, 6})
Kmer{AminoAcidAlphabet, 6, 1}

Construction

Kmers can be constructed from a BioSequence or AbstractString by explicitly specifying the length of the sequence:

julia> Kmer{DNAAlphabet{2}, 5, 1}("TAGCT")
DNA 5-mer:
TAGCT

The final type parameter N can be elided, in which case it will be inferred:

julia> Kmer{DNAAlphabet{2}, 5}("TAGCT")
DNA 5-mer:
TAGCT

Kmers with alphabets DNAAlphabet{2}, RNAAlphabet{2} and AminoAcidAlphabet can be created with the type aliases DNAKmer, RNAKmer and AAKmer:

julia> DNAKmer{3}("tag")
DNA 3-mer:
TAG

julia> AAKmer{5}("PWYSK")
AminoAcid 5-mer:
PWYSK

For kmers with an Alphabet that implement the trait BioSequences.AsciiAlphabet, they can also be constructed from AbstractVector{UInt8}, in which case the vector is interpreted as being bytes of ASCII text:

julia> AAKmer{3}([0x65, 0x67, 0x7a])
AminoAcid 3-mer:
EGZ

When constructing from an AbstractString (or byte vector), uracil (U) and thymine T are treated differently - a U cannot be read as thymine:

julia> DNAKmer{3}("UAG")
ERROR: cannot encode 0x55 (Char 'U') in DNAAlphabet{2}
[...]

However, when constructing from a BioSequence, these nucleotides are considered interchangeable:

julia> RNAKmer{4}(dna"TATC")
RNA 4-mer:
UAUC

Finally, kmers can be constructed with a string literal @mer_str, where the string must be appended with d for DNA, r for RNA, or a for amino acid:

julia> mer"UGCUGA"r
RNA 6-mer:
UGCUGA

julia> mer"EDEHL"a
AminoAcid 5-mer:
EDEHL

Since the literals produce the kmer at parse time and inserts the kmer directly into the abstract syntax tree, this will always be type stable, and the overhead related to parsing the string will not be paid.

In the following example, each iteration takes less than 1 nanosecond, which implies that parsing the string literal mer"AAA"d is not done in the loop at runtime:

julia> function count_aaas(dna)
           x = 0
           for kmer in FwDNAMers{3}(dna)
               # The parsing happens once here, when the
               # code is parsed, and is fine to have in the loop
               x += kmer == mer"AAA"d
           end
           x
       end;

julia> seq = randseq(DNAAlphabet{2}(), 100_000_000);

julia> count_aaas(seq); # compile

julia> @time count_aaas(seq) # about 1ns per iteration
  0.088556 seconds (1 allocation: 16 bytes)
1561361

Indexing

Kmers support most normal indexing, such as scalar indexing:

julia> mer"CAGCU"r[3]
RNA_G

Slicing

julia> mer"AGGCTA"d[2:5]
DNA 4-mer:
GGCT

And indexing with boolean vectors, and vectors of indices:

julia> m = mer"MDGKRY"a;

julia> m[[true, false, true, true, false, true]]
AminoAcid 4-mer:
MGKY

julia> m[[4,2]]
AminoAcid 2-mer:
KD

A note on type stability

Except scalar indexing which always returns a single symbol, all the operations above are type unstable, since the length (and thus type) of the resulting kmer depends on the input value, not its type.

However, type unstable functions may be type-stable, if the indexing value is known at compile time, and the Julia compiler uses constant folding:

julia> f(x) = x[2:5]; # 2:5 is a compile time constant

julia> Test.@inferred f(mer"UCGUAGC"r)
RNA 4-mer:
CGUA

Reference

Kmers.KmerType
Kmer{A<:Alphabet,K,N} <: BioSequence{A}

An immutable bitstype for representing k-mers - short BioSequences of a fixed length K. Since they can be stored directly in registers, Kmers are generally the most efficient type of BioSequence, when K is small and known at compile time.

The N parameter is derived from A and K and is not a free parameter.

See also: DNAKmer, RNAKmer, AAKmer, AbstractKmerIterator

Examples

julia> RNAKmer{5}("ACGUC")
RNA 5-mer:
ACGUC

julia> Kmer{DNAAlphabet{4}, 6}(dna"TGCTTA")
DNA 6-mer:
TGCTTA

julia> AAKmer{5}((lowercase(i) for i in "KLWYR"))
AminoAcid 5-mer:
KLWYR

julia> RNAKmer{3}("UAUC") # wrong length
ERROR:
[...]
source
Kmers.derive_typeFunction
derive_type(::Type{Kmer{A, K}}) -> Type{Kmer{A, K, N}}

Compute the fully parameterized kmer type from only the parameters A and K.

source
Kmers.MerType
Mer{K}

Alias for Kmer{<:Alphabet, K}. Useful to dispatch on K-mers without regard for the alphabat

Example

julia> mer"DEKR"a isa Mer{4}
true

julia> DNAKmer{6}("TGATCA") isa Mer{6}
true

julia> RNACodon <: Mer{3}
true
source
Kmers.@mer_strMacro
@mer_str -> Kmer

Construct a Kmer from the given string. The macro must be used with a flag after the input string, e.g. d in mer"TAG"d or a in mer"PCW"a, signifying the alphabet of the kmer. The flags d = DNAAlphabet{2}, r = RNAAlphabet{2} and a = AminoAcidAlphabet are recognized.

Because the macro is resolved and the kmer is created at parse time, the macro is type stable, and may be used in high performance code.

Examples

julia> mer"UGCUA"r
RNA 5-mer:
UGCUA

julia> mer"YKVSTEDLLKKR"a
AminoAcid 12-mer:
YKVSTEDLLKKR

julia> mer"TATTAGCA"d
DNA 8-mer:
TATTAGCA
source
Kmers.popFunction
pop(kmer::Kmer{A, K})::Kmer{A, K-1}

Returns a new kmer with the last symbol of the input kmer removed. Throws an ArgumentError if kmer is empty.

Warn

Since the output of this function is a K-1-mer, use of this function in a loop may result in type-instability.

See also: pop_first, push, shift

Examples

julia> pop(mer"TCTGTA"d)
DNA 5-mer:
TCTGT

julia> pop(mer"QPSY"a)
AminoAcid 3-mer:
QPS

julia> pop(mer""a)
ERROR: ArgumentError:
[...]
source
Kmers.pop_firstFunction
pop_first(kmer::Kmer{A, K})::Kmer{A, K-1}

Returns a new kmer with the first symbol of the input kmer removed. Throws an ArgumentError if kmer is empty.

Warn

Since the output of this function is a K-1-mer, use of this function in a loop may result in type-instability.

See also: pop, push, shift

Examples

julia> pop_first(mer"TCTGTA"d)
DNA 5-mer:
CTGTA

julia> pop_first(mer"QPSY"a)
AminoAcid 3-mer:
PSY

julia> pop_first(mer""a)
ERROR: ArgumentError:
[...]
source
Kmers.pushFunction
push(kmer::Kmer{A, K}, s)::Kmer{A, K+1}

Create a new kmer which is the concatenation of kmer and s. Returns a K+1-mer.

Warn

Since the output of this function is a K+1-mer, use of this function in a loop may result in type-instability.

See also: push_first, pop, shift

Examples

julia> push(mer"UGCUGA"r, RNA_G)
RNA 7-mer:
UGCUGAG

julia> push(mer"W"a, 'E')
AminoAcid 2-mer:
WE
source
Kmers.push_firstFunction
push_first(kmer::Kmer{A, K}, s)::Kmer{A, K+1}

Create a new kmer which is the concatenation of s and kmer. Returns a K+1-mer. Similar to push, but places the new symbol s at the front.

Warn

Since the output of this function is a K+1-mer, use of this function in a loop may result in type-instability.

See also: push, pop, shift

Examples

julia> push_first(mer"GCU"r, RNA_G)
RNA 4-mer:
GGCU

julia> push_first(mer"W"a, 'E')
AminoAcid 2-mer:
EW
source
Kmers.shiftFunction
shift(kmer::Kmer{A, K}, s)::Kmer{A, K}

Push symbol onto the end of kmer, and pop the first symbol in kmer. Unlike push, this preserves the input type, and is less likely to result in type instability.

See also: shift_first, push

Examples

julia> shift(mer"TACC"d, DNA_A)
DNA 4-mer:
ACCA

julia> shift(mer"WKYMLPIIRS"aa, 'F')
AminoAcid 10-mer:
KYMLPIIRSF
source
Kmers.shift_firstFunction
shift_first(kmer::kmer, symbol)::typeof(kmer)

Push symbol onto the start of kmer, and pop the last symbol in kmer.

See also: shift, push

Examples

julia> shift_first(mer"TACC"d, DNA_A)
DNA 4-mer:
ATAC

julia> shift_first(mer"WKYMLPIIRS"aa, 'F')
AminoAcid 10-mer:
FWKYMLPIIR
source