Building kmer replacements

Kmer replacements is the general term for sequences that can be represented computationally like kmers, but are sampled differently. Examples include minimizers, strobemers, syncmers and k-min-mers.

Since there is no end to the variations of kmer replacements, Kmers.jl does not try to implement all of them. Instead, Kmers.jl implements the base kmer type, and exposes some efficient primitives to allow downstream users to build kmer replacements.

These functions are:

Kmers.unsafe_extractFunction
unsafe_extract(::RecodingScheme, T::Type{<:Kmer}, seq, from::Int) -> T

Extract a Kmer of type T from seq beginning at index from. This function is useful to create kmer or kmer-like types.

This function does not do any bounds checking, so the user must know that from:from+K-1 is inbounds in seq. The validity of the data in the seq is validated by this function.

Examples

julia> seq = b"TAGCTAGA";

julia> Kmers.unsafe_extract(Kmers.AsciiEncode(), DNAKmer{4, 1}, seq, 2)
DNA 4-mer:
AGCT
source
Kmers.shift_encodingFunction
shift_encoding(kmer::T, encoding::UInt) where {T <: Kmer} -> T

Add encoding, a valid encoding in the alphabet of the kmer, to the end of kmer and discarding the first symbol in kmer.

It is the user's responsibility to ensure that encoding is valid.

Examples

julia> enc = UInt(0x0a); # encoding of DNA_Y in 4-bit alphabets

julia> kmer = Kmer{DNAAlphabet{4}, 4}("TAGA");

julia> Kmers.shift_encoding(kmer, enc)
DNA 4-mer:
AGAY
source
Kmers.unsafe_shift_fromFunction
unsafe_shift_from(::RecodingScheme, kmer::T, seq, from::Int, ::Val{S}) -> T

Extract S::Int symbols from sequence seq at positions from:from+S-1, and shift them into kmer.

This function does not do any bounds checking, so it is the user's responsibility to ensure that from is inbounds, and the recoding scheme valid. It is assumed that S < K, where K == length(kmer). If S ≥ K, use unsafe_extract instead.

Examples

julia> seq = dna"TAGCGGA";

julia> kmer = mer"GGTG"d;

julia> Kmers.unsafe_shift_from(Kmers.FourToTwo(), kmer, seq, 3, Val(2))
DNA 4-mer:
TGGC
source

Example: Minimizers

Minimizers are currently the most common kmer replacement. They are defined as the minimum of W consecutive kmers, as ordered by some ordering O.

If we use fx_hash as the ordering function, and assume K and W are known at compile time, we can implement it reasonably efficiently like so:

function unsafe_extract_minimizer(
    seq::LongDNA{2},
    i::Int,
    ::Val{K},
    ::Val{W},
) where {K, W}
    T = derive_type(Kmer{DNAAlphabet{2}, K})
    kmer = Kmers.unsafe_extract(Kmers.Copyable(), T, seq, i)
    hash = fx_hash(kmer)
    for offset in 0:W-2
        new_kmer = Kmers.unsafe_shift_from(Kmers.Copyable(), kmer, seq, i+K+offset, Val(1))
        new_hash = fx_hash(new_kmer)
        if new_hash < hash
            hash = new_hash
            kmer = new_kmer
        end
    end
    kmer
end

rng = Random.Xoshiro(1)
unsafe_extract_minimizer(randseq(rng, DNAAlphabet{2}(), 100), 1, Val(5), Val(9))

# output
DNA 5-mer:
TATCA