Building kmer replacements
Kmer replacements is the general term for sequences that can be represented computationally like kmers, but are sampled differently. Examples include minimizers, strobemers, syncmers and k-min-mers.
Since there is no end to the variations of kmer replacements, Kmers.jl does not try to implement all of them. Instead, Kmers.jl implements the base kmer type, and exposes some efficient primitives to allow downstream users to build kmer replacements.
These functions are:
Kmers.unsafe_extract — Functionunsafe_extract(::RecodingScheme, T::Type{<:Kmer}, seq, from::Int) -> TExtract a Kmer of type T from seq beginning at index from. This function is useful to create kmer or kmer-like types.
This function does not do any bounds checking, so the user must know that from:from+K-1 is inbounds in seq. The validity of the data in the seq is validated by this function.
Examples
julia> seq = b"TAGCTAGA";
julia> Kmers.unsafe_extract(Kmers.AsciiEncode(), DNAKmer{4, 1}, seq, 2)
DNA 4-mer:
AGCTKmers.shift_encoding — Functionshift_encoding(kmer::T, encoding::UInt) where {T <: Kmer} -> TAdd encoding, a valid encoding in the alphabet of the kmer, to the end of kmer and discarding the first symbol in kmer.
It is the user's responsibility to ensure that encoding is valid.
Examples
julia> enc = UInt(0x0a); # encoding of DNA_Y in 4-bit alphabets
julia> kmer = Kmer{DNAAlphabet{4}, 4}("TAGA");
julia> Kmers.shift_encoding(kmer, enc)
DNA 4-mer:
AGAYKmers.unsafe_shift_from — Functionunsafe_shift_from(::RecodingScheme, kmer::T, seq, from::Int, ::Val{S}) -> TExtract S::Int symbols from sequence seq at positions from:from+S-1, and shift them into kmer.
This function does not do any bounds checking, so it is the user's responsibility to ensure that from is inbounds, and the recoding scheme valid. It is assumed that S < K, where K == length(kmer). If S ≥ K, use unsafe_extract instead.
Examples
julia> seq = dna"TAGCGGA";
julia> kmer = mer"GGTG"d;
julia> Kmers.unsafe_shift_from(Kmers.FourToTwo(), kmer, seq, 3, Val(2))
DNA 4-mer:
TGGCExample: Minimizers
Minimizers are currently the most common kmer replacement. They are defined as the minimum of W consecutive kmers, as ordered by some ordering O.
If we use fx_hash as the ordering function, and assume K and W are known at compile time, we can implement it reasonably efficiently like so:
function unsafe_extract_minimizer(
seq::LongDNA{2},
i::Int,
::Val{K},
::Val{W},
) where {K, W}
T = derive_type(Kmer{DNAAlphabet{2}, K})
kmer = Kmers.unsafe_extract(Kmers.Copyable(), T, seq, i)
hash = fx_hash(kmer)
for offset in 0:W-2
new_kmer = Kmers.unsafe_shift_from(Kmers.Copyable(), kmer, seq, i+K+offset, Val(1))
new_hash = fx_hash(new_kmer)
if new_hash < hash
hash = new_hash
kmer = new_kmer
end
end
kmer
end
rng = Random.Xoshiro(1)
unsafe_extract_minimizer(randseq(rng, DNAAlphabet{2}(), 100), 1, Val(5), Val(9))
# output
DNA 5-mer:
TATCA