Building kmer replacements
Kmer replacements is the general term for sequences that can be represented computationally like kmers, but are sampled differently. Examples include minimizers, strobemers, syncmers and k-min-mers.
Since there is no end to the variations of kmer replacements, Kmers.jl does not try to implement all of them. Instead, Kmers.jl implements the base kmer type, and exposes some efficient primitives to allow downstream users to build kmer replacements.
These functions are:
Kmers.unsafe_extract
— Functionunsafe_extract(::RecodingScheme, T::Type{<:Kmer}, seq, from::Int) -> T
Extract a Kmer
of type T
from seq
beginning at index from
. This function is useful to create kmer or kmer-like types.
This function does not do any bounds checking, so the user must know that from:from+K-1
is inbounds in seq
. The validity of the data in the seq
is validated by this function.
Examples
julia> seq = b"TAGCTAGA";
julia> Kmers.unsafe_extract(Kmers.AsciiEncode(), DNAKmer{4, 1}, seq, 2)
DNA 4-mer:
AGCT
Kmers.shift_encoding
— Functionshift_encoding(kmer::T, encoding::UInt) where {T <: Kmer} -> T
Add encoding
, a valid encoding in the alphabet of the kmer
, to the end of kmer
and discarding the first symbol in kmer
.
It is the user's responsibility to ensure that encoding
is valid.
Examples
julia> enc = UInt(0x0a); # encoding of DNA_Y in 4-bit alphabets
julia> kmer = Kmer{DNAAlphabet{4}, 4}("TAGA");
julia> Kmers.shift_encoding(kmer, enc)
DNA 4-mer:
AGAY
Kmers.unsafe_shift_from
— Functionunsafe_shift_from(::RecodingScheme, kmer::T, seq, from::Int, ::Val{S}) -> T
Extract S::Int
symbols from sequence seq
at positions from:from+S-1
, and shift them into kmer
.
This function does not do any bounds checking, so it is the user's responsibility to ensure that from
is inbounds, and the recoding scheme valid. It is assumed that S < K
, where K == length(kmer)
. If S ≥ K
, use unsafe_extract
instead.
Examples
julia> seq = dna"TAGCGGA";
julia> kmer = mer"GGTG"d;
julia> Kmers.unsafe_shift_from(Kmers.FourToTwo(), kmer, seq, 3, Val(2))
DNA 4-mer:
TGGC
Example: Minimizers
Minimizers are currently the most common kmer replacement. They are defined as the minimum of W consecutive kmers, as ordered by some ordering O.
If we use fx_hash
as the ordering function, and assume K and W are known at compile time, we can implement it reasonably efficiently like so:
function unsafe_extract_minimizer(
seq::LongDNA{2},
i::Int,
::Val{K},
::Val{W},
) where {K, W}
T = derive_type(Kmer{DNAAlphabet{2}, K})
kmer = Kmers.unsafe_extract(Kmers.Copyable(), T, seq, i)
hash = fx_hash(kmer)
for offset in 0:W-2
new_kmer = Kmers.unsafe_shift_from(Kmers.Copyable(), kmer, seq, i+K+offset, Val(1))
new_hash = fx_hash(new_kmer)
if new_hash < hash
hash = new_hash
kmer = new_kmer
end
end
kmer
end
rng = Random.Xoshiro(1)
unsafe_extract_minimizer(randseq(rng, DNAAlphabet{2}(), 100), 1, Val(5), Val(9))
# output
DNA 5-mer:
TATCA