Hashing
The values of hashes are guaranteed to be reproducible for a given version of Kmers.jl and Julia, but may change in new minor versions of Julia or Kmers.jl.
Kmers.jl implements Base.hash, yielding a UInt value:
julia> hash(mer"UGCUGUAC"r)
0xe5057d38c8907b22
The implementation of Base.hash for kmers strikes a compromise between hash quality and speed: it aims to be a high-quality (non-cryptographic) hash while remaining reasonably fast. While hash collisions can easily be found deliberately, they are unlikely to occur at random. When kmers are of the same (or compatible) alphabets, different kmers hash to different values (barring the occasional hash collision), even when they have the same underlying bit pattern:
julia> using BioSequences: encoded_data
julia> a = mer"TAG"d; b = mer"AAAAAAATAG"d;
julia> encoded_data(a) === encoded_data(b)
true
julia> hash(a) == hash(b)
false
When kmers are of compatible alphabets and have the same content, they hash to the same value. Currently, only DNA and RNA of the alphabets DNAAlphabet and RNAAlphabet are compatible:
julia> a = mer"UUGU"r; b = mer"TTGT"d;
julia> a == b # equal
true
julia> a === b # not egal
false
julia> hash(a) === hash(b)
true
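Because equal kmers of compatible alphabets share a hash, they should behave as one and the same key in hash-based containers. The following is a minimal sketch of that consequence, assuming Kmers.jl is installed and that kmers use the default isequal fallback to ==; the variable names are illustrative only.

```julia
using Kmers

rna = mer"UUGU"r
dna = mer"TTGT"d

# Count occurrences keyed by kmer. The key type is left as Any so that
# DNA and RNA kmers can share a single dictionary.
counts = Dict{Any,Int}()
for kmer in (rna, dna, rna)
    counts[kmer] = get(counts, kmer, 0) + 1
end

n_keys = length(counts)   # number of distinct keys actually stored
n_hits = counts[dna]      # a lookup by either kmer reaches the same entry
```

If the DNA and RNA kmers were meant to be counted separately, the keys would need to carry the alphabet explicitly, for example as a tuple of the kmer's type and the kmer.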
For some applications, fast hashing is absolutely crucial. For these cases, Kmers.jl provides fx_hash, which trades off hash quality for speed:
Kmers.fx_hash (Function)
fx_hash(x, [h::UInt])::UInt
An implementation of FxHash. This hash function is extremely fast, but the hashes are of poor quality compared to Julia's default MurmurHash3. In particular:
- The hash function does not have a good avalanche effect, e.g. the lower bits of the result depend only on the top few bits of the input
- The bitpattern zero hashes to zero
However, for many applications, FxHash is good enough, if the cost of the higher rate of hash collisions is offset by the faster speed.
The precise hash value of a given kmer is not guaranteed to be stable across minor releases of Kmers.jl, but is guaranteed to be stable across minor versions of Julia.
Examples
julia> x = fx_hash(mer"KWQLDE"a);
julia> y = fx_hash(mer"KWQLDE"a, UInt(1));
julia> x isa UInt
true
julia> x == y
false
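To make the quality caveats above concrete, here is a minimal sketch of an FxHash-style round in plain Julia, using only Base. It is not Kmers.jl's implementation: the names FX_SEED, fx_round and fx_sketch are hypothetical, and the constant and the rotate-xor-multiply structure are those of the widely used 64-bit Rust variant of FxHash, assumed here for illustration.

```julia
# Multiplicative constant of the 64-bit Rust FxHash variant
# (assumption: Kmers.jl may use a different constant or structure).
const FX_SEED = 0x517cc1b727220a95

# One round: rotate the running state, xor in a word, multiply.
# Unsigned multiplication in Julia wraps modulo 2^64.
fx_round(h::UInt64, x::UInt64) = (bitrotate(h, 5) ⊻ x) * FX_SEED

# Hash a sequence of 64-bit words, starting from the state `h`.
function fx_sketch(words, h::UInt64=zero(UInt64))
    for w in words
        h = fx_round(h, UInt64(w))
    end
    return h
end

x = 0x0123456789abcdef

# An all-zero input with a zero initial state stays zero.
zero_hash = fx_sketch((UInt64(0),))

# Flip a single input bit and count how many output bits change.
a = fx_sketch((x,))
b = fx_sketch((x ⊻ (UInt64(1) << 63),))
flipped_bits = count_ones(a ⊻ b)

# A different initial state gives a different hash for the same input.
seeded_differs = fx_sketch((x,), UInt64(1)) != a
```

In this sketch, flipping the highest input bit changes exactly one output bit, whereas a hash with a good avalanche effect would be expected to change about half of them. This is the kind of weakness that fx_hash accepts in exchange for doing very little work per word.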