Kmer composition
In metagenomics, sequences are often summarized by counting the occurrence of all k-mers of a given length in a sequence. For example, for K=4, there are 4^4 = 256 possible DNA 4-mers. If these counts are ordered, the composition can be represented by a length 256 vector.
Vector similarity operations (e.g. cosine distance) can then be used as an approximate proxy for phylogenetic distance.
In the example below, we exploit that:
- A
DNAKmer{4}
's data is a single-element tuple, which stores the sequence in the 8 lower bits. - The
encoded_data
function will return this tuple.
using BioSequences, FASTX, Kmers
using BioSequences: encoded_data
function composition(record::FASTARecord)
counts = zeros(UInt32, 256)
frequencies = zeros(Float32, 256)
for kmer in FwDNAMers{4}(sequence(record))
@inbounds counts[only(encoded_data(kmer)) + 1] += 1
end
factor = 1 / sum(counts; init=zero(eltype(counts)))
for i in eachindex(counts, frequencies)
frequencies[i] = counts[i] * factor
end
frequencies
end
# Make two FASTA records - could be from an assembly
recs = [FASTARecord(string(i), randdnaseq(10000)) for i in "AB"]
# Compute the 2-norm difference and verify it's in [0, 2].
(comp_a, comp_b) = map(composition, recs)
comp_distance = sum((comp_a .- comp_b).^2)
println(0.0 ≤ comp_distance ≤ 2.0)