Analysing k-mer count data
Once you have k-mer count data, there are a number of useful analyses that may be done. Let's go through a few together.
K-mer frequency spectra
A k-mer frequency spectra is a representation of sequencing data showing how many k-mers appear a certain number of times. In other words, how many k-mers in the dataset have a count of 1, how many have a count of 10, and so on.
When plotted, the frequency of occurence is plotted on the x-axis, and the number of k-mers on the y-axis.
The k-mer spectra is composed of distributions representing groups of motifs at different frequencies in the sample, plus biases. Given not too many biases, the shape of the distributions provides a useful set of properties describing the biological sample, the sequencing process and the amount of useful data in the dataset.
Let's take a look at a simple spectra. I'm going to use a small artificial dataset of a bacterial-like genome I generated using Pseudoseq.jl.
I first created a random single chromosome genome, and then simulated a paired-end sequencing experiment to generate the kind of raw data we would get from a sequencing experiment.
I'm doing this with a small artificial genome to be kind to the CI infrastructure that builds these docs!!!
But you could follow along and try yourself with paired-end read data from a sequencing experiment from e.coli.
Here I'm going to load in these reads, count the k-mers with the SerialMemCounter k-mer counter as the dataset is not that large, and then compute the 1D frequency spectra.
julia> using KmerAnalysis, ReadDatastores, BioSequences
julia> reads = @openreads "../../../test/fakemicrobe.prseq"
Paired Read Datastore 'fakemicrobe-pe': 83850 reads (41925 pairs)
julia> counter = serial_mem(DNAMer{31}, CANONICAL)
(::KmerAnalysis.SerialMemCounter{BioSequences.Mer{BioSequences.DNAAlphabet{2},31},Canonical}) (generic function with 1 method)
julia> kmer_counts = counter(reads)
1561193-element Array{MerCount{BioSequences.Mer{BioSequences.DNAAlphabet{2},31}},1}:
AAAAAAAAAAAGGTATCATCCTGTAGCTAGT occurs 1 times
AAAAAAAAAAGGTATCATCCTGTAGCTAGTG occurs 1 times
AAAAAAAAAATCTCGGACAGCGCCGGTCAGC occurs 15 times
AAAAAAAAAATCTCTGACAGCGCCGGTCAGC occurs 1 times
AAAAAAAAAATGCGACGTGTCCCAGCAGAGC occurs 16 times
AAAAAAAAAATGCGACGTGTCCGAGCAGAGC occurs 1 times
AAAAAAAAACGCCTAGTAGTACTTTAATGCG occurs 17 times
AAAAAAAAACGCTGGTATTCGTGTTTGGTGG occurs 10 times
AAAAAAAAAGACCAAGAGACATCGTGCTGGG occurs 23 times
AAAAAAAAAGGTATCATCCTGTAGCTAGTGT occurs 1 times
⋮
TTTTAGGAGAAAAGCGCTTGCGCAAGCAAAA occurs 18 times
TTTTCACACAGTCTAAGATTATTCAAGAAAA occurs 25 times
TTTTCAGCAGATCCGCCGATCTCGCTCAAAA occurs 1 times
TTTTCAGCAGATCCGCCGATCTCGCTGAAAA occurs 22 times
TTTTCGTACACCGGGTAAGTAACGTTAAAAA occurs 1 times
TTTTCGTACTCCGGGTAAGTAACGTTAAAAA occurs 16 times
TTTTGACTCTCACTACCTAGGCATCTAAAAA occurs 17 times
TTTTGAGGTGCAAGCCGGAACAAAAACAAAA occurs 22 times
TTTTGCTGATCACGTGAGACGATGCACAAAA occurs 20 times
julia> my_spectra = spectra(kmer_counts)
Count histogram of motifs appearing more than 0 timesI could have also created the counter and passed it to the spectra method along with the reads like so:
julia> my_spectra = spectra(counter, reads)
Count histogram of motifs appearing more than 0 timesKmer count projection
KmerAnalysis includes a k-mer count container type called IndexedCounts. Indexed counts are different to a vector or other container of MerCount, in that IndexedCounts store k-mer counts from many sources (often different read datasets) indexed against a reference dataset (usually a genome graph or set of contigs).