Analysing k-mer count data

Once you have k-mer count data, there are a number of useful analyses that may be done. Let's go through a few together.

K-mer frequency spectra

A k-mer frequency spectra is a representation of sequencing data showing how many k-mers appear a certain number of times. In other words, how many k-mers in the dataset have a count of 1, how many have a count of 10, and so on.

When plotted, the frequency of occurence is plotted on the x-axis, and the number of k-mers on the y-axis.

The k-mer spectra is composed of distributions representing groups of motifs at different frequencies in the sample, plus biases. Given not too many biases, the shape of the distributions provides a useful set of properties describing the biological sample, the sequencing process and the amount of useful data in the dataset.

Let's take a look at a simple spectra. I'm going to use a small artificial dataset of a bacterial-like genome I generated using Pseudoseq.jl.

I first created a random single chromosome genome, and then simulated a paired-end sequencing experiment to generate the kind of raw data we would get from a sequencing experiment.

Note

I'm doing this with a small artificial genome to be kind to the CI infrastructure that builds these docs!!!

But you could follow along and try yourself with paired-end read data from a sequencing experiment from e.coli.

Here I'm going to load in these reads, count the k-mers with the SerialMemCounter k-mer counter as the dataset is not that large, and then compute the 1D frequency spectra.

julia> using KmerAnalysis, ReadDatastores, BioSequences

julia> reads = @openreads "../../../test/fakemicrobe.prseq"
Paired Read Datastore 'fakemicrobe-pe': 83850 reads (41925 pairs)

julia> counter = serial_mem(DNAMer{31}, CANONICAL)
(::KmerAnalysis.SerialMemCounter{BioSequences.Mer{BioSequences.DNAAlphabet{2},31},Canonical}) (generic function with 1 method)

julia> kmer_counts = counter(reads)
1561193-element Array{MerCount{BioSequences.Mer{BioSequences.DNAAlphabet{2},31}},1}:
 AAAAAAAAAAAGGTATCATCCTGTAGCTAGT occurs 1 times
 AAAAAAAAAAGGTATCATCCTGTAGCTAGTG occurs 1 times
 AAAAAAAAAATCTCGGACAGCGCCGGTCAGC occurs 15 times
 AAAAAAAAAATCTCTGACAGCGCCGGTCAGC occurs 1 times
 AAAAAAAAAATGCGACGTGTCCCAGCAGAGC occurs 16 times
 AAAAAAAAAATGCGACGTGTCCGAGCAGAGC occurs 1 times
 AAAAAAAAACGCCTAGTAGTACTTTAATGCG occurs 17 times
 AAAAAAAAACGCTGGTATTCGTGTTTGGTGG occurs 10 times
 AAAAAAAAAGACCAAGAGACATCGTGCTGGG occurs 23 times
 AAAAAAAAAGGTATCATCCTGTAGCTAGTGT occurs 1 times
 ⋮
 TTTTAGGAGAAAAGCGCTTGCGCAAGCAAAA occurs 18 times
 TTTTCACACAGTCTAAGATTATTCAAGAAAA occurs 25 times
 TTTTCAGCAGATCCGCCGATCTCGCTCAAAA occurs 1 times
 TTTTCAGCAGATCCGCCGATCTCGCTGAAAA occurs 22 times
 TTTTCGTACACCGGGTAAGTAACGTTAAAAA occurs 1 times
 TTTTCGTACTCCGGGTAAGTAACGTTAAAAA occurs 16 times
 TTTTGACTCTCACTACCTAGGCATCTAAAAA occurs 17 times
 TTTTGAGGTGCAAGCCGGAACAAAAACAAAA occurs 22 times
 TTTTGCTGATCACGTGAGACGATGCACAAAA occurs 20 times

julia> my_spectra = spectra(kmer_counts)
Count histogram of motifs appearing more than 0 times

I could have also created the counter and passed it to the spectra method along with the reads like so:

julia> my_spectra = spectra(counter, reads)
Count histogram of motifs appearing more than 0 times

Kmer count projection

KmerAnalysis includes a k-mer count container type called IndexedCounts. Indexed counts are different to a vector or other container of MerCount, in that IndexedCounts store k-mer counts from many sources (often different read datasets) indexed against a reference dataset (usually a genome graph or set of contigs).