Counting k-mers in read datasets

MerCounting provides some more dedicated counting algorithms for counting k-mers in read datasets, particularly ReadDatastores.

This is because as conceptually simple as counting k-mers is, to do it quickly for large sequencing read datasets output by sequencing machines can be difficult for some datasets.

There are many ways you can try to optimise the k-mer counting process, and many k-mer counting tools already exist.

MerCounting provides a Counters submodule, which contains nessecery methods and types required to implement various kinds of k-mer counter, as well as exporting a selection of "off-the-shelf" methods that use different counting strategies, one or several of which hopefully suit the user's dataset and computational resources available. We describe and showcase these below.

The `SerialMemCounter` counter

The SerialMemCounter counter is the simplest k-mer counter for reads that is provided.

It counts k-mers in each read serially, and stores the k-mers and counts entirely in RAM.

KmerAnalysis.SerialMemCounter — Type

SerialMemCounter{M<:AbstractMer,C<:CountMode}

KmerAnalysis' simplest kmer counter.

Build a sorted list (vector) of kmer counts (MerCount), serially and entirely in memory.

Build a SerialMemCounter using the serial_mem method. This returns a SerialMemCounter.

A SerialMemCounter can be treated as a function or functor and passed to other functions as an argument, and it can be called on other arguments.

It is a general purpose kmer counting method:

It basically collects all the kmers from iterating over the input, before sorting the collected kmers and collapsing them into counts. To reduce allocations and GC from many repeated calls of a SerialMemCounter, they keep a few buffers that are used often for kmer collection and count collapsing.

Whilst general purpose for a variety of argument types, it is most suitable for:

Counting the kmers in a small number of long-ish BioSequences (e.g. a reference genome or low(ish)-complexity genome graph.).
Counting the kmers in a ReadDatastore if the dataset size is small enough such that all the kmers collected from every dataset can fit in a single machines memory. If you have a decent MacBook Pro, for example, datasets like E. coli paired end sequencing reads will be no problem.
Other larger ReadDatastores IF you have the RAM to throw at it, e.g. you're on a HPC machine configured for large memory jobs.

It is fairly simple to use. First you need a ReadDatastore, if you need to recall how to create one then head over to here in ReadDatastores.jl's documentation.

The example below opens a datastore before creating a SerialMemCounter with serial_mem, to count the k-mers in the read datastore.

julia> using KmerAnalysis, ReadDatastores

julia> ds = @openreads "ecoli-test-paired.prseq"
Paired Read Datastore 'my-ecoli-test': 20 reads (10 pairs)

julia> c = serial_mem(DNAMer{31}, CANONICAL)
(::KmerAnalysis.SerialMemCounter{BioSequences.Mer{BioSequences.DNAAlphabet{2},31},Canonical}) (generic function with 1 method)

julia> kl = c(ds)
4916-element Array{MerCount{BioSequences.Mer{BioSequences.DNAAlphabet{2},31}},1}:
 AAAAAAAAGTAATCGTCGGCATGTCCGGCGG occurs 1 times
 AAAAAAAGTAATCGTCGGCATGTCCGGCGGT occurs 1 times
 AAAAAAATAAAATTTAGTGCTGTACAGAGCG occurs 1 times
 AAAAAAGTAATCGTCGGCATGTCCGGCGGTG occurs 1 times
 AAAAAATAAAATTTAGTGCTGTACAGAGCGC occurs 1 times
 AAAAAATCCACCAGCAGCAGGTTGATGTCAG occurs 1 times
 AAAAAATCGACCAGGAGCGGTTGGATGACAG occurs 1 times
 AAAAAATCGTTGCGCTTTTTCTCGCCGCCGG occurs 1 times
 AAAAACATCGCGGGTGTTGAAGAAAAATTAA occurs 1 times
 AAAAACCAACGTTTACCGAACGGGTTAATAA occurs 2 times
 ⋮
 TTCGGCGTGCGACCGGCTTTATATTCGGCAA occurs 1 times
 TTCTGGCTGAAGAGGTGGTGGAAATTCCCAA occurs 1 times
 TTCTTGCTTAAGAGGTGGTGGAAATTCCCAA occurs 1 times
 TTGCCATTACCTATGTCTACAAAGGTGACAA occurs 1 times
 TTGCGAACAATTATTCCCCATGCTTCAGCAA occurs 1 times
 TTGCGCTTTTTCTCGCCGCCGGAAAAACCAA occurs 2 times
 TTGGCAGCATCTTCTTTGGTGGTTGCACCAA occurs 1 times
 TTGTCCCAACACATGTGCCAGCCGATAAAAA occurs 1 times
 TTTGATTTTCAGGATTTGATGGAAGAGAAAA occurs 1 times

The `dist_mem` counter

The DistMemCounter counter is a multi-process version of SerialMemCounter.

KmerAnalysis.DistMemCounter — Type

The multi-process parallel version of the SerialMemCounter k-mer counter.

Like the SerialMemCounter this counter does all counting and processing in memory, so it is not designed to limit memory use, but if using a cluster of machines the memory used per worker machine would be less. Of course the data is copied back over to the master process and the results combined and so the master node has to have enough RAM available to hold all the result.

It is used in a very similar fashion to serial_mem, but you need to have multiple julia worker processes available to gain any benefit.

julia> # Spin up some extra julia worker processes, make sure they are using the
       # appropriate Project.toml file for your project.
       using Distributed

julia> addprocs(8, exeflags="--project=../../../docs/")
8-element Array{Int64,1}:
 2
 3
 4
 5
 6
 7
 8
 9

julia> # Make sure the packages are loaded on all the workers.
       @everywhere using KmerAnalysis, ReadDatastores, BioSequences

julia> ds = @openreads "ecoli-test-paired.prseq"
Paired Read Datastore 'my-ecoli-test': 20 reads (10 pairs)

julia> dmc = dist_mem(DNAMer{31}, CANONICAL)
(::KmerAnalysis.DistMemCounter{Mer{DNAAlphabet{2},31},Canonical}) (generic function with 1 method)

julia> kl = dmc(ds)
[ Info: Splitting all reads in ecoli-test-paired.prseq, across 8 worker processes and counting kmers
[ Info: Waiting for workers to finish
ERROR: TaskFailedException:
On worker 2:
MethodError: no method matching serial_mem(::Type{Mer{DNAAlphabet{2},31}}, ::DatastoreBuffer{PairedReads{DNAAlphabet{2}}}, ::Canonical, ::StepRange{UInt64,Int64})
Closest candidates are:
  serial_mem(::Type{M}, !Matched::PairedReads{A}, ::CountMode) where {M, A} at /home/runner/.julia/packages/KmerAnalysis/za454/src/counters/serial_mem.jl:59 (method too new to be called from this world context.)
  serial_mem(::Type{M}, !Matched::C) where {M<:AbstractMer, C<:CountMode} at /home/runner/.julia/packages/KmerAnalysis/za454/src/counters/serial_mem.jl:42 (method too new to be called from this world context.)
_do_serial_mem at /home/runner/.julia/packages/KmerAnalysis/za454/src/counters/dist_mem.jl:5
#108 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/process_messages.jl:294
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/process_messages.jl:79
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/process_messages.jl:294 [inlined]
#107 at ./task.jl:333
Stacktrace:
 [1] #remotecall_fetch#145(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(Distributed.remotecall_fetch), ::Function, ::Distributed.Worker, ::Type, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/remotecall.jl:390
 [2] remotecall_fetch(::Function, ::Distributed.Worker, ::Type, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/remotecall.jl:382
 [3] #remotecall_fetch#148(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(Distributed.remotecall_fetch), ::Function, ::Int64, ::Type, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/remotecall.jl:417
 [4] remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/remotecall.jl:417 [inlined]
 [5] (::KmerAnalysis.var"#1#2"{PairedReads{DNAAlphabet{2}},Mer{DNAAlphabet{2},31},String,Canonical,Array{Array{MerCount{Mer{DNAAlphabet{2},31}},1},1},Int64,Int64})() at ./task.jl:333

...and 7 more exception(s).

Naturally, for such a small and simple example as this, using DistMemCounter for such a small dataset is probably just worse than doing it in memory serially, because of the overheads. For real datasets you should see a benefit.

Counting k-mers in read datasets

The SerialMemCounter counter

The dist_mem counter

The `SerialMemCounter` counter

The `dist_mem` counter