Counting k-mers in read datasets
MerCounting provides some more dedicated counting algorithms for counting k-mers in read datasets, particularly ReadDatastores.
This is because as conceptually simple as counting k-mers is, to do it quickly for large sequencing read datasets output by sequencing machines can be difficult for some datasets.
There are many ways you can try to optimise the k-mer counting process, and many k-mer counting tools already exist.
MerCounting provides a Counters submodule, which contains nessecery methods and types required to implement various kinds of k-mer counter, as well as exporting a selection of "off-the-shelf" methods that use different counting strategies, one or several of which hopefully suit the user's dataset and computational resources available. We describe and showcase these below.
The SerialMemCounter counter
The SerialMemCounter counter is the simplest k-mer counter for reads that is provided.
It counts k-mers in each read serially, and stores the k-mers and counts entirely in RAM.
KmerAnalysis.SerialMemCounter — TypeSerialMemCounter{M<:AbstractMer,C<:CountMode}KmerAnalysis' simplest kmer counter.
Build a sorted list (vector) of kmer counts (MerCount), serially and entirely in memory.
Build a SerialMemCounter using the serial_mem method. This returns a SerialMemCounter.
A SerialMemCounter can be treated as a function or functor and passed to other functions as an argument, and it can be called on other arguments.
It is a general purpose kmer counting method:
It basically collects all the kmers from iterating over the input, before sorting the collected kmers and collapsing them into counts. To reduce allocations and GC from many repeated calls of a SerialMemCounter, they keep a few buffers that are used often for kmer collection and count collapsing.
Whilst general purpose for a variety of argument types, it is most suitable for:
Counting the kmers in a small number of long-ish
BioSequences(e.g. a reference genome or low(ish)-complexity genome graph.).Counting the kmers in a ReadDatastore if the dataset size is small enough such that all the kmers collected from every dataset can fit in a single machines memory. If you have a decent MacBook Pro, for example, datasets like E. coli paired end sequencing reads will be no problem.
Other larger ReadDatastores IF you have the RAM to throw at it, e.g. you're on a HPC machine configured for large memory jobs.
It is fairly simple to use. First you need a ReadDatastore, if you need to recall how to create one then head over to here in ReadDatastores.jl's documentation.
The example below opens a datastore before creating a SerialMemCounter with serial_mem, to count the k-mers in the read datastore.
julia> using KmerAnalysis, ReadDatastores
julia> ds = @openreads "ecoli-test-paired.prseq"
Paired Read Datastore 'my-ecoli-test': 20 reads (10 pairs)
julia> c = serial_mem(DNAMer{31}, CANONICAL)
(::KmerAnalysis.SerialMemCounter{BioSequences.Mer{BioSequences.DNAAlphabet{2},31},Canonical}) (generic function with 1 method)
julia> kl = c(ds)
4916-element Array{MerCount{BioSequences.Mer{BioSequences.DNAAlphabet{2},31}},1}:
AAAAAAAAGTAATCGTCGGCATGTCCGGCGG occurs 1 times
AAAAAAAGTAATCGTCGGCATGTCCGGCGGT occurs 1 times
AAAAAAATAAAATTTAGTGCTGTACAGAGCG occurs 1 times
AAAAAAGTAATCGTCGGCATGTCCGGCGGTG occurs 1 times
AAAAAATAAAATTTAGTGCTGTACAGAGCGC occurs 1 times
AAAAAATCCACCAGCAGCAGGTTGATGTCAG occurs 1 times
AAAAAATCGACCAGGAGCGGTTGGATGACAG occurs 1 times
AAAAAATCGTTGCGCTTTTTCTCGCCGCCGG occurs 1 times
AAAAACATCGCGGGTGTTGAAGAAAAATTAA occurs 1 times
AAAAACCAACGTTTACCGAACGGGTTAATAA occurs 2 times
⋮
TTCGGCGTGCGACCGGCTTTATATTCGGCAA occurs 1 times
TTCTGGCTGAAGAGGTGGTGGAAATTCCCAA occurs 1 times
TTCTTGCTTAAGAGGTGGTGGAAATTCCCAA occurs 1 times
TTGCCATTACCTATGTCTACAAAGGTGACAA occurs 1 times
TTGCGAACAATTATTCCCCATGCTTCAGCAA occurs 1 times
TTGCGCTTTTTCTCGCCGCCGGAAAAACCAA occurs 2 times
TTGGCAGCATCTTCTTTGGTGGTTGCACCAA occurs 1 times
TTGTCCCAACACATGTGCCAGCCGATAAAAA occurs 1 times
TTTGATTTTCAGGATTTGATGGAAGAGAAAA occurs 1 timesThe dist_mem counter
The DistMemCounter counter is a multi-process version of SerialMemCounter.
KmerAnalysis.DistMemCounter — TypeThe multi-process parallel version of the SerialMemCounter k-mer counter.
Like the SerialMemCounter this counter does all counting and processing in memory, so it is not designed to limit memory use, but if using a cluster of machines the memory used per worker machine would be less. Of course the data is copied back over to the master process and the results combined and so the master node has to have enough RAM available to hold all the result.
It is used in a very similar fashion to serial_mem, but you need to have multiple julia worker processes available to gain any benefit.
julia> # Spin up some extra julia worker processes, make sure they are using the
# appropriate Project.toml file for your project.
using Distributed
julia> addprocs(8, exeflags="--project=../../../docs/")
8-element Array{Int64,1}:
2
3
4
5
6
7
8
9
julia> # Make sure the packages are loaded on all the workers.
@everywhere using KmerAnalysis, ReadDatastores, BioSequences
julia> ds = @openreads "ecoli-test-paired.prseq"
Paired Read Datastore 'my-ecoli-test': 20 reads (10 pairs)
julia> dmc = dist_mem(DNAMer{31}, CANONICAL)
(::KmerAnalysis.DistMemCounter{Mer{DNAAlphabet{2},31},Canonical}) (generic function with 1 method)
julia> kl = dmc(ds)
[ Info: Splitting all reads in ecoli-test-paired.prseq, across 8 worker processes and counting kmers
[ Info: Waiting for workers to finish
ERROR: TaskFailedException:
On worker 2:
MethodError: no method matching serial_mem(::Type{Mer{DNAAlphabet{2},31}}, ::DatastoreBuffer{PairedReads{DNAAlphabet{2}}}, ::Canonical, ::StepRange{UInt64,Int64})
Closest candidates are:
serial_mem(::Type{M}, !Matched::PairedReads{A}, ::CountMode) where {M, A} at /home/runner/.julia/packages/KmerAnalysis/za454/src/counters/serial_mem.jl:59 (method too new to be called from this world context.)
serial_mem(::Type{M}, !Matched::C) where {M<:AbstractMer, C<:CountMode} at /home/runner/.julia/packages/KmerAnalysis/za454/src/counters/serial_mem.jl:42 (method too new to be called from this world context.)
_do_serial_mem at /home/runner/.julia/packages/KmerAnalysis/za454/src/counters/dist_mem.jl:5
#108 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/process_messages.jl:294
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/process_messages.jl:79
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/process_messages.jl:294 [inlined]
#107 at ./task.jl:333
Stacktrace:
[1] #remotecall_fetch#145(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(Distributed.remotecall_fetch), ::Function, ::Distributed.Worker, ::Type, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/remotecall.jl:390
[2] remotecall_fetch(::Function, ::Distributed.Worker, ::Type, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/remotecall.jl:382
[3] #remotecall_fetch#148(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(Distributed.remotecall_fetch), ::Function, ::Int64, ::Type, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/remotecall.jl:417
[4] remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.3/Distributed/src/remotecall.jl:417 [inlined]
[5] (::KmerAnalysis.var"#1#2"{PairedReads{DNAAlphabet{2}},Mer{DNAAlphabet{2},31},String,Canonical,Array{Array{MerCount{Mer{DNAAlphabet{2},31}},1},1},Int64,Int64})() at ./task.jl:333
...and 7 more exception(s).Naturally, for such a small and simple example as this, using DistMemCounter for such a small dataset is probably just worse than doing it in memory serially, because of the overheads. For real datasets you should see a benefit.