API: The MerTools submodule

Note

This is a reference for the API of an internal sub-module, intended for developers and experienced users. Before using it, check whether what you need is already covered by the higher-level WorkSpace API.

Types

GenomeGraphs.MerTools.MerCount - Type

A simple mer count struct.

MerCount is a simple struct that binds a mer value to the number of times it has been observed. This type, (sorted) vectors of it, and some additional utility methods form the basic building blocks of the higher-level mer counting functionality of the MerTools sub-module.

Note

The count is stored as a UInt8 because once a mer has been observed more than 255 times, the exact count rarely matters.
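To make the layout concrete, here is a minimal sketch of a mer/count pair with a one-byte count. The name SketchMerCount, the use of UInt64 as a stand-in mer, and the saturating increment are all assumptions made for illustration; the package's own definition and overflow behaviour may differ.

```julia
# Hypothetical stand-in for MerCount; not the package's definition.
struct SketchMerCount{M}
    mer::M
    count::UInt8
end

# Saturating increment (an assumption): stay at 255 rather than wrap around to 0.
function increment(mc::SketchMerCount{M}) where {M}
    c = mc.count == typemax(UInt8) ? typemax(UInt8) : mc.count + 0x01
    return SketchMerCount{M}(mc.mer, c)
end

a = SketchMerCount{UInt64}(0x000000000000001b, 0x01)
b = increment(a)                                        # count goes from 1 to 2
full = increment(SketchMerCount{UInt64}(a.mer, 0xff))   # count stays at 255
```

Keeping the struct immutable means a sorted Vector of these values is stored inline, which is what makes sorting and merging large count lists cheap.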

Public / Safe methods

GenomeGraphs.MerTools.collapse_into_counts - Function
collapse_into_counts(mers::Vector{M}) where {M<:AbstractMer}

Build a vector of sorted MerCounts from a Vector of a mer type.

This is a basic kernel function used by higher-level, more complex kmer counting procedures.

GenomeGraphs.MerTools.collapse_into_counts! - Function
collapse_into_counts!(result::Vector{MerCount{M}}, mers::Vector{M}) where {M<:AbstractMer}

Build a vector of sorted MerCounts from a Vector of a mer type.

This is a basic kernel function used by higher-level, more complex kmer counting procedures.

It is like collapse_into_counts, except that its first argument is a result vector, which is cleared and then filled with the result.

Note

The input vector mers will be sorted by this method.
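The behaviour described above (clear the result, sort the input, then collapse runs of equal mers into counts) can be sketched in plain Julia as follows. This is not the package's implementation: SketchMerCount and sketch_collapse_into_counts! are hypothetical names, UInt64 values stand in for mers, and clamping the count at 255 is an assumption.

```julia
# Hypothetical stand-in for MerCount.
struct SketchMerCount{M}
    mer::M
    count::UInt8
end

function sketch_collapse_into_counts!(result::Vector{SketchMerCount{M}},
                                      mers::Vector{M}) where {M}
    empty!(result)   # the result vector is cleared first
    sort!(mers)      # the input is sorted in place, as the note warns
    i = 1
    while i <= length(mers)
        mer = mers[i]
        n = 0
        # Count the run of identical mers starting at position i.
        while i <= length(mers) && mers[i] == mer
            n += 1
            i += 1
        end
        # Clamp to 255 so the count fits in a UInt8 (assumed behaviour).
        push!(result, SketchMerCount{M}(mer, UInt8(min(n, 255))))
    end
    return result
end

mers = UInt64[7, 3, 7, 1, 3, 7]
counts = sketch_collapse_into_counts!(SketchMerCount{UInt64}[], mers)
```

After the call, counts holds one entry per distinct mer in ascending order, and mers itself has been reordered, which is the side effect the note refers to.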

GenomeGraphs.MerTools.merge_into! - Function
merge_into!(a::Vector{MerCount{M}}, b::Vector{MerCount{M}}) where {M<:AbstractMer}

Merge the MerCounts from vector b into the vector a.

Note

This will sort the input vectors a and b.
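As an illustration of what such a merge involves, the sketch below sorts both vectors by mer and then walks them with two indices, summing the counts of mers that appear in both. The names are hypothetical, UInt64 values stand in for mers, each input is assumed to hold at most one entry per mer, and the clamped addition is an assumption about overflow handling.

```julia
# Hypothetical stand-in for MerCount.
struct SketchMerCount{M}
    mer::M
    count::UInt8
end

# Add two counts, clamping at 255 (assumed overflow policy).
saturating_add(x::UInt8, y::UInt8) = UInt8(min(Int(x) + Int(y), 255))

function sketch_merge_into!(a::Vector{SketchMerCount{M}},
                            b::Vector{SketchMerCount{M}}) where {M}
    sort!(a, by = x -> x.mer)   # both inputs are sorted, as the note warns
    sort!(b, by = x -> x.mer)
    merged = SketchMerCount{M}[]
    i, j = 1, 1
    while i <= length(a) && j <= length(b)
        if a[i].mer < b[j].mer
            push!(merged, a[i]); i += 1
        elseif b[j].mer < a[i].mer
            push!(merged, b[j]); j += 1
        else
            # Same mer in both lists: keep one entry with the summed count.
            push!(merged, SketchMerCount{M}(a[i].mer, saturating_add(a[i].count, b[j].count)))
            i += 1; j += 1
        end
    end
    append!(merged, a[i:end])   # whatever is left in either list
    append!(merged, b[j:end])
    empty!(a)
    append!(a, merged)          # the merged list replaces the contents of a
    return a
end

a = [SketchMerCount{UInt64}(7, 3), SketchMerCount{UInt64}(1, 1), SketchMerCount{UInt64}(3, 2)]
b = [SketchMerCount{UInt64}(5, 4), SketchMerCount{UInt64}(3, 1)]
sketch_merge_into!(a, b)
```

This sketch builds a temporary vector for clarity; an implementation that cares about allocations would more likely merge in place.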

GenomeGraphs.MerTools.build_freq_list - Function
build_freq_list(::Type{M}, sbuf::SequenceBuffer{PairedReads}, range::UnitRange{Int}) where {M<:AbstractMer}

Build a sorted list (vector) of kmer counts (MerFreq), serially and in memory.

This function is a serial, in-memory MerFreq list builder. It can build a kmer count from a PairedReads datastore on its own (if you have the memory and time), but it is also intended to be composed into other multi-process or multi-threaded kmer counting strategies.

This method roughly estimates how many kmers will be generated by the reads specified by range in the dataset, and pre-allocates an array to contain them. It then collects the kmers, sorts them, and collapses them into a list of counts sorted by kmer.
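The estimate, collect, sort, and collapse steps can be sketched on toy data as below. Everything here is an assumption made for illustration: reads are plain ASCII Strings rather than a SequenceBuffer, kmers are substrings rather than an AbstractMer type, the function name is hypothetical, and details such as canonical kmers or paired-read handling are ignored.

```julia
function sketch_build_freq_list(reads::Vector{String}, range::UnitRange{Int}, k::Int)
    # A read of length n yields at most n - k + 1 kmers.
    estimate = 0
    for i in range
        estimate += max(0, length(reads[i]) - k + 1)
    end
    mers = String[]
    sizehint!(mers, estimate)   # reserve space for all kmers up front

    # Collect every kmer from the selected reads.
    for i in range
        seq = reads[i]
        for start in 1:(length(seq) - k + 1)
            push!(mers, seq[start:start + k - 1])
        end
    end

    # Sort, then collapse runs of equal kmers into (kmer => count) pairs.
    sort!(mers)
    counts = Pair{String,UInt8}[]
    for mer in mers
        if !isempty(counts) && counts[end].first == mer
            counts[end] = mer => UInt8(min(Int(counts[end].second) + 1, 255))
        else
            push!(counts, mer => 0x01)
        end
    end
    return counts
end

reads = ["ACGTAC", "GTACG"]
counts = sketch_build_freq_list(reads, 1:2, 3)
```

The peak memory of this approach is dominated by the kmer buffer, which holds every kmer occurrence before any are collapsed.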

build_freq_list(::Type{M}, sbuf::SequenceBuffer{PairedReads}, range::UnitRange{Int}, chunk_size::Int) where {M<:AbstractMer}

Build a sorted list (vector) of kmer counts (MerFreq), serially and in memory.

This function is a serial, in-memory MerFreq list builder. It can build a kmer count from a PairedReads datastore on its own (if you have the memory and time), but it is also intended to be composed into other multi-process or multi-threaded kmer counting strategies.

This method pre-allocates space for chunk_size kmers, and iterates over kmers in the reads in the dataset specified by range until the buffer is filled. The mers are then collapsed into a list of counts, sorted by the kmer. This list is then merged into another output list. This process repeats for many chunks of kmers, building up the output list.

This method is useful when you do not want to (or do not have the space to) allocate a single buffer that holds all the kmers in the dataset at once.
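The chunked strategy can be sketched as follows: fill a fixed-size buffer, collapse it into sorted counts, merge those into the running output, and repeat. This is an illustration under stated assumptions rather than the package's code: the function names are hypothetical, a vector of UInt64 values stands in for the kmers iterated from the reads, counts are stored as Pair{UInt64,UInt8}, and counts are clamped at 255.

```julia
clamp_count(n::Int) = UInt8(min(n, 255))

# Collapse an already sorted vector of mers into (mer => count) pairs.
function collapse_sorted(mers::Vector{UInt64})
    out = Pair{UInt64,UInt8}[]
    for mer in mers
        if !isempty(out) && out[end].first == mer
            out[end] = mer => clamp_count(Int(out[end].second) + 1)
        else
            push!(out, mer => 0x01)
        end
    end
    return out
end

# Merge two count lists that are each sorted by mer.
function merge_sorted(a::Vector{Pair{UInt64,UInt8}}, b::Vector{Pair{UInt64,UInt8}})
    out = Pair{UInt64,UInt8}[]
    i, j = 1, 1
    while i <= length(a) && j <= length(b)
        if a[i].first < b[j].first
            push!(out, a[i]); i += 1
        elseif b[j].first < a[i].first
            push!(out, b[j]); j += 1
        else
            push!(out, a[i].first => clamp_count(Int(a[i].second) + Int(b[j].second)))
            i += 1; j += 1
        end
    end
    append!(out, a[i:end])
    append!(out, b[j:end])
    return out
end

function sketch_chunked_counts(stream, chunk_size::Int)
    buffer = UInt64[]
    sizehint!(buffer, chunk_size)        # space for one chunk only
    output = Pair{UInt64,UInt8}[]
    for mer in stream
        push!(buffer, mer)
        if length(buffer) == chunk_size  # buffer full: collapse and merge
            output = merge_sorted(output, collapse_sorted(sort!(buffer)))
            empty!(buffer)
        end
    end
    if !isempty(buffer)                  # final, partially filled chunk
        output = merge_sorted(output, collapse_sorted(sort!(buffer)))
    end
    return output
end

stream = UInt64[7, 3, 7, 1, 3, 7, 2]
chunked = sketch_chunked_counts(stream, 3)
```

The trade-off is more merge passes in exchange for a buffer whose size is bounded by chunk_size instead of by the total number of kmers.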

Internal / Unsafe methods

GenomeGraphs.MerTools.unsafe_collapse_into_counts! - Function
unsafe_collapse_into_counts!(result::Vector{MerCount{M}}, mers::Vector{M}) where {M<:AbstractMer}

Warning

This method is marked as unsafe because it assumes that the mers input vector is already sorted.
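To see why the sortedness assumption matters, consider a run-length collapse that, like the unsafe method, does not sort its input. The function below is a hypothetical sketch using UInt64 values as stand-in mers, not the package's implementation; it only illustrates the kind of silent error that an unsorted input can produce.

```julia
# Run-length collapse that trusts the caller to pass sorted input.
function sketch_unsafe_collapse(mers::Vector{UInt64})
    out = Pair{UInt64,UInt8}[]
    for mer in mers
        if !isempty(out) && out[end].first == mer
            out[end] = mer => UInt8(min(Int(out[end].second) + 1, 255))
        else
            push!(out, mer => 0x01)
        end
    end
    return out
end

unsorted = UInt64[7, 3, 7]
bad = sketch_unsafe_collapse(unsorted)          # 7 ends up with two separate entries
good = sketch_unsafe_collapse(sort(unsorted))   # one entry per distinct mer
```

No error is raised in the unsorted case; the result is simply wrong, so a caller that cannot guarantee the ordering may want to check it with issorted first or use the safe method.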

GenomeGraphs.MerTools.unsafe_merge_into! - Function
unsafe_merge_into!(a::Vector{MerCount{M}}, b::Vector{MerCount{M}}) where {M<:AbstractMer}

Merge the MerCounts from vector b into the vector a.

Warning

This method is marked as unsafe as it assumes both of the input vectors a and b are already sorted.