Sequence composition
There are many instances in analyzing sequence data where you will want to know about the composition of your sequences.
For example, for a given sequence, you may want to count how many times each possible k-mer occurs in it. This matters if, for instance, you want to analyze the k-mer spectra of your data. Alternatively, you might have a collection of sequences and want to count how many copies of each unique sequence it contains. This matters if, for instance, the collection is a population sample and you want to compute the allele or genotype frequencies for the population.
Whatever the application, BioSequences provides a method called composition and a parametric struct called Composition to compute and handle the results of such sequence composition calculations.
BioSequences.Composition — Type
Sequence composition.
This is a subtype of Associative{T,Int}, and the getindex method returns the number of occurrences of a symbol or a k-mer.
BioSequences.composition — Function
composition(seq | kmer_iter)
Calculate composition of biological symbols in seq or k-mers in kmer_iter.
composition(iter)
A generalised composition algorithm, which counts how many times each unique item is produced by an iterable.
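The idea behind the generalised method can be sketched in plain Julia, without BioSequences. This is only an illustration of counting the items produced by an iterable, not the package's implementation; `count_items` is a hypothetical helper, and ordinary strings stand in for sequences.

```julia
# Count how many times each distinct item is produced by an iterable.
# `count_items` is a hypothetical name used for illustration only.
function count_items(iter)
    counts = Dict{eltype(iter),Int}()
    for item in iter
        # Start from 0 when the item has not been seen before.
        counts[item] = get(counts, item, 0) + 1
    end
    return counts
end

# Plain strings stand in for biological sequences here.
seqs = ["AAAAAAAATTTTTT", "AAAAAAAATTTTTT", "AAAATT"]
counts = count_items(seqs)

# The same helper counts symbols when given a single string.
symbols = count_items("ACGAG")
```

Here `counts` maps each distinct string to its number of occurrences, and `symbols` maps each character to how often it appears.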
Example
# Example, counting unique sequences.
julia> a = dna"AAAAAAAATTTTTT"
14nt DNA Sequence:
AAAAAAAATTTTTT
julia> b = dna"AAAAAAAATTTTTT"
14nt DNA Sequence:
AAAAAAAATTTTTT
julia> c = a[5:10]
6nt DNA Sequence:
AAAATT
julia> composition([a, b, c])
Vector{BioSequences.BioSequence{BioSequences.DNAAlphabet{4}}} Composition:
AAAATT => 1
AAAAAAAATTTTTT => 2
For example, to get the nucleotide composition of a sequence:
julia> comp = composition(dna"ACGAG");
julia> comp[DNA_A]
2
julia> comp[DNA_T]
0
Composition structs behave like an associative collection, such as a Dict, but there are a few differences:
- The getindex method for Composition structs is overloaded to return a default value of 0 if a key is used that is not present in the Composition.
- The merge! method for two Composition structs adds counts together, unlike the merge! method for other associative containers, which would overwrite the counts.
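The two differences can be illustrated with ordinary Dicts from Julia's standard library. This is a sketch of the behaviour being contrasted, not of how Composition is implemented. It assumes Julia 1.5 or later for `mergewith`, and the variable names are illustrative.

```julia
# Two plain count tables, standing in for two compositions.
a = Dict('A' => 2, 'C' => 1)
b = Dict('A' => 1, 'G' => 3)

# Indexing a Dict with a missing key (a['T']) throws a KeyError.
# `get` with a default mimics the zero default described above.
missing_count = get(a, 'T', 0)

# The standard merge keeps the value from the last Dict for a shared key.
overwritten = merge(a, b)

# mergewith(+, ...) adds the values for shared keys instead,
# which is the behaviour described for Composition's merge!.
summed = mergewith(+, a, b)
```

With these inputs, `overwritten['A']` is 1 (the value from `b`), whereas `summed['A']` is 3.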
merge! is used to accumulate composition statistics of multiple sequences:
julia> # initialize an empty composition counter
comp = composition(dna"");
julia> # iterate over sequences (`seqs` is any collection of sequences) and accumulate composition statistics into `comp`
for seq in seqs
    merge!(comp, composition(seq))
end
julia> # or functional programming style in one line
foldl((x, y) -> merge(x, composition(y)), composition(dna""), seqs)
composition is also applicable to a k-mer iterator:
julia> comp = composition(each(DNAMer{4}, dna"ACGT"^100));
julia> comp[DNAMer("ACGT")]
100
julia> comp[DNAMer("CGTA")]
99
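The counts of 100 and 99 above follow from sliding a window of length 4 along the 400-nucleotide sequence. A minimal sketch in plain Julia makes this explicit. It assumes an ASCII string in place of a DNA sequence, and `count_kmers` is a hypothetical helper rather than part of BioSequences.

```julia
# Count every overlapping k-mer in an ASCII string.
# `count_kmers` is a hypothetical name used for illustration only.
function count_kmers(seq::String, k::Integer)
    counts = Dict{String,Int}()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    for i in 1:(length(seq) - k + 1)
        kmer = seq[i:i+k-1]   # byte indexing is safe for ASCII input
        counts[kmer] = get(counts, kmer, 0) + 1
    end
    return counts
end

kmers = count_kmers("ACGT"^100, 4)
kmers["ACGT"]  # 100: starts at positions 1, 5, ..., 397
kmers["CGTA"]  # 99: starts at positions 2, 6, ..., 394
```

The 400-character string contains 397 overlapping 4-mers. "ACGT" accounts for 100 of them, while each of the three shifted k-mers ("CGTA", "GTAC", "TACG") appears 99 times because its last possible window would run past the end of the sequence.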