Sequence composition
There are many instances in analyzing sequence data where you will want to know about the composition of your sequences.
For example, for a given sequence, you may want to count how many times each possible k-mer occurs in the sequence. This would be important if, for instance, you wanted to analyze the k-mer spectra of your data. Alternatively, you might have a collection of sequences and want to count how many of each unique sequence you have in your collection. This would be important if, for instance, your collection of sequences were from a population sample and you wanted to compute the allele or genotype frequencies for the population.
Whatever the application, BioSequences provides a method called composition and a parametric struct called Composition to both compute and handle the results of such sequence composition calculations.
BioSequences.Composition - Type.
Sequence composition.
This is a subtype of Associative{T,Int}, and the getindex method returns the number of occurrences of a symbol or a k-mer.
BioSequences.composition - Function.
composition(seq | kmer_iter)
Calculate composition of biological symbols in seq or k-mers in kmer_iter.
composition(iter)
A generalised composition algorithm, which counts how many times each unique item is produced by an iterable.
Example
# Example, counting unique sequences.
julia> a = dna"AAAAAAAATTTTTT"
14nt DNA Sequence:
AAAAAAAATTTTTT
julia> b = dna"AAAAAAAATTTTTT"
14nt DNA Sequence:
AAAAAAAATTTTTT
julia> c = a[5:10]
6nt DNA Sequence:
AAAATT
julia> composition([a, b, c])
Vector{BioSequences.BioSequence{BioSequences.DNAAlphabet{4}}} Composition:
AAAATT => 1
AAAAAAAATTTTTT => 2
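Conceptually, the generalised algorithm amounts to tallying items in a dictionary. The following is a minimal sketch using only Base Julia, not the BioSequences implementation: count_items is a hypothetical helper name, and plain strings stand in for the DNA sequences of the example above.

```julia
# Hypothetical helper: tally how often each distinct item is produced by `iter`.
# This only illustrates the idea; it is not how BioSequences implements `composition`.
function count_items(iter)
    counts = Dict{eltype(iter), Int}()
    for item in iter
        counts[item] = get(counts, item, 0) + 1
    end
    return counts
end

# Plain strings standing in for the sequences a, b and c.
a = "AAAAAAAATTTTTT"
b = "AAAAAAAATTTTTT"
c = a[5:10]              # "AAAATT"

counts = count_items([a, b, c])
println(counts[a])       # count for the 14-letter sequence
println(counts[c])       # count for the 6-letter subsequence
```

Because a and b compare equal, they share one key in the dictionary, which is why the collection of three sequences yields only two entries.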
For example, to get the nucleotide composition of a sequence:
julia> comp = composition(dna"ACGAG");
julia> comp[DNA_A]
2
julia> comp[DNA_T]
0
Composition structs behave like an associative collection, such as a Dict, but there are a few differences:
- The getindex method for Composition structs is overloaded to return a default value of 0 if a key is used that is not present in the Composition.
- The merge! method for two Composition structs adds counts together, unlike the merge! method for other associative containers, which would overwrite the counts.
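The two differences can be illustrated with ordinary Dicts from Base Julia. This is only a sketch of the semantics described above, with made-up counts; it does not use the Composition type itself.

```julia
# Plain Dicts standing in for two compositions (illustrative data only).
x = Dict('A' => 2, 'C' => 1)
y = Dict('A' => 3, 'G' => 4)

# The default `merge` keeps the value from the last dictionary for shared keys,
# so the count for 'A' from `x` is overwritten.
overwritten = merge(x, y)

# Passing a combining function adds the counts instead, which corresponds to
# the behaviour described above for `merge!` on Composition structs.
summed = merge(+, x, y)

# `get` with a default mimics the "missing key counts as 0" lookup.
missing_count = get(summed, 'T', 0)

println(overwritten['A'])   # value taken from `y`
println(summed['A'])        # counts from `x` and `y` added together
println(missing_count)      # default for a key that is absent
```

With a plain Dict, indexing summed['T'] directly would throw a KeyError, which is the case the overloaded getindex of Composition avoids.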
merge! is used to accumulate composition statistics of multiple sequences:
julia> using BioSequences
julia> # placeholder input for illustration; substitute your own collection of sequences
       seqs = [dna"ACGT", dna"AACC"];
julia> # initialize an empty composition counter
       comp = composition(dna"");
julia> # iterate over sequences and accumulate composition statistics into `comp`
       for seq in seqs
           merge!(comp, composition(seq))
       end
julia> # or functional programming style in one line (initial value passed via `init`)
       foldl((x, y) -> merge(x, composition(y)), seqs; init = composition(dna""));
composition is also applicable to a k-mer iterator:
julia> comp = composition(each(DNAKmer{4}, dna"ACGT"^100));
julia> comp[DNAKmer("ACGT")]
100
julia> comp[DNAKmer("CGTA")]
99
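To see where the counts of 100 and 99 come from, the same tally can be reproduced with a sliding window over a plain string. This is a hedged sketch in Base Julia only: count_kmers is a hypothetical helper, not part of BioSequences, and it assumes ASCII input so that string indexing by position is valid.

```julia
# Hypothetical helper: count every overlapping substring of length k.
# Assumes `seq` is ASCII, so indexing by character position is safe.
function count_kmers(seq::AbstractString, k::Integer)
    counts = Dict{String, Int}()
    for i in 1:(length(seq) - k + 1)
        kmer = seq[i:i+k-1]
        counts[kmer] = get(counts, kmer, 0) + 1
    end
    return counts
end

# 400 letters give 397 overlapping 4-mers.
kmer_counts = count_kmers("ACGT"^100, 4)

println(kmer_counts["ACGT"])   # window starts at positions 1, 5, ..., 397
println(kmer_counts["CGTA"])   # window starts at positions 2, 6, ..., 394
```

ACGT begins at 100 positions, while CGTA spans the boundary between two consecutive repeats and so begins at only 99.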