Counting
BioSequences
contains functionality to efficiently count biosymbols in a biosequence that satisfies some predicate.
Consider a naive counting function like this:
function count_Ns(seq::BioSequence{<:DNAAlphabet})
ns = 0
for i in seq
ns += (i == DNA_N)::Bool
end
ns
end
This function can be more efficiently implemented by exploiting the internal data layout of certain biosequences. Therefore, Julia provides optimised methods for Base.count
, such that count_Ns
above can be more efficiently expressed count(==(DNA_N), seq)
.
It is important to understand that this speed is achieved with custom methods of Base.count
, and not by a generic mechanism that improves the speed of counting symbols in BioSequence
in general. Hence, while count(==(DNA_N), seq)
may be optimised, count(i -> i == DNA_N, seq)
is not, as this is a different method.
Currently optimised methods
By default, only the BioSequence
and Alphabet
types found in BioSequences.jl
have optimised methods.
count(isGC, seq)
count(isambiguous, seq)
count(iscertain, seq)
count(isgap, seq)
count(==(biosymbol), seq)
andcount(isequal(biosymbol), seq)
Matches and mismatches
The methods matches
and mismatches
take two sequences and count the number of positions where the sequences are unequal or equal, respectively.
They are equivalent to matches(a, b) = count(splat(==), zip(a, b))
(and with !=
, respectively).
BioSequences.matches
— Functionmatches(a::BioSequence, b::BioSequences) -> Int
Count the number of positions in where a
and b
are equal. If b
is given, and the length of a
and b
differ, look only at the indices of the shorter sequence. This function does not provide any special handling of ambiguous symbols, so e.g. DNA_A
does not match DNA_N
.
Passing in two sequences with differing lengths is deprecated. In a future, breaking release of BioSequences, this will error.
Examples
julia> matches(dna"TAWNNA", dna"TACCTA")
3
julia> matches(dna"AACA", dna"AAG")
2
BioSequences.mismatches
— Functionmismatches(a::BioSequence, b::BioSequences) -> Int
Count the number of positions in where a
and b
differ. If b
is given, and the length of a
and b
differ, look only at the indices of the shorter sequence. This function does not provide any special handling of ambiguous symbols, so e.g. DNA_A
does not match DNA_N
.
Passing in two sequences with differing lengths is deprecated. In a future, breaking release of BioSequences, this will error.
Examples
julia> mismatches(dna"TAGCTA", dna"TACNTA")
2
julia> mismatches(dna"AACA", dna"AAG")
1
GC content
The convenience function gc_content(seq)
is equivalent to count(isGC, seq) / length(seq)
:
BioSequences.gc_content
— Functiongc_content(seq::BioSequence) -> Float64
Calculate GC content of seq
, i.e. the number of symbols that is DNA_C
, DNA_G
, DNA_C
or DNA_G
divided by the length of the sequence.
Examples
julia> gc_content(dna"AGCTA")
0.4
julia> gc_content(rna"UAGCGA")
0.5
Deprecated aliases
Several of the optimised count
methods have function names, which are deprecated:
Deprecated function | Instead use |
---|---|
n_gaps | count(isgap, seq) |
n_certain | count(iscertain, seq) |
n_ambiguous | count(isambiguous, seq) |
BioSequences.n_gaps
— Functionn_gaps(a::BioSequence, [b::BioSequence]) -> Int
Count the number of positions where a
(or b
, if present) have gaps. If b
is given, and the length of a
and b
differ, look only at the indices of the shorter sequence.
Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError
Examples
julia> n_gaps(dna"--TAC-WN-ACY")
4
julia> n_gaps(dna"TC-AC-", dna"-CACG")
2
BioSequences.n_certain
— Functionn_certain(a::BioSequence, [b::BioSequence]) -> Int
Count the number of positions where a
(and b
, if present) have certain (i.e. non-ambigous and non-gap) symbols. If b
is given, and the length of a
and b
differ, look only at the indices of the shorter sequence. Gaps are not certain.
Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError
Examples
julia> n_certain(dna"--TAC-WN-ACY")
5
julia> n_certain(rna"UAYWW", rna"UAW")
2
BioSequences.n_ambiguous
— Functionn_ambiguous(a::BioSequence, [b::BioSequence]) -> Int
Count the number of positions where a
(or b
, if present) have ambigious symbols. If b
is given, and the length of a
and b
differ, look only at the indices of the shorter sequence. Gaps are not ambigous.
Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError
Examples
julia> n_ambiguous(dna"--TAC-WN-ACY")
3
julia> n_ambiguous(rna"UAYWW", rna"UAW")
1