Counting

BioSequences contains functionality to efficiently count biosymbols in a biosequence that satisfies some predicate.

Consider a naive counting function like this:

function count_Ns(seq::BioSequence{<:DNAAlphabet})
    ns = 0
    for i in seq
        ns += (i == DNA_N)::Bool
    end
    ns
end 

This function can be more efficiently implemented by exploiting the internal data layout of certain biosequences. Therefore, Julia provides optimised methods for Base.count, such that count_Ns above can be more efficiently expressed count(==(DNA_N), seq).

Note

It is important to understand that this speed is achieved with custom methods of Base.count, and not by a generic mechanism that improves the speed of counting symbols in BioSequencein general. Hence, while count(==(DNA_N), seq) may be optimised, count(i -> i == DNA_N, seq) is not, as this is a different method.

Currently optimised methods

By default, only the BioSequence and Alphabet types found in BioSequences.jl have optimised methods.

  • count(isGC, seq)
  • count(isambiguous, seq)
  • count(iscertain, seq)
  • count(isgap, seq)
  • count(==(biosymbol), seq) and count(isequal(biosymbol), seq)

Matches and mismatches

The methods matches and mismatches take two sequences and count the number of positions where the sequences are unequal or equal, respectively.

They are equivalent to matches(a, b) = count(splat(==), zip(a, b)) (and with !=, respectively).

BioSequences.matchesFunction
matches(a::BioSequence, b::BioSequences) -> Int

Count the number of positions in where a and b are equal. If b is given, and the length of a and b differ, look only at the indices of the shorter sequence. This function does not provide any special handling of ambiguous symbols, so e.g. DNA_A does not match DNA_N.

Warning

Passing in two sequences with differing lengths is deprecated. In a future, breaking release of BioSequences, this will error.

Examples

julia> matches(dna"TAWNNA", dna"TACCTA")
3

julia> matches(dna"AACA", dna"AAG")
2
source
BioSequences.mismatchesFunction
mismatches(a::BioSequence, b::BioSequences) -> Int

Count the number of positions in where a and b differ. If b is given, and the length of a and b differ, look only at the indices of the shorter sequence. This function does not provide any special handling of ambiguous symbols, so e.g. DNA_A does not match DNA_N.

Warning

Passing in two sequences with differing lengths is deprecated. In a future, breaking release of BioSequences, this will error.

Examples

julia> mismatches(dna"TAGCTA", dna"TACNTA")
2

julia> mismatches(dna"AACA", dna"AAG")
1
source

GC content

The convenience function gc_content(seq) is equivalent to count(isGC, seq) / length(seq):

BioSequences.gc_contentFunction
gc_content(seq::BioSequence) -> Float64

Calculate GC content of seq, i.e. the number of symbols that is DNA_C, DNA_G, DNA_C or DNA_G divided by the length of the sequence.

Examples

julia> gc_content(dna"AGCTA")
0.4

julia> gc_content(rna"UAGCGA")
0.5
source

Deprecated aliases

Several of the optimised count methods have function names, which are deprecated:

Deprecated functionInstead use
n_gapscount(isgap, seq)
n_certaincount(iscertain, seq)
n_ambiguouscount(isambiguous, seq)
BioSequences.n_gapsFunction
n_gaps(a::BioSequence, [b::BioSequence]) -> Int

Count the number of positions where a (or b, if present) have gaps. If b is given, and the length of a and b differ, look only at the indices of the shorter sequence.

Warning

Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError

Examples

julia> n_gaps(dna"--TAC-WN-ACY")
4

julia> n_gaps(dna"TC-AC-", dna"-CACG")
2
source
BioSequences.n_certainFunction
n_certain(a::BioSequence, [b::BioSequence]) -> Int

Count the number of positions where a (and b, if present) have certain (i.e. non-ambigous and non-gap) symbols. If b is given, and the length of a and b differ, look only at the indices of the shorter sequence. Gaps are not certain.

Warning

Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError

Examples

julia> n_certain(dna"--TAC-WN-ACY")
5

julia> n_certain(rna"UAYWW", rna"UAW")
2
source
BioSequences.n_ambiguousFunction
n_ambiguous(a::BioSequence, [b::BioSequence]) -> Int

Count the number of positions where a (or b, if present) have ambigious symbols. If b is given, and the length of a and b differ, look only at the indices of the shorter sequence. Gaps are not ambigous.

Warning

Passing in two sequences is deprecated. In a future, breaking release of BioSequences, this will throw a MethodError

Examples

julia> n_ambiguous(dna"--TAC-WN-ACY")
3

julia> n_ambiguous(rna"UAYWW", rna"UAW")
1
source