Abstract Types

BioSequences exports an abstract BioSequence type, and several concrete sequence types which inherit from it.

The abstract BioSequence

BioSequences provides an abstract type called BioSequence{A<:Alphabet}. This abstract type, and the methods and traits it supports, allow many algorithms in BioSequences to be written as generically as possible, reducing the amount of code to read and understand while maintaining high performance when such code is compiled for a concrete BioSequence subtype. It also allows new types to be implemented that are fully compatible with the rest of BioSequences, provided that key methods or traits are defined.

BioSequences.BioSequence - Type
BioSequence{A <: Alphabet}

BioSequence is the main abstract type of BioSequences. It abstracts over the internal representation of different biological sequences, and is parameterized by an Alphabet, which controls the element type.

Extended help

Its subtypes are characterized by:

  • Being a linear container type with random access and indices Base.OneTo(length(x)).
  • Containing zero or more internal data elements of type encoded_data_eltype(typeof(x)).
  • Being associated with an Alphabet A, by being a subtype of BioSequence{A}.

A BioSequence{A} is indexed by an integer. The biosequence subtype, the index and the alphabet A determine how to extract the internal encoded data. The alphabet decides how to decode the data to the element type of the biosequence. Hence, the element type and container type of a BioSequence are separated.

Subtypes T of BioSequence must implement the following, with E being an encoded data type:

  • Base.length(::T)::Int
  • encoded_data_eltype(::Type{T})::Type{E}
  • extract_encoded_element(::T, ::Integer)::E
  • copy(::T)
  • T must be able to be constructed from any iterable with length defined and with a known, compatible element type.

Furthermore, mutable sequences should implement

  • encoded_setindex!(::T, ::E, ::Integer)
  • T(undef, ::Int)
  • resize!(::T, ::Int)

For compatibility with existing Alphabets, the encoded data eltype must be UInt.

Some aliases for BioSequence are also provided for your convenience:

Let's have a closer look at some of the methods that a subtype of BioSequence must implement. For length, copy and resize!, see the Julia Base library documentation.

BioSequences.encoded_data_eltype - Function
encoded_data_eltype(::Type{<:BioSequence})

Returns the element type of the encoded data of the BioSequence. This is the return type of extract_encoded_element, i.e. the data type that stores the biological symbols in the biosequence.

See also: BioSequence

BioSequences.extract_encoded_element - Function
extract_encoded_element(::BioSequence{A}, i::Integer)

Returns the encoded element at position i. This data can be decoded using decode(A(), data) to yield the element type of the biosequence.

See also: BioSequence

BioSequences.encoded_setindex! - Function
encoded_setindex!(seq::BioSequence, x::E, i::Integer)

Given encoded data x of type encoded_data_eltype(typeof(seq)), sets the internal sequence data at the given index.

See also: BioSequence
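
As a minimal sketch of how these three functions relate, the encoded element at an index can be extracted, decoded and written back by hand. This assumes BioSequences v3 is installed and uses a LongDNA{4} sequence purely for illustration; the functions are called with their module prefix on the assumption that they may not be exported.

```julia
using BioSequences

seq = LongDNA{4}("ACGT")

# The unsigned integer type used to hold one encoded symbol.
E = BioSequences.encoded_data_eltype(typeof(seq))

# Raw encoded value stored at position 3.
data = BioSequences.extract_encoded_element(seq, 3)

# Decoding with the sequence's alphabet gives the same symbol as seq[3].
sym = BioSequences.decode(Alphabet(seq), data)

# Encode a symbol with the same alphabet and write it back at position 1.
encoded_t = BioSequences.encode(Alphabet(seq), DNA_T)
BioSequences.encoded_setindex!(seq, encoded_t, 1)
```

In everyday code, seq[i] and seq[i] = x perform these steps, which is why user code rarely needs to call the encoded-data functions directly.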

For a correctly defined subtype of BioSequence that satisfies the interface, the vast majority of methods described in the rest of this manual should work out of the box. They can always be overloaded if needed. Indeed, some of the generic BioSequence methods are overloaded for LongSequence, for example for transformation and counting operations, where efficiency gains can be made due to the specific internal representation of the type.
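
To make the interface concrete, here is a hedged sketch of a small immutable subtype. The Codon type, its field x and its bit layout are hypothetical and chosen only for illustration; the sketch assumes BioSequences v3 and implements just the methods required of immutable sequences.

```julia
using BioSequences

# Hypothetical immutable sequence type: three RNA bases packed into one byte,
# two bits per base, with the first base stored in the highest used bits.
struct Codon <: BioSequence{RNAAlphabet{2}}
    x::UInt8
    # Inner constructor from raw bits. Defining it suppresses the default
    # constructors, so the iterable constructor below does not clash with them.
    Codon(x::UInt8) = new(x)
end

# Required: construction from an iterable with a known length and a
# compatible element type (here anything convertible to RNA).
function Codon(itr)
    length(itr) == 3 || error("a Codon must contain exactly 3 symbols")
    bits = zero(UInt)
    for (i, nt) in enumerate(itr)
        bits |= BioSequences.encode(RNAAlphabet{2}(), convert(RNA, nt)) << (6 - 2i)
    end
    return Codon(bits % UInt8)
end

# Required interface methods.
Base.length(::Codon) = 3
BioSequences.encoded_data_eltype(::Type{Codon}) = UInt
function BioSequences.extract_encoded_element(c::Codon, i::Integer)
    return ((c.x >>> (6 - 2i)) & 0x03) % UInt
end
Base.copy(c::Codon) = Codon(c.x)

codon = Codon("AUG")
second_base = codon[2]  # indexing comes from the generic BioSequence methods
```

With only these methods defined, generic operations such as indexing are expected to work on the hypothetical Codon type. A mutable type would additionally need the methods listed for mutable sequences.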

The abstract Alphabet

Alphabets control how biological symbols are encoded and decoded. They also confer many of the automatic traits and methods that any subtype T<:BioSequence{A<:Alphabet} will get.

BioSequences.Alphabet - Type
Alphabet

Alphabet is the most important type trait for BioSequence. An Alphabet represents a set of biological symbols encoded by a sequence, e.g. A, C, G and T for a DNA Alphabet that requires only 2 bits to represent each symbol.

Extended help

  • Subtypes of Alphabet are singleton structs that may or may not be parameterized.
  • Alphabets span over a finite set of biological symbols.
  • The alphabet controls the encoding of a BioSymbol of the alphabet's element type to some internal "encoded data", as well as the decoding, the inverse process.
  • An Alphabet's encode method must not produce invalid data.

Every subtype A of Alphabet must implement:

  • Base.eltype(::Type{A})::Type{S} for some eltype S, which must be a BioSymbol.
  • symbols(::A)::Tuple{Vararg{S}}. This returns a tuple of all symbols in the set of A.
  • encode(::A, ::S)::E encodes a symbol to an internal data eltype E.
  • decode(::A, ::E)::S decodes an internal data eltype E to a symbol S.
  • Except for eltype which must follow Base conventions, all functions operating on Alphabet should operate on instances of the alphabet, not the type.

If you want interoperation with existing subtypes of BioSequence, the encoded representation E must be of type UInt, and you must also implement:

  • BitsPerSymbol(::A)::BitsPerSymbol{N}, where N must be zero or a power of two in [1, 2, 4, 8, 16, 32, [64 for 64-bit systems]].

For increased performance, see BioSequences.AsciiAlphabet
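
The required methods can be explored on a built-in alphabet. The following is a small sketch, assuming BioSequences v3, that queries DNAAlphabet{2} and checks that encode and decode are inverses for every symbol. The functions are called with their module prefix on the assumption that they may not be exported.

```julia
using BioSequences

A = DNAAlphabet{2}()

S = eltype(typeof(A))               # symbol type of the alphabet
syms = BioSequences.symbols(A)      # tuple of all symbols in the alphabet
bps = BioSequences.BitsPerSymbol(A) # trait instance carrying the bit width

# encode and decode should round-trip every symbol of the alphabet.
roundtrip = all(syms) do s
    BioSequences.decode(A, BioSequences.encode(A, s)) == s
end
```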

BioSequences.AsciiAlphabet - Type
AsciiAlphabet

Trait for alphabet using ASCII characters as String representation. Define codetype(A) = AsciiAlphabet() for a user-defined Alphabet A to gain speed. Methods needed: BioSymbols.stringbyte(::eltype(A)) and ascii_encode(A, ::UInt8).

Concrete types

Implemented alphabets

BioSequences.DNAAlphabet - Type

DNA nucleotide alphabet.

DNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

BioSequences.RNAAlphabet - Type

RNA nucleotide alphabet.

RNAAlphabet has a parameter N which is a number that determines the BitsPerSymbol trait. Currently supported values of N are 2 and 4.

Long Sequences

BioSequences.LongSequence - Type
LongSequence{A <: Alphabet}

General-purpose BioSequence. This type is mutable and variable-length, and should be preferred for most use cases.

Extended help

LongSequence{A<:Alphabet} <: BioSequence{A} is parameterized by a concrete Alphabet type A that defines the domain (or set) of biological symbols permitted.

As the BioSequence interface definition implies, LongSequences store the biological symbol elements that they contain in a succinct encoded form that permits many operations to be done in an efficient bit-parallel manner. As per the interface of BioSequence, the Alphabet determines how an element is encoded when it is inserted into the sequence and decoded when it is extracted from it.

For example, AminoAcidAlphabet is associated with AminoAcid and hence an object of the LongSequence{AminoAcidAlphabet} type represents a sequence of amino acids.

Symbols from multiple alphabets can't be intermixed in one sequence type.

The following table summarizes common LongSequence types that have been given aliases for convenience.

Type                              Symbol type   Type alias
LongSequence{DNAAlphabet{N}}      DNA           LongDNA{N}
LongSequence{RNAAlphabet{N}}      RNA           LongRNA{N}
LongSequence{AminoAcidAlphabet}   AminoAcid     LongAA

The LongDNA{4} and LongRNA{4} aliases use DNAAlphabet{4} and RNAAlphabet{4}, respectively.

DNAAlphabet{4} permits ambiguous nucleotides, and a sequence must use at least 4 bits to internally store each element (and indeed LongSequence does).

If you are sure that you are working with sequences with no ambiguous nucleotides, you can use LongSequences parameterised with DNAAlphabet{2} instead.

DNAAlphabet{2} is an alphabet that uses two bits per base and is limited to the unambiguous nucleotide symbols (A, C, G, T).

Changing this single parameter is all you need to do in order to benefit from memory savings. Some computations that use bitwise operations will also be dramatically faster.

The same applies to LongSequence{RNAAlphabet{4}}: simply replace the alphabet parameter with RNAAlphabet{2} in order to benefit.
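
The difference between the two alphabets can be sketched as follows, assuming BioSequences v3 and using the LongDNA{N} alias from the table above. The variable names are illustrative only.

```julia
using BioSequences

# Four bits per symbol: ambiguity codes such as N are allowed.
ambiguous = LongDNA{4}("ACGTN")

# Two bits per symbol: only A, C, G and T can be stored.
compact = LongDNA{2}("ACGTACGT")

# A four-bit sequence without ambiguous symbols can be re-encoded with two bits.
converted = LongDNA{2}(LongDNA{4}("ACGT"))

# Storing an ambiguous symbol in a two-bit sequence is expected to throw.
rejected = try
    LongDNA{2}("ACGTN")
    false
catch
    true
end
```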

Sequence views

Similar to how Base Julia offers views of array objects, BioSequences offers views of LongSequences: the LongSubSeq{A<:Alphabet}.

Conceptually, a LongSubSeq{A} is similar to a LongSequence{A}, but instead of storing its own data, it refers to the data of a LongSequence. Modifying the LongSequence will be reflected in the view, and vice versa. If the underlying LongSequence is truncated, the behaviour of a view is undefined. For the same reason, some operations, such as resizing, are not supported for views.

Because a LongSubSeq only contains a pointer to the underlying array, an offset and a length, it is much lighter than a LongSequence, and will be stack allocated on Julia 1.5 and newer. Thus, the user may construct millions of views without major performance implications.
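
The shared-data behaviour can be sketched as below, assuming BioSequences v3, where calling view on a LongSequence with a unit range is expected to return a LongSubSeq.

```julia
using BioSequences

seq = LongDNA{4}("ACGTACGT")

# A view of positions 3 to 6; no sequence data is copied.
v = view(seq, 3:6)

# Writing through the view changes the parent sequence...
v[1] = DNA_T

# ...and writing to the parent is visible through the view.
seq[4] = DNA_A
```

When an independent copy is needed instead, ordinary range indexing such as seq[3:6] allocates a new LongSequence.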