Abstract Types
BioSequences exports an abstract BioSequence
type, and several concrete sequence types which inherit from it.
The abstract BioSequence
BioSequences provides an abstract type called a BioSequence{A<:Alphabet}
. This abstract type, and the methods and traits it supports, allow many algorithms in BioSequences to be written as generically as possible. This reduces the amount of code to read and understand, whilst maintaining high performance when such code is compiled for a concrete BioSequence subtype. Additionally, it allows new types to be implemented that are fully compatible with the rest of BioSequences, provided that key methods or traits are defined.
BioSequences.BioSequence
— Type
BioSequence{A <: Alphabet}
BioSequence
is the main abstract type of BioSequences
. It abstracts over the internal representation of different biological sequences, and is parameterized by an Alphabet
, which controls the element type.
Extended help
Its subtypes are characterized by:
- Being a linear container type with random access and indices Base.OneTo(length(x)).
- Containing zero or more internal data elements of type encoded_data_eltype(typeof(x)).
- Being associated with an Alphabet, A, by being a subtype of BioSequence{A}.
A BioSequence{A}
is indexed by an integer. The biosequence subtype, the index and the alphabet A
determine how to extract the internal encoded data. The alphabet decides how to decode the data to the element type of the biosequence. Hence, the element type and container type of a BioSequence
are separated.
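The separation between container type, alphabet and element type can be seen on a concrete sequence. The following is a minimal sketch, assuming a BioSequences v3 installation in which the dna string macro, Alphabet and DNAAlphabet are exported:

```julia
using BioSequences

seq = dna"ACGT"                      # a LongSequence{DNAAlphabet{4}}

# The alphabet is a type parameter of the container ...
alphabet = Alphabet(typeof(seq))

# ... and the element type is derived from the alphabet, not from the storage.
println(typeof(seq), " holds elements of type ", eltype(seq))

# Indices run from 1 to length(seq); elements are decoded on access.
for i in eachindex(seq)
    println(i, " => ", seq[i])
end
```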
Subtypes T of BioSequence must implement the following, with E being an encoded data type:
- Base.length(::T)::Int
- encoded_data_eltype(::Type{T})::Type{E}
- extract_encoded_element(::T, ::Integer)::E
- copy(::T)
- T must be able to be constructed from any iterable with length defined and with a known, compatible element type.
Furthermore, mutable sequences should implement:
- encoded_setindex!(::T, ::E, ::Integer)
- T(undef, ::Int)
- resize!(::T, ::Int)
For compatibility with existing Alphabets, the encoded data eltype must be UInt.
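The immutable part of this interface can be sketched with a small custom type. The Codon type below is hypothetical and not part of BioSequences; it assumes BioSequences v3 and packs exactly three RNA nucleotides into one byte, two bits each:

```julia
using BioSequences

# Hypothetical immutable sequence of exactly three RNA nucleotides,
# packed two bits per nucleotide into one byte.
struct Codon <: BioSequence{RNAAlphabet{2}}
    x::UInt8
end

# Construction from any iterable of length 3 whose elements convert to RNA.
function Codon(iterable)
    length(iterable) == 3 || error("a Codon must have length 3")
    x = zero(UInt)
    for (i, nt) in enumerate(iterable)
        x |= BioSequences.encode(Alphabet(Codon), convert(RNA, nt)) << (6 - 2i)
    end
    return Codon(x % UInt8)
end

# The four required methods of the interface.
Base.length(::Codon) = 3
BioSequences.encoded_data_eltype(::Type{Codon}) = UInt
function BioSequences.extract_encoded_element(c::Codon, i::Integer)
    return ((c.x >>> (6 - 2i)) & 0x03) % UInt
end
Base.copy(c::Codon) = Codon(c.x)

# Generic indexing now works through extract_encoded_element and decode.
codon = Codon("AUG")
println(codon[1], codon[2], codon[3])
```

Because Codon is immutable, it does not define encoded_setindex!, T(undef, ::Int) or resize!. A mutable type would add those three methods.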
Some aliases for BioSequence
are also provided for your convenience:
BioSequences.NucSeq
— Type
An alias for BioSequence{<:NucleicAcidAlphabet}
BioSequences.AASeq
— Type
An alias for BioSequence{AminoAcidAlphabet}
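These aliases are mainly useful for dispatch. The sketch below uses a hypothetical helper, is_nucleotide, and refers to the aliases by their qualified names in case they are not exported in the installed version:

```julia
using BioSequences

# Hypothetical helper: accepts any nucleotide sequence, DNA or RNA,
# regardless of the concrete container type.
is_nucleotide(seq::BioSequences.NucSeq) = true
is_nucleotide(seq::BioSequence) = false

println(is_nucleotide(dna"ACGT"))
println(is_nucleotide(rna"ACGU"))
println(is_nucleotide(aa"MKV"))
println(aa"MKV" isa BioSequences.AASeq)
```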
Let's have a closer look at some of the methods that a subtype of BioSequence must implement. For length, copy and resize!, see the Julia Base library documentation.
BioSequences.encoded_data_eltype
— Function
encoded_data_eltype(::Type{<:BioSequence})
Returns the element type of the encoded data of the BioSequence
. This is the return type of extract_encoded_element
, i.e. the data type that stores the biological symbols in the biosequence.
See also: BioSequence
BioSequences.extract_encoded_element
— Function
extract_encoded_element(::BioSequence{A}, i::Integer)
Returns the encoded element at position i
. This data can be decoded using decode(A(), data)
to yield the element type of the biosequence.
See also: BioSequence
BioSequences.encoded_setindex!
— Function
encoded_setindex!(seq::BioSequence, x::E, i::Integer)
Given encoded data x
of type encoded_data_eltype(typeof(seq))
, sets the internal sequence data at the given index.
See also: BioSequence
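How these three functions fit together can be sketched on a LongSequence. This is an illustration rather than a recommended way to access elements; it assumes BioSequences v3 and calls encode and decode by their qualified names:

```julia
using BioSequences

seq = LongDNA{4}("ACGT")
A = Alphabet(typeof(seq))

# The encoded element type is a property of the sequence type.
println(BioSequences.encoded_data_eltype(typeof(seq)))

# Read the raw encoded value at position 2 and decode it through the alphabet.
data = BioSequences.extract_encoded_element(seq, 2)
println(BioSequences.decode(A, data) == seq[2])

# Write an already-encoded value at position 1. This is roughly what
# ordinary indexed assignment does after encoding the symbol.
BioSequences.encoded_setindex!(seq, BioSequences.encode(A, DNA_G), 1)
println(seq)
```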
For a correctly defined subtype of BioSequence that satisfies the interface, the vast majority of methods described in the rest of this manual should work out of the box. They can always be overloaded if needed. Indeed, some of the generic BioSequence methods are overloaded for LongSequence, for example for transformation and counting operations where efficiency gains can be made due to the specific internal representation of a specific type.
The abstract Alphabet
Alphabets control how biological symbols are encoded and decoded. They also confer many of the automatic traits and methods that any subtype T <: BioSequence{A}, with A <: Alphabet, will get.
BioSequences.Alphabet
— Type
Alphabet
Alphabet
is the most important type trait for BioSequence
. An Alphabet
represents a set of biological symbols encoded by a sequence, e.g. A, C, G and T for a DNA Alphabet that requires only 2 bits to represent each symbol.
Extended help
- Subtypes of Alphabet are singleton structs that may or may not be parameterized.
- Alphabets span over a finite set of biological symbols.
- The alphabet controls the encoding from some internal "encoded data" to a BioSymbol of the alphabet's element type, as well as the decoding, the inverse process.
- An Alphabet's encode method must not produce invalid data.
Every subtype A of Alphabet must implement:
- Base.eltype(::Type{A})::Type{S} for some eltype S, which must be a BioSymbol.
- symbols(::A)::Tuple{Vararg{S}}. This gives a tuple of all symbols in the set of A.
- encode(::A, ::S)::E encodes a symbol to an internal data eltype E.
- decode(::A, ::E)::S decodes an internal data eltype E to a symbol S.
- Except for eltype, which must follow Base conventions, all functions operating on Alphabet should operate on instances of the alphabet, not the type.
If you want interoperation with existing subtypes of BioSequence, the encoded representation E must be of type UInt, and you must also implement:
- BitsPerSymbol(::A)::BitsPerSymbol{N}, where the N must be zero or a power of two in [1, 2, 4, 8, 16, 32, [64 for 64-bit systems]].
For increased performance, see BioSequences.AsciiAlphabet.
BioSequences.AsciiAlphabet
— Type
AsciiAlphabet
Trait for alphabet using ASCII characters as String representation. Define codetype(A) = AsciiAlphabet()
for a user-defined Alphabet
A to gain speed. Methods needed: BioSymbols.stringbyte(::eltype(A))
and ascii_encode(A, ::UInt8)
.
Concrete types
Implemented alphabets
BioSequences.DNAAlphabet
— Type
DNA nucleotide alphabet.
DNAAlphabet
has a parameter N
which is a number that determines the BitsPerSymbol
trait. Currently supported values of N
are 2 and 4.
BioSequences.RNAAlphabet
— Type
RNA nucleotide alphabet.
RNAAlphabet
has a parameter N
which is a number that determines the BitsPerSymbol
trait. Currently supported values of N
are 2 and 4.
BioSequences.AminoAcidAlphabet
— Type
Amino acid alphabet.
Long Sequences
BioSequences.LongSequence
— Type
LongSequence{A <: Alphabet}
General-purpose BioSequence
. This type is mutable and variable-length, and should be preferred for most use cases.
Extended help
LongSequence{A<:Alphabet} <: BioSequence{A}
is parameterized by a concrete Alphabet
type A
that defines the domain (or set) of biological symbols permitted.
As the BioSequence
interface definition implies, LongSequence
s store the biological symbol elements that they contain in a succinct encoded form that permits many operations to be done in an efficient bit-parallel manner. As per the interface of BioSequence
, the Alphabet
determines how an element is encoded or decoded when it is inserted or extracted from the sequence.
For example, AminoAcidAlphabet
is associated with AminoAcid
and hence an object of the LongSequence{AminoAcidAlphabet}
type represents a sequence of amino acids.
Symbols from multiple alphabets can't be intermixed in one sequence type.
The following table summarizes common LongSequence types that have been given aliases for convenience.
Type | Symbol type | Type alias |
---|---|---|
LongSequence{DNAAlphabet{N}} | DNA | LongDNA{N} |
LongSequence{RNAAlphabet{N}} | RNA | LongRNA{N} |
LongSequence{AminoAcidAlphabet} | AminoAcid | LongAA |
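The aliases in the table can be checked directly, and the same sketch shows that a LongSequence can grow in place. It assumes the BioSequences v3 names LongDNA, LongRNA and LongAA:

```julia
using BioSequences

# The aliases are plain type aliases for LongSequence with a given alphabet.
println(LongDNA{4} === LongSequence{DNAAlphabet{4}})
println(LongRNA{2} === LongSequence{RNAAlphabet{2}})
println(LongAA === LongSequence{AminoAcidAlphabet})

# LongSequence is mutable and variable-length.
seq = LongAA("MKV")
push!(seq, AA_L)
println(seq, " has length ", length(seq))
```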
LongDNA{4} and LongRNA{4} use the four-bit alphabets DNAAlphabet{4} and RNAAlphabet{4}, respectively.
DNAAlphabet{4} permits ambiguous nucleotides, and a sequence must use at least 4 bits to internally store each element (and indeed LongSequence does).
If you are sure that you are working with sequences with no ambiguous nucleotides, you can use LongSequences parameterised with DNAAlphabet{2} instead.
DNAAlphabet{2} is an alphabet that uses two bits per base and is limited to the unambiguous nucleotide symbols (A, C, G, T).
Changing this single parameter is all you need to do in order to benefit from memory savings. Some computations that use bitwise operations will also be dramatically faster.
The same applies to LongSequence{RNAAlphabet{4}}: simply replace the alphabet parameter with RNAAlphabet{2} in order to benefit.
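The effect of the alphabet parameter can be sketched by storing the same nucleotides under both alphabets. The sizes are printed rather than stated, since the exact byte counts depend on the Julia and BioSequences versions:

```julia
using BioSequences

text = "ACGT"^250                       # 1000 unambiguous nucleotides

seq4 = LongDNA{4}(text)                 # four bits per nucleotide
seq2 = LongDNA{2}(text)                 # two bits per nucleotide

println("4-bit: ", Base.summarysize(seq4), " bytes")
println("2-bit: ", Base.summarysize(seq2), " bytes")

# Ambiguous symbols such as N cannot be stored in a 2-bit alphabet,
# so construction is expected to throw.
accepted = try
    LongDNA{2}("ACGN")
    true
catch
    false
end
println("LongDNA{2} accepted an N: ", accepted)
```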
Sequence views
Similar to how Base Julia offers views of array objects, BioSequences offers views of LongSequences: the LongSubSeq{A<:Alphabet}.
Conceptually, a LongSubSeq{A} is similar to a LongSequence{A}, but instead of storing its own data, it refers to the data of a LongSequence. Modifying the LongSequence will be reflected in the view, and vice versa. If the underlying LongSequence is truncated, the behaviour of a view is undefined. For the same reason, some operations are not supported for views, such as resizing.
The purpose of LongSubSeq is that, since a view only contains a pointer to the underlying array, an offset and a length, it is much lighter than a LongSequence, and will be stack allocated on Julia 1.5 and newer. Thus, the user may construct millions of views without major performance implications.
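The shared storage between a view and its parent can be sketched as follows, assuming that Base.view on a LongSequence with a unit range returns a LongSubSeq, as in BioSequences v3:

```julia
using BioSequences

seq = LongDNA{4}("ACGTACGT")
v = view(seq, 3:6)                 # refers to positions 3 to 6 of seq

println(v isa BioSequences.LongSubSeq)

v[1] = DNA_N                       # writes through to seq[3]
seq[6] = DNA_T                     # visible in the view as v[4]
println(seq, " ", v)
```

Since the view shares data with seq, neither assignment copies the sequence. Use copy or ordinary range indexing such as seq[3:6] when an independent sequence is needed.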