Overview

Biological sequences

The BioSequences module provides representations and tools for manipulating nucleotide and amino acid sequences.

Introduction to the sequence data-types

Sequences in BioSequences.jl are more strictly typed than in many other libraries: the elements of a sequence are biological symbols rather than characters or bytes. Sequences are special-purpose types rather than plain strings, and so offer functionality that naive string types lack. Although this strictness sacrifices some convenience, it means you can rely on a DNA sequence type to store DNA and nothing but DNA, without having to validate the contents or handle lowercase versus uppercase. Strict separation of sequence types also leaves us free to choose the most efficient representation for each. DNA and RNA sequences are encoded using either four bits per base (the default) or two bits per base. This makes them memory efficient and speeds up many common operations and transformations, such as nucleotide composition, reverse complement, and k-mer enumeration.
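As a rough illustration of typed elements and the two encodings, the sketch below builds one sequence with the default four-bit alphabet and one with the two-bit alphabet. The names used here (the dna"..." string literal, DNAAlphabet{2}, DNA_A and reverse_complement) do not appear in this section; they are assumed from the BioSequences API of the release this page describes and may differ in other versions.

```julia
using BioSequences

# The dna"..." literal is assumed to build a DNA sequence with the
# default four-bit encoding, which can also hold ambiguity symbols like N.
seq = dna"ACGTN"

# Elements are biological symbols, not characters: this is a DNA value.
first_base = seq[1]

# A two-bit alphabet stores only A, C, G and T, using half the bits per
# base of the default encoding.
compact = BioSequence{DNAAlphabet{2}}("ACGT")

# Transformations such as reverse complement work on the packed sequence.
rc = reverse_complement(compact)
```

Because the two-bit alphabet has no room for ambiguity symbols, a sequence containing N would need the four-bit alphabet.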

The BioSequences module provides three sequence types: BioSequence, Kmer and ReferenceSequence. Each is a subtype of the abstract type Sequence and supports string-like operations such as random access and iteration. The types differ in their features. In most situations the BioSequence type will do, and it is used as the default representation, but the other types can be preferable in terms of memory efficiency and computational performance. The following table summarizes the three types:

| Type | Description | Element type | Mutability | Allocation |
|------|-------------|--------------|------------|------------|
| BioSequence{A<:Alphabet} | general-purpose biological sequences | DNA, RNA, Amino acids | mutable | heap |
| Kmer{T<:NucleicAcid,k} | specialized for short nucleotide sequences | DNA, RNA | immutable | stack / register |
| ReferenceSequence | specialized for long reference genomes | DNA | immutable | heap |
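To make the comparison concrete, here is a minimal sketch that constructs one value of each type. The constructors shown (the dna"..." literal, DNAKmer and ReferenceSequence called on a DNA sequence) are assumptions based on the BioSequences release this page describes and are not defined in this section; check them against the version you have installed.

```julia
using BioSequences

# BioSequence: general purpose and mutable, so elements can be replaced.
general = dna"ACGTNACGT"
general[5] = DNA_A

# Kmer: a short, immutable nucleotide sequence without ambiguity symbols.
kmer = DNAKmer("ACGT")

# ReferenceSequence: an immutable representation aimed at long reference
# genomes, which may contain stretches of N.
ref = ReferenceSequence(dna"NNNNACGTACGT")
```

All three support random access and iteration, so code that only reads elements can usually accept any of them.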

Details of these different representations are explained in the following sections: