General-purpose sequences
BioSequence{A}
is a generic sequence type parameterized by an alphabet type A
that defines the domain (or set) of biological symbols, and each alphabet has an associated symbol type. For example, AminoAcidAlphabet
is associated with AminoAcid
and hence an object of the BioSequence{AminoAcidAlphabet}
type represents a sequence of amino acids. Symbols from multiple alphabets can't be intermixed in one sequence type.
The following table summarizes common sequence types that are defined in the BioSequences
module:
Type | Symbol type | Type alias |
---|---|---|
BioSequence{DNAAlphabet{4}} |
DNA |
DNASequence |
BioSequence{RNAAlphabet{4}} |
RNA |
RNASequence |
BioSequence{AminoAcidAlphabet} |
AminoAcid |
AminoAcidSequence |
BioSequence{CharAlphabet} |
Char |
CharSequence |
Parameterized definition of the BioSequence{A}
type is for the purpose of unifying the data structure and operations of any symbol type. In most cases, users don't have to care about it and can use type aliases listed above. However, the alphabet type fixes the internal memory encoding and plays an important role when optimizing performance of a program (see Using a more compact sequence representation section for low-memory encodings). It also enables a user to define their own alphabet only by defining few numbers of methods. This is described in Defining a new alphabet section.
Constructing sequences
Using string literals
Most immediately, sequence literals can be constructed using the string macros dna
, rna
, aa
, and char
:
julia> dna"TACGTANNATC" 11nt DNA Sequence: TACGTANNATC julia> rna"AUUUGNCCANU" 11nt RNA Sequence: AUUUGNCCANU julia> aa"ARNDCQEGHILKMFPSTWYVX" 21aa Amino Acid Sequence: ARNDCQEGHILKMFPSTWYVX julia> char"αβγδϵ" 5char Char Sequence: αβγδϵ
However it should be noted that by default these sequence literals allocate the BioSequence
object before the code containing the sequence literal is run. This means there may be occasions where your program does not behave as you first expect. For example consider the following code:
julia> function foo() s = dna"CTT" push!(s, DNA_A) end foo (generic function with 1 method)
You might expect that every time you call foo
, that a DNA sequence CTTA
would be returned. You might expect that this is because every time foo
is called, a new DNA sequence variable CTT
is created, and and A
nucleotide is pushed to it, and the result, CTTA
is returned. In other words you might expect the following output:
julia> foo() 4nt DNA Sequence: CTTA julia> foo() 4nt DNA Sequence: CTTA julia> foo() 4nt DNA Sequence: CTTA
However, this is not what happens, instead the following happens:
julia> foo() 4nt DNA Sequence: CTTA julia> foo() 5nt DNA Sequence: CTTAA julia> foo() 6nt DNA Sequence: CTTAAA
The reason for this is because the sequence literal is allocated only once before the first time the function foo
is called and run. Therefore, s
in foo
is always a reference to that one sequence that was allocated. So one sequence is created before foo
is called, and then it is pushed to every time foo
is called. Thus, that one allocated sequence grows with every call of foo
.
If you wanted foo
to create a new sequence each time it is called, then you can add a flag to the end of the sequence literal to dictate behaviour: A flag of 's' means 'static': the sequence will be allocated before code is run, as is the default behaviour described above. However providing 'd' flag changes the behaviour: 'd' means 'dynamic': the sequence will be allocated at whilst the code is running, and not before. So to change foo
so as it creates a new sequence each time it is called, simply add the 'd' flag to the sequence literal:
julia> function foo() s = dna"CTT"d # 'd' flag appended to the string literal. push!(s, DNA_A) end foo (generic function with 1 method)
Now every time foo
is called, a new sequence CTT
is created, and an A
nucleotide is pushed to it:
julia> foo() 4nt DNA Sequence: CTTA julia> foo() 4nt DNA Sequence: CTTA julia> foo() 4nt DNA Sequence: CTTA
So the take home message of sequence literals is this:
Be careful when you are using sequence literals inside of functions, and inside the bodies of things like for loops. And if you use them and are unsure, use the 's' and 'd' flags to ensure the behaviour you get is the behaviour you intend.
Other constructors and conversion
Sequences can also be constructed from strings or arrays of nucleotide or amino acid symbols using constructors or the convert
function:
julia> DNASequence("TTANC") 5nt DNA Sequence: TTANC julia> DNASequence([DNA_T, DNA_T, DNA_A, DNA_N, DNA_C]) 5nt DNA Sequence: TTANC julia> convert(DNASequence, [DNA_T, DNA_T, DNA_A, DNA_N, DNA_C]) 5nt DNA Sequence: TTANC
Using convert
, these operations are reversible: sequences can be converted to strings or arrays:
julia> convert(String, dna"TTANGTA") "TTANGTA" julia> convert(Vector{DNA}, dna"TTANGTA") 7-element Array{BioSymbols.DNA,1}: DNA_T DNA_T DNA_A DNA_N DNA_G DNA_T DNA_A
Sequences can also be concatenated into longer sequences:
julia> DNASequence(dna"ACGT", dna"NNNN", dna"TGCA") 12nt DNA Sequence: ACGTNNNNTGCA julia> dna"ACGT" * dna"TGCA" 8nt DNA Sequence: ACGTTGCA julia> repeat(dna"TA", 10) 20nt DNA Sequence: TATATATATATATATATATA julia> dna"TA" ^ 10 20nt DNA Sequence: TATATATATATATATATATA
Despite being separate types, DNASequence
and RNASequence
can freely be converted between efficiently without copying the underlying data:
julia> dna = dna"TTANGTAGACCG" 12nt DNA Sequence: TTANGTAGACCG julia> rna = convert(RNASequence, dna) 12nt RNA Sequence: UUANGUAGACCG julia> dna.data === rna.data # underlying data are same true
A random sequence can be obtained by the randdnaseq
, randrnaseq
and randaaseq
functions, which generate DNASequence
, RNASequence
and AminoAcidSequence
, respectively. Generated sequences are composed of the standard symbols without ambiguity and gap. For example, randdnaseq(6)
may generate dna"TCATAG"
but never generates dna"TNANAG"
or dna"T-ATAG"
.
A translatable RNASequence
can also be converted to an AminoAcidSequence
using the translate
function.
Indexing, modifying and transformations
Getindex
Sequences for the most part behave like other vector or string types. They can be indexed using integers or ranges:
julia> seq = dna"ACGTTTANAGTNNAGTACC" 19nt DNA Sequence: ACGTTTANAGTNNAGTACC julia> seq[5] DNA_T julia> seq[6:end] 14nt DNA Sequence: TANAGTNNAGTACC
Note that, indexing a biological sequence by range creates a subsequence of the original sequence. Unlike Arrays
in the standard library, creating a subsequence is copy-free: a subsequence simply points to the original sequence data with its range. You may think that this is unsafe because modifying subsequences propagates to the original sequence, but this doesn't happen actually:
julia> seq = dna"AAAA" # create a sequence 4nt DNA Sequence: AAAA julia> subseq = seq[1:2] # create a subsequence from `seq` 2nt DNA Sequence: AA julia> subseq[2] = DNA_T # modify the second element of it DNA_T julia> subseq # the subsequence is modified 2nt DNA Sequence: AT julia> seq # but the original sequence is not 4nt DNA Sequence: AAAA
This is because modifying a sequence checks whether its underlying data are shared with other sequences under the hood. If and only if the data are shared, the subsequence creates a copy of itself. Any modifying operation does this check. This is called copy-on-write strategy and users don't need to care about it because it is transparent: If the user modifies a sequence with or subsequence, the job of managing and protecting the underlying data of sequences is handled for them.
Setindex and modifying DNA sequences
The biological symbol at a given locus in a biological sequence can be set using setindex:
julia> seq = dna"ACGTTTANAGTNNAGTACC" 19nt DNA Sequence: ACGTTTANAGTNNAGTACC julia> seq[5] = DNA_A DNA_A
In addition, many other modifying operations are possible for biological sequences such as push!
, pop!
, and insert!
, which should be familiar to people used to editing arrays.
push! pop! shift! unshift! insert! deleteat!(::BioSequences.BioSequence, ::Integer) append! copy!
Here are some examples:
julia> seq = dna"ACG" 3nt DNA Sequence: ACG julia> push!(seq, DNA_T) 4nt DNA Sequence: ACGT julia> append!(seq, dna"AT") 6nt DNA Sequence: ACGTAT julia> reverse!(seq) 6nt DNA Sequence: TATGCA julia> complement!(seq) 6nt DNA Sequence: ATACGT julia> reverse_complement!(seq) 6nt DNA Sequence: ACGTAT julia> deleteat!(seq, 2) 5nt DNA Sequence: AGTAT julia> deleteat!(seq, 2:3) 3nt DNA Sequence: AAT
Additional transformations
In addition to these basic modifying functions, other sequence transformations which are common in bioinformatics are also provided.
#
Base.reverse!
— Function.
reverse!(v [, start=1 [, stop=length(v) ]]) -> v
In-place version of reverse
.
#
BioSequences.complement!
— Function.
complement!(seq)
Make a complement sequence of seq
in place.
complement!(seq)
Transform seq
into it's complement.
#
BioSequences.reverse_complement!
— Function.
reverse_complement!(seq)
Make a reversed complement sequence of seq
in place.
Ambiguous nucleotides are left as-is.
#
BioSequences.ungap
— Function.
Create a copy of a sequence with gap characters removed.
#
BioSequences.ungap!
— Function.
Remove gap characters from a sequence. Modifies the input sequence.
julia> seq = dna"ACG" 3nt DNA Sequence: ACG julia> push!(seq, DNA_T) 4nt DNA Sequence: ACGT julia> append!(seq, dna"AT") 6nt DNA Sequence: ACGTAT julia> reverse!(seq) 6nt DNA Sequence: TATGCA julia> complement!(seq) 6nt DNA Sequence: ATACGT julia> reverse_complement!(seq) 6nt DNA Sequence: ACGTAT
Translation
Translation is a slightly more complex transformation for RNA Sequences and so we describe it here in more detail.
The translate
funtion translates a sequence of codons in a RNA sequence to a amino acid sequence besed on a genetic code mapping. The BioSequences
module contains all NCBI defined genetic codes and they are registered in ncbi_trans_table
.
#
BioSequences.translate
— Function.
translate(rna_seq, code=standard_genetic_code, allow_ambiguous_codons=true)
Translate an RNASequence
to an AminoAcidSequence
.
Translation uses genetic code code
to map codons to amino acids. See ncbi_trans_table
for available genetic codes. If codons in the given RNA sequence cannot determine a unique amino acid, they will be translated to AA_X
if allow_ambiguous_codons
is true
and otherwise result in an error.
#
BioSequences.ncbi_trans_table
— Constant.
Genetic code list of NCBI.
The standard genetic code is ncbi_trans_table[1]
and others can be shown by show(ncbi_trans_table)
. For more details, consult the next link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.
julia> ncbi_trans_table Translation Tables: 1. The Standard Code (standard_genetic_code) 2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code) 3. The Yeast Mitochondrial Code (yeast_mitochondrial_genetic_code) 4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (mold_mitochondrial_genetic_code) 5. The Invertebrate Mitochondrial Code (invertebrate_mitochondrial_genetic_code) 6. The Ciliate, Dasycladacean and Hexamita Nuclear Code (ciliate_nuclear_genetic_code) 9. The Echinoderm and Flatworm Mitochondrial Code (echinoderm_mitochondrial_genetic_code) 10. The Euplotid Nuclear Code (euplotid_nuclear_genetic_code) 11. The Bacterial, Archaeal and Plant Plastid Code (bacterial_plastid_genetic_code) 12. The Alternative Yeast Nuclear Code (alternative_yeast_nuclear_genetic_code) 13. The Ascidian Mitochondrial Code (ascidian_mitochondrial_genetic_code) 14. The Alternative Flatworm Mitochondrial Code (alternative_flatworm_mitochondrial_genetic_code) 16. Chlorophycean Mitochondrial Code (chlorophycean_mitochondrial_genetic_code) 21. Trematode Mitochondrial Code (trematode_mitochondrial_genetic_code) 22. Scenedesmus obliquus Mitochondrial Code (scenedesmus_obliquus_mitochondrial_genetic_code) 23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code) 24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code) 25. Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)
http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes
Site counting
BioSequences extends the Base.count
method to provide some useful utilities for counting the number of sites in biological sequences.
Site types
Different types of site can be counted. Each of these types is a concrete subtype of the abstract type Site
:
#
BioSequences.Certain
— Type.
A Certain
site describes a site where both of two aligned sites are not an ambiguity symbol or a gap.
#
BioSequences.Gap
— Type.
An Gap
site describes a site where either of two aligned sites are a gap symbol '-'.
#
BioSequences.Ambiguous
— Type.
An Ambiguous
site describes a site where either of two aligned sites are an ambiguity symbol.
#
BioSequences.Match
— Type.
A Match
site describes a site where two aligned nucleotides are the same biological symbol.
#
BioSequences.Mismatch
— Type.
A Mismatch
site describes a site where two aligned nucleotides are not the same biological symbol.
Base.count
methods
The count method can be used with two sequences and a concrete subtype of Site
:
julia> count(Match, dna"ATCGATCG", dna"AAGGTTCG") 5
By providing a window
and step
size, counting can be done from within a sliding window:
julia> count(Match, dna"ATCGATCG", dna"AAGGTTCG", 3, 1) 6-element Array{IntervalTrees.IntervalValue{Int64,Int64},1}: IntervalTrees.IntervalValue{Int64,Int64} (1,3) => 1 IntervalTrees.IntervalValue{Int64,Int64} (2,4) => 1 IntervalTrees.IntervalValue{Int64,Int64} (3,5) => 1 IntervalTrees.IntervalValue{Int64,Int64} (4,6) => 2 IntervalTrees.IntervalValue{Int64,Int64} (5,7) => 2 IntervalTrees.IntervalValue{Int64,Int64} (6,8) => 3
The pairwise_count
function
Counting can also be done on a set of sequences in a pairwise manner with the count_pairwise
function:
julia> count_pairwise(Match, dna"ATCGCCA-", dna"ATCGCCTA", dna"ATCGCCT-", dna"GTCGCCTA") 4×4 PairwiseListMatrices.PairwiseListMatrix{Int64,false}: 0 6 7 5 6 0 7 7 7 7 0 6 5 7 6 0
Iteration
Sequences also work as iterators over symbols:
julia> n = 0 0 julia> for nt in dna"ATNGNNT" if nt == DNA_N n += 1 end end julia> n 3
Using a more compact sequence representation
As we saw above, DNA and RNA sequences can store any ambiguous nucleotides like 'N'. If you are sure that nucleotide sequences store unambiguous nucleotides only, you can save the memory space of sequences. DNAAlphabet{2}
is an alphabet that uses two bits per base and limits to only unambiguous nucleotide symbols (ACGT in DNA and ACGU in RNA). To create a sequence of this alphabet, you need to explicitly pass DNAAlphabet{2}
to BioSequence
as its type parameter:
julia> seq = BioSequence{DNAAlphabet{2}}("ACGT") 4nt DNA Sequence: ACGT
Recall that DNASequence
is a type alias of BioSequence{DNAAlphabet{4}}
, which uses four bits per base. That is, BioSequence{DNAAlphabet{2}}
saves half of memory footprint compared to BioSequence{DNAAlphabet{4}}
. If you need to handle reference genomes that are composed of five nucleotides, ACGTN, consider to use the ReferenceSequence
type described in the Reference sequences section.
Defining a new alphabet
The alphabet type parameter A
of BioSequence{A}
enables a user to extend functionality of BioSequence
with minimum effort. As an example, definition of a new alphabet type representing a sequence of boolean values is shown below:
julia> immutable BoolAlphabet <: Alphabet end julia> BioSequences.bitsof(::Type{BoolAlphabet}) = 1 julia> BioSequences.eltype(::Type{BoolAlphabet}) = Bool julia> BioSequences.alphabet(::Type{BoolAlphabet}) = false:true julia> function BioSequences.encode(::Type{BoolAlphabet}, x::Bool) return UInt64(ifelse(x, 0x01, 0x00)) end julia> function BioSequences.decode(::Type{BoolAlphabet}, x::UInt64) if x > 0x01 throw(BioSequences.DecodeError(BoolAlphabet, x)) end return ifelse(x == 0x00, false, true) end