General-purpose sequences
BioSequence{A}
is a generic sequence type parameterized by an alphabet type A
that defines the domain (or set) of biological symbols, and each alphabet has an associated symbol type. For example, AminoAcidAlphabet
is associated with AminoAcid
and hence an object of the BioSequence{AminoAcidAlphabet}
type represents a sequence of amino acids. Symbols from multiple alphabets can't be intermixed in one sequence type.
The following table summarizes common sequence types that are defined in the BioSequences
module:
Type | Symbol type | Type alias |
---|---|---|
BioSequence{DNAAlphabet{4}} | DNA | DNASequence |
BioSequence{RNAAlphabet{4}} | RNA | RNASequence |
BioSequence{AminoAcidAlphabet} | AminoAcid | AminoAcidSequence |
BioSequence{CharAlphabet} | Char | CharSequence |
Parameterized definition of the BioSequence{A}
type is for the purpose of unifying the data structure and operations of any symbol type. In most cases, users don't have to care about it and can use type aliases listed above. However, the alphabet type fixes the internal memory encoding and plays an important role when optimizing performance of a program (see Using a more compact sequence representation section for low-memory encodings). It also enables a user to define their own alphabet only by defining few numbers of methods. This is described in Defining a new alphabet section.
Constructing sequences
Using string literals
Most immediately, sequence literals can be constructed using the string macros dna
, rna
, aa
, and char
:
julia> dna"TACGTANNATC"
11nt DNA Sequence:
TACGTANNATC
julia> rna"AUUUGNCCANU"
11nt RNA Sequence:
AUUUGNCCANU
julia> aa"ARNDCQEGHILKMFPSTWYVX"
21aa Amino Acid Sequence:
ARNDCQEGHILKMFPSTWYVX
julia> char"αβγδϵ"
5char Char Sequence:
αβγδϵ
However it should be noted that by default these sequence literals allocate the BioSequence
object before the code containing the sequence literal is run. This means there may be occasions where your program does not behave as you first expect. For example consider the following code:
julia> function foo()
s = dna"CTT"
push!(s, DNA_A)
end
foo (generic function with 1 method)
You might expect that every time you call foo
, that a DNA sequence CTTA
would be returned. You might expect that this is because every time foo
is called, a new DNA sequence variable CTT
is created, and and A
nucleotide is pushed to it, and the result, CTTA
is returned. In other words you might expect the following output:
julia> foo()
4nt DNA Sequence:
CTTA
julia> foo()
4nt DNA Sequence:
CTTA
julia> foo()
4nt DNA Sequence:
CTTA
However, this is not what happens, instead the following happens:
julia> foo()
4nt DNA Sequence:
CTTA
julia> foo()
5nt DNA Sequence:
CTTAA
julia> foo()
6nt DNA Sequence:
CTTAAA
The reason for this is because the sequence literal is allocated only once before the first time the function foo
is called and run. Therefore, s
in foo
is always a reference to that one sequence that was allocated. So one sequence is created before foo
is called, and then it is pushed to every time foo
is called. Thus, that one allocated sequence grows with every call of foo
.
If you wanted foo
to create a new sequence each time it is called, then you can add a flag to the end of the sequence literal to dictate behaviour: A flag of 's' means 'static': the sequence will be allocated before code is run, as is the default behaviour described above. However providing 'd' flag changes the behaviour: 'd' means 'dynamic': the sequence will be allocated at whilst the code is running, and not before. So to change foo
so as it creates a new sequence each time it is called, simply add the 'd' flag to the sequence literal:
julia> function foo()
s = dna"CTT"d # 'd' flag appended to the string literal.
push!(s, DNA_A)
end
foo (generic function with 1 method)
Now every time foo
is called, a new sequence CTT
is created, and an A
nucleotide is pushed to it:
julia> foo()
4nt DNA Sequence:
CTTA
julia> foo()
4nt DNA Sequence:
CTTA
julia> foo()
4nt DNA Sequence:
CTTA
So the take home message of sequence literals is this:
Be careful when you are using sequence literals inside of functions, and inside the bodies of things like for loops. And if you use them and are unsure, use the 's' and 'd' flags to ensure the behaviour you get is the behaviour you intend.
Other constructors and conversion
Sequences can also be constructed from strings or arrays of nucleotide or amino acid symbols using constructors or the convert
function:
julia> DNASequence("TTANC")
5nt DNA Sequence:
TTANC
julia> DNASequence([DNA_T, DNA_T, DNA_A, DNA_N, DNA_C])
5nt DNA Sequence:
TTANC
julia> convert(DNASequence, [DNA_T, DNA_T, DNA_A, DNA_N, DNA_C])
5nt DNA Sequence:
TTANC
Using convert
, these operations are reversible: sequences can be converted to strings or arrays:
julia> convert(String, dna"TTANGTA")
"TTANGTA"
julia> convert(Vector{DNA}, dna"TTANGTA")
7-element Array{DNA,1}:
DNA_T
DNA_T
DNA_A
DNA_N
DNA_G
DNA_T
DNA_A
Sequences can also be concatenated into longer sequences:
julia> DNASequence(dna"ACGT", dna"NNNN", dna"TGCA")
12nt DNA Sequence:
ACGTNNNNTGCA
julia> dna"ACGT" * dna"TGCA"
8nt DNA Sequence:
ACGTTGCA
julia> repeat(dna"TA", 10)
20nt DNA Sequence:
TATATATATATATATATATA
julia> dna"TA" ^ 10
20nt DNA Sequence:
TATATATATATATATATATA
Despite being separate types, DNASequence
and RNASequence
can freely be converted between efficiently without copying the underlying data:
julia> dna = dna"TTANGTAGACCG"
12nt DNA Sequence:
TTANGTAGACCG
julia> rna = convert(RNASequence, dna)
12nt RNA Sequence:
UUANGUAGACCG
julia> dna.data === rna.data # underlying data are same
true
A random sequence can be obtained by the randdnaseq
, randrnaseq
and randaaseq
functions, which generate DNASequence
, RNASequence
and AminoAcidSequence
, respectively. Generated sequences are composed of the standard symbols without ambiguity and gap. For example, randdnaseq(6)
may generate dna"TCATAG"
but never generates dna"TNANAG"
or dna"T-ATAG"
.
A translatable RNASequence
can also be converted to an AminoAcidSequence
using the translate
function.
Indexing, modifying and transformations
Getindex
Sequences for the most part behave like other vector or string types. They can be indexed using integers or ranges:
julia> seq = dna"ACGTTTANAGTNNAGTACC"
19nt DNA Sequence:
ACGTTTANAGTNNAGTACC
julia> seq[5]
DNA_T
julia> seq[6:end]
14nt DNA Sequence:
TANAGTNNAGTACC
Note that, indexing a biological sequence by range creates a subsequence of the original sequence. Unlike Arrays
in the standard library, creating a subsequence is copy-free: a subsequence simply points to the original sequence data with its range. You may think that this is unsafe because modifying subsequences propagates to the original sequence, but this doesn't happen actually:
julia> seq = dna"AAAA" # create a sequence
4nt DNA Sequence:
AAAA
julia> subseq = seq[1:2] # create a subsequence from `seq`
2nt DNA Sequence:
AA
julia> subseq[2] = DNA_T # modify the second element of it
DNA_T
julia> subseq # the subsequence is modified
2nt DNA Sequence:
AT
julia> seq # but the original sequence is not
4nt DNA Sequence:
AAAA
This is because modifying a sequence checks whether its underlying data are shared with other sequences under the hood. If and only if the data are shared, the subsequence creates a copy of itself. Any modifying operation does this check. This is called copy-on-write strategy and users don't need to care about it because it is transparent: If the user modifies a sequence with or subsequence, the job of managing and protecting the underlying data of sequences is handled for them.
Setindex and modifying DNA sequences
The biological symbol at a given locus in a biological sequence can be set using setindex:
julia> seq = dna"ACGTTTANAGTNNAGTACC"
19nt DNA Sequence:
ACGTTTANAGTNNAGTACC
julia> seq[5] = DNA_A
DNA_A
In addition, many other modifying operations are possible for biological sequences such as push!
, pop!
, and insert!
, which should be familiar to people used to editing arrays.
Base.push!
— Function.push!(collection, items...) -> collection
Insert one or more items
at the end of collection
.
Examples
julia> push!([1, 2, 3], 4, 5, 6)
6-element Array{Int64,1}:
1
2
3
4
5
6
Use append!
to add all the elements of another collection to collection
. The result of the preceding example is equivalent to append!([1, 2, 3], [4, 5, 6])
.
Base.pop!
— Function.pop!(collection) -> item
Remove an item in collection
and return it. If collection
is an ordered container, the last item is returned.
Examples
julia> A=[1, 2, 3]
3-element Array{Int64,1}:
1
2
3
julia> pop!(A)
3
julia> A
2-element Array{Int64,1}:
1
2
julia> S = Set([1, 2])
Set([2, 1])
julia> pop!(S)
2
julia> S
Set([1])
julia> pop!(Dict(1=>2))
1 => 2
pop!(collection, key[, default])
Delete and return the mapping for key
if it exists in collection
, otherwise return default
, or throw an error if default
is not specified.
Examples
julia> d = Dict("a"=>1, "b"=>2, "c"=>3);
julia> pop!(d, "a")
1
julia> pop!(d, "d")
ERROR: KeyError: key "d" not found
Stacktrace:
[...]
julia> pop!(d, "e", 4)
4
pop!(sc, k)
Deletes the item with key k
in SortedDict or SortedSet sc
and returns the value that was associated with k
in the case of SortedDict or k
itself in the case of SortedSet. A KeyError
results if k
is not in sc
. Time: O(c log n)
pop!(sc, k)
Deletes the item with key k
in SortedDict or SortedSet sc
and returns the value that was associated with k
in the case of SortedDict or k
itself in the case of SortedSet. A KeyError
results if k
is not in sc
. Time: O(c log n)
pop!(ss)
Deletes the item with first key in SortedSet ss
and returns the key. A BoundsError
results if ss
is empty. Time: O(c log n)
pop!(cb)
Remove the element at the back.
pop!(seq::BioSequence)
Remove the symbol from the end of a biological sequence seq
and return it. Returns a variable of eltype(seq)
.
Base.popfirst!
— Function.popfirst!(collection) -> item
Remove the first item
from collection
.
Examples
julia> A = [1, 2, 3, 4, 5, 6]
6-element Array{Int64,1}:
1
2
3
4
5
6
julia> popfirst!(A)
1
julia> A
5-element Array{Int64,1}:
2
3
4
5
6
popfirst!(cb)
Remove the first item (at the back) from CircularBuffer.
popfirst!(seq)
Remove the symbol from the beginning of a biological sequence seq
and return it. Returns a variable of eltype(seq)
.
Base.pushfirst!
— Function.pushfirst!(collection, items...) -> collection
Insert one or more items
at the beginning of collection
.
Examples
julia> pushfirst!([1, 2, 3, 4], 5, 6)
6-element Array{Int64,1}:
5
6
1
2
3
4
pushfirst!(cb, data)
Insert one or more items at the beginning of CircularBuffer and overwrite back if full.
pushfirst!(seq, x)
Insert a biological symbol x
at the beginning of a biological sequence seq
.
Base.insert!
— Function.insert!(a::Vector, index::Integer, item)
Insert an item
into a
at the given index
. index
is the index of item
in the resulting a
.
Examples
julia> insert!([6, 5, 4, 2, 1], 4, 3)
6-element Array{Int64,1}:
6
5
4
3
2
1
insert!(sc, k)
Argument sc
is a SortedDict or SortedMultiDict, k
is a key and v
is the corresponding value. This inserts the (k,v)
pair into the container. If the key is already present in a SortedDict, this overwrites the old value. In the case of SortedMultiDict, no overwriting takes place (since SortedMultiDict allows the same key to associate with multiple values). In the case of SortedDict, the return value is a pair whose first entry is boolean and indicates whether the insertion was new (i.e., the key was not previously present) and the second entry is the semitoken of the new entry. In the case of SortedMultiDict, a semitoken is returned (but no boolean). Time: O(c log n)
insert!(sc, k)
Argument sc
is a SortedDict or SortedMultiDict, k
is a key and v
is the corresponding value. This inserts the (k,v)
pair into the container. If the key is already present in a SortedDict, this overwrites the old value. In the case of SortedMultiDict, no overwriting takes place (since SortedMultiDict allows the same key to associate with multiple values). In the case of SortedDict, the return value is a pair whose first entry is boolean and indicates whether the insertion was new (i.e., the key was not previously present) and the second entry is the semitoken of the new entry. In the case of SortedMultiDict, a semitoken is returned (but no boolean). Time: O(c log n)
insert!(sc, k)
Argument sc
is a SortedSet and k
is a key. This inserts the key into the container. If the key is already present, this overwrites the old value. (This is not necessarily a no-op; see below for remarks about the customizing the sort order.) The return value is a pair whose first entry is boolean and indicates whether the insertion was new (i.e., the key was not previously present) and the second entry is the semitoken of the new entry. Time: O(c log n)
insert!(seq, i, x)
Insert a biological symbol x
into a biological sequence seq
, at the given index i
.
Base.deleteat!
— Method.deleteat!(seq::BioSequence, i::Integer)
Delete a biological symbol at a single position i
in a biological sequence seq
.
Modifies the input sequence.
Base.append!
— Function.append!(collection, collection2) -> collection.
Add the elements of collection2
to the end of collection
.
Examples
julia> append!([1],[2,3])
3-element Array{Int64,1}:
1
2
3
julia> append!([1, 2, 3], [4, 5, 6])
6-element Array{Int64,1}:
1
2
3
4
5
6
Use push!
to add individual items to collection
which are not already themselves in another collection. The result is of the preceding example is equivalent to push!([1, 2, 3], 4, 5, 6)
.
Write part of a byte array.
append!(cb, datavec)
Push at most last capacity
items.
append!(seq, other)
Add a biological sequence other
onto the end of biological sequence seq
. Modifies and returns seq
.
Base.resize!
— Function.resize!(a::Vector, n::Integer) -> Vector
Resize a
to contain n
elements. If n
is smaller than the current collection length, the first n
elements will be retained. If n
is larger, the new elements are not guaranteed to be initialized.
Examples
julia> resize!([6, 5, 4, 3, 2, 1], 3)
3-element Array{Int64,1}:
6
5
4
julia> a = resize!([6, 5, 4, 3, 2, 1], 8);
julia> length(a)
8
julia> a[1:6]
6-element Array{Int64,1}:
6
5
4
3
2
1
resize!(seq, size)
Resize a biological sequence seq
, to a given size
.
Here are some examples:
julia> seq = dna"ACG"
3nt DNA Sequence:
ACG
julia> push!(seq, DNA_T)
4nt DNA Sequence:
ACGT
julia> append!(seq, dna"AT")
6nt DNA Sequence:
ACGTAT
julia> deleteat!(seq, 2)
5nt DNA Sequence:
AGTAT
julia> deleteat!(seq, 2:3)
3nt DNA Sequence:
AAT
Additional transformations
In addition to these basic modifying functions, other sequence transformations which are common in bioinformatics are also provided.
Base.reverse!
— Method.reverse!(seq)
Reverse a biological sequence seq
in place.
BioSequences.complement!
— Function.complement!(seq)
Make a complement sequence of seq
in place.
complement!(seq)
Transform seq
into it's complement.
BioSequences.reverse_complement!
— Function.reverse_complement!(seq)
Make a reversed complement sequence of seq
in place.
Ambiguous nucleotides are left as-is.
BioSequences.ungap!
— Function.Remove gap characters from a sequence. Modifies the input sequence.
Base.empty!
— Function.empty!(collection) -> collection
Remove all elements from a collection
.
Examples
julia> A = Dict("a" => 1, "b" => 2)
Dict{String,Int64} with 2 entries:
"b" => 2
"a" => 1
julia> empty!(A);
julia> A
Dict{String,Int64} with 0 entries
empty!(cb)
Reset the buffer.
empty!(seq)
Completely empty a biological sequence seq
of nucleotides.
Some examples:
julia> seq = dna"ACGTAT"
6nt DNA Sequence:
ACGTAT
julia> reverse!(seq)
6nt DNA Sequence:
TATGCA
julia> complement!(seq)
6nt DNA Sequence:
ATACGT
julia> reverse_complement!(seq)
6nt DNA Sequence:
ACGTAT
Many of these methods also have a version which makes a copy of the input sequence, so you get a modified copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse
instead of reverse!
, and ungap
instead of ungap!
.
Translation
Translation is a slightly more complex transformation for RNA Sequences and so we describe it here in more detail.
The translate
funtion translates a sequence of codons in a RNA sequence to a amino acid sequence besed on a genetic code mapping. The BioSequences
module contains all NCBI defined genetic codes and they are registered in ncbi_trans_table
.
BioSequences.translate
— Function.translate(rna_seq, code=standard_genetic_code, allow_ambiguous_codons=true)
Translate an RNASequence
to an AminoAcidSequence
.
Translation uses genetic code code
to map codons to amino acids. See ncbi_trans_table
for available genetic codes. If codons in the given RNA sequence cannot determine a unique amino acid, they will be translated to AA_X
if allow_ambiguous_codons
is true
and otherwise result in an error.
BioSequences.ncbi_trans_table
— Constant.Genetic code list of NCBI.
The standard genetic code is ncbi_trans_table[1]
and others can be shown by show(ncbi_trans_table)
. For more details, consult the next link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.
julia> ncbi_trans_table
Translation Tables:
1. The Standard Code (standard_genetic_code)
2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code)
3. The Yeast Mitochondrial Code (yeast_mitochondrial_genetic_code)
4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (mold_mitochondrial_genetic_code)
5. The Invertebrate Mitochondrial Code (invertebrate_mitochondrial_genetic_code)
6. The Ciliate, Dasycladacean and Hexamita Nuclear Code (ciliate_nuclear_genetic_code)
9. The Echinoderm and Flatworm Mitochondrial Code (echinoderm_mitochondrial_genetic_code)
10. The Euplotid Nuclear Code (euplotid_nuclear_genetic_code)
11. The Bacterial, Archaeal and Plant Plastid Code (bacterial_plastid_genetic_code)
12. The Alternative Yeast Nuclear Code (alternative_yeast_nuclear_genetic_code)
13. The Ascidian Mitochondrial Code (ascidian_mitochondrial_genetic_code)
14. The Alternative Flatworm Mitochondrial Code (alternative_flatworm_mitochondrial_genetic_code)
16. Chlorophycean Mitochondrial Code (chlorophycean_mitochondrial_genetic_code)
21. Trematode Mitochondrial Code (trematode_mitochondrial_genetic_code)
22. Scenedesmus obliquus Mitochondrial Code (scenedesmus_obliquus_mitochondrial_genetic_code)
23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code)
24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code)
25. Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)
http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes
Site counting
BioSequences extends the Base.count
method to provide some useful utilities for counting the number of sites in biological sequences.
Site types
Different types of site can be counted. Each of these types is a concrete subtype of the abstract type Site
:
BioSequences.Certain
— Type.A Certain
site describes a site where both of two aligned sites are not an ambiguity symbol or a gap.
BioSequences.Gap
— Type.An Gap
site describes a site where either of two aligned sites are a gap symbol '-'.
BioSequences.Ambiguous
— Type.An Ambiguous
site describes a site where either of two aligned sites are an ambiguity symbol.
BioSequences.Match
— Type.A Match
site describes a site where two aligned nucleotides are the same biological symbol.
BioSequences.Mismatch
— Type.A Mismatch
site describes a site where two aligned nucleotides are not the same biological symbol.
Base.count
methods
The count method can be used with two sequences and a concrete subtype of Site
:
julia> count(Match, dna"ATCGATCG", dna"AAGGTTCG")
5
By providing a window
and step
size, counting can be done from within a sliding window:
julia> count(Match, dna"ATCGATCG", dna"AAGGTTCG", 3, 1)
6-element Array{IntervalTrees.IntervalValue{Int64,Int64},1}:
IntervalTrees.IntervalValue{Int64,Int64}
(1,3) => 1
IntervalTrees.IntervalValue{Int64,Int64}
(2,4) => 1
IntervalTrees.IntervalValue{Int64,Int64}
(3,5) => 1
IntervalTrees.IntervalValue{Int64,Int64}
(4,6) => 2
IntervalTrees.IntervalValue{Int64,Int64}
(5,7) => 2
IntervalTrees.IntervalValue{Int64,Int64}
(6,8) => 3
The pairwise_count
function
Counting can also be done on a set of sequences in a pairwise manner with the count_pairwise
function:
julia> count_pairwise(Match, dna"ATCGCCA-", dna"ATCGCCTA", dna"ATCGCCT-", dna"GTCGCCTA")
4×4 Array{Int64,2}:
0 6 7 5
6 0 7 7
7 7 0 6
5 7 6 0
Iteration
Sequences also work as iterators over symbols:
julia> n = 0
0
julia> for nt in dna"ATNGNNT"
if nt == DNA_N
global n += 1
end
end
julia> n
3
Using a more compact sequence representation
As we saw above, DNA and RNA sequences can store any ambiguous nucleotides like 'N'. If you are sure that nucleotide sequences store unambiguous nucleotides only, you can save the memory space of sequences. DNAAlphabet{2}
is an alphabet that uses two bits per base and limits to only unambiguous nucleotide symbols (ACGT in DNA and ACGU in RNA). To create a sequence of this alphabet, you need to explicitly pass DNAAlphabet{2}
to BioSequence
as its type parameter:
julia> seq = BioSequence{DNAAlphabet{2}}("ACGT")
4nt DNA Sequence:
ACGT
Recall that DNASequence
is a type alias of BioSequence{DNAAlphabet{4}}
, which uses four bits per base. That is, BioSequence{DNAAlphabet{2}}
saves half of memory footprint compared to BioSequence{DNAAlphabet{4}}
. If you need to handle reference genomes that are composed of five nucleotides, ACGTN, consider to use the ReferenceSequence
type described in the Reference sequences section.
Defining a new alphabet
The alphabet type parameter A
of BioSequence{A}
enables a user to extend functionality of BioSequence
with minimum effort. As an example, definition of a new alphabet type representing a sequence of boolean values is shown below:
julia> struct BoolAlphabet <: Alphabet end
julia> BioSequences.bitsof(::Type{BoolAlphabet}) = 1
julia> BioSequences.eltype(::Type{BoolAlphabet}) = Bool
julia> BioSequences.alphabet(::Type{BoolAlphabet}) = false:true
julia> function BioSequences.encode(::Type{BoolAlphabet}, x::Bool)
return UInt64(ifelse(x, 0x01, 0x00))
end
julia> function BioSequences.decode(::Type{BoolAlphabet}, x::UInt64)
if x > 0x01
throw(BioSequences.DecodeError(BoolAlphabet, x))
end
return ifelse(x == 0x00, false, true)
end