Bio.Seq: Biological Sequences
The Bio.Seq
module provides several data types for handling biological symbols and sequences.
Biological symbols
The Bio.Seq
module provides three biological symbol (character) types:
Type | Meaning |
---|---|
DNANucleotide |
DNA nucleotide |
RNANucleotide |
RNA nucleotide |
AminoAcid |
Amino acid |
These symbols can be elements of sequences like characters can be elements of strings. See sections beginning from Overview of sequences types section for details.
DNA and RNA nucleotides
Set of nucleotide symbols in Bio.jl covers IUPAC nucleotide base plus a gap symbol:
Symbol | Constant | Meaning |
---|---|---|
'A' | DNA_A / RNA_A |
A; Adenine |
'C' | DNA_C / RNA_C |
C; Cytosine |
'G' | DNA_G / RNA_G |
G; Guanine |
'T' | DNA_T |
T; Thymine (DNA only) |
'U' | RNA_U |
U; Uracil (RNA only) |
'M' | DNA_M / RNA_M |
A or C |
'R' | DNA_R / RNA_R |
A or G |
'W' | DNA_W / RNA_W |
A or T/U |
'S' | DNA_S / RNA_S |
C or G |
'Y' | DNA_Y / RNA_Y |
C or T/U |
'K' | DNA_K / RNA_K |
G or T/U |
'V' | DNA_V / RNA_V |
A or C or G; not T/U |
'H' | DNA_H / RNA_H |
A or C or T; not G |
'D' | DNA_D / RNA_D |
A or G or T/U; not C |
'B' | DNA_B / RNA_B |
C or G or T/U; not A |
'N' | DNA_N / RNA_N |
A or C or G or T/U |
'-' | DNA_Gap / RNA_Gap |
Gap (none of the above) |
http://www.insdc.org/documents/feature_table.html#7.4.1
Symbols are accessible as constants with DNA_
or RNA_
prefix:
julia> DNA_A DNA_A julia> DNA_T DNA_T julia> RNA_U RNA_U julia> DNA_Gap DNA_Gap julia> typeof(DNA_A) Bio.Seq.DNANucleotide julia> typeof(RNA_A) Bio.Seq.RNANucleotide
Symbols can be constructed by converting regular characters:
julia> convert(DNANucleotide, 'C') DNA_C julia> convert(DNANucleotide, 'C') === DNA_C true
Every nucleotide is encoded using the lower 4 bits of a byte. An unambiguous nucleotide has only one set bit and the other bits are unset. The table below summarizes all unambiguous nucleotides and their corresponding bits. An ambiguous nucleotide is the bitwise OR of unambiguous nucleotides that the ambiguous nucleotide can take. For example, DNA_R
(meaning the nucleotide is either DNA_A
or DNA_G
) is encoded as 0101
because 0101
is the bitwise OR of 0001
(DNA_A
) and 0100
(DNA_G
). The gap symbol is always 0000
.
Nucleotide | Bits |
---|---|
DNA_A , RNA_A |
0001 |
DNA_C , RNA_C |
0010 |
DNA_G , RNA_G |
0100 |
DNA_T , RNA_U |
1000 |
The next examples demonstrate bit operations of DNA:
julia> bits(reinterpret(UInt8, DNA_A)) "00000001" julia> bits(reinterpret(UInt8, DNA_G)) "00000100" julia> bits(reinterpret(UInt8, DNA_R)) "00000101" julia> bits(reinterpret(UInt8, DNA_B)) "00001110" julia> ~DNA_A DNA_B julia> DNA_A | DNA_G DNA_R julia> DNA_R & DNA_B DNA_G
Amino acids
Set of amino acid symbols also covers IUPAC amino acid symbols plus a gap symbol:
Symbol | Constant | Meaning |
---|---|---|
'A' | AA_A |
Alanine |
'R' | AA_R |
Arginine |
'N' | AA_N |
Asparagine |
'D' | AA_D |
Aspartic acid (Aspartate) |
'C' | AA_C |
Cysteine |
'Q' | AA_Q |
Glutamine |
'E' | AA_E |
Glutamic acid (Glutamate) |
'G' | AA_G |
Glycine |
'H' | AA_H |
Histidine |
'I' | AA_I |
Isoleucine |
'L' | AA_L |
Leucine |
'K' | AA_K |
Lysine |
'M' | AA_M |
Methionine |
'F' | AA_F |
Phenylalanine |
'P' | AA_P |
Proline |
'S' | AA_S |
Serine |
'T' | AA_T |
Threonine |
'W' | AA_W |
Tryptophan |
'Y' | AA_Y |
Tyrosine |
'V' | AA_V |
Valine |
'O' | AA_O |
Pyrrolysine |
'U' | AA_U |
Selenocysteine |
'B' | AA_B |
Aspartic acid or Asparagine |
'J' | AA_J |
Leucine or Isoleucine |
'Z' | AA_Z |
Glutamine or Glutamic acid |
'X' | AA_X |
Any amino acid |
'*' | AA_Term |
Termination codon |
'-' | AA_Gap |
Gap (none of the above) |
http://www.insdc.org/documents/feature_table.html#7.4.3
Symbols are accessible as constants with AA_
prefix:
julia> AA_A AA_A julia> AA_Q AA_Q julia> AA_Term AA_Term julia> typeof(AA_A) Bio.Seq.AminoAcid
Symbols can be constructed by converting regular characters:
julia> convert(AminoAcid, 'A') AA_A julia> convert(AminoAcid, 'P') === AA_P true
Other functions
#
Bio.Seq.alphabet
— Function.
alphabet(typ)
Return an iterator of symbols of typ
.
typ
is one of DNANucleotide
, RNANucleotide
, or AminoAcid
.
#
Bio.Seq.gap
— Function.
gap(typ)
Return the gap symbol of typ
.
typ
is one of DNANucleotide
, RNANucleotide
, AminoAcid
, or Char
.
#
Bio.Seq.iscompatible
— Function.
iscompatible(x, y)
Return true
if and only if x
and y
are compatible with each other (i.e. x
and y
can be the same symbol).
x
and y
must be the same type (DNANucleotide
, RNANucleotide
or AminoAcid
).
Examples
julia> iscompatible(DNA_A, DNA_A) true julia> iscompatible(DNA_C, DNA_N) # DNA_N can be DNA_C true julia> iscompatible(DNA_C, DNA_R) # DNA_R (A or G) cannot be DNA_C false julia> iscompatible(AA_A, AA_X) # AA_X can be AA_A true
#
Bio.Seq.isambiguous
— Function.
isambiguous(nt::Nucleotide)
Test if nt
is ambiguous nucleotide.
Overview of sequences types
The Bio.Seq
module provides representations and tools for manipulating nucleotide and amino acid sequences. Sequences in Bio.jl are more strictly typed than in many other libraries; elements in a sequence are typed as biological symbol instead of character or byte. They are special purpose types rather than simply strings and hence offer additional functionality that naive string types don't have. Though this strictness sacrifices some convenience, it also means you can always rely on a DNA sequence type to store DNA and nothing but DNA, without having to check, or deal with lowercase versus uppercase and so on. Strict separation of sequence types also means we are free to choose the most efficient representation. DNA and RNA sequences are encoded using four bits per base by default making them memory efficient, and also allowing us to speed up many common operations like nucleotide composition, reverse complement, and k-mer enumeration.
The Bio.Seq
provides three different sequence types: BioSequence
, Kmer
and ReferenceSequence
. Each of these types is a subtype of an abstract type called Sequence
and supports various string-like operations such as random access and iteration. Different sequence types have different features. In most situations, BioSequence
type will do and is used as the default representation. But sometimes other types are much more preferable in terms of memory efficiency and computation performance. Here is the summary table of these three types:
Type | Description | Element type | Mutability | Allocation |
---|---|---|---|---|
BioSequence{A<:Alphabet} |
general-purpose biological sequences | DNA, RNA, Amino acids | mutable | heap |
Kmer{T<:Nucleotide,k} |
specialized for short nucleotide sequences | DNA, RNA | immutable | stack / register |
ReferenceSequence |
specialized for long reference genomes | DNA | immutable | heap |
Details of these different representations are explained in the following sections:
BioSequence
: General-purpose sequencesKmer
: Nucleotide k-mersReferenceSequence
: Reference sequences
General-purpose sequences
BioSequence{A}
is a generic sequence type parameterized by an alphabet type A
that defines the domain (or set) of biological symbols, and each alphabet has an associated symbol type. For example, AminoAcidAlphabet
is associated with AminoAcid
and hence an object of the BioSequence{AminoAcidAlphabet}
type represents a sequence of amino acids. Symbols from multiple alphabets can't be intermixed in one sequence type.
The following table summarizes common sequence types that are defined in the Bio.Seq
module:
Type | Symbol type | Type alias |
---|---|---|
BioSequence{DNAAlphabet{4}} |
DNANucleotide |
DNASequence |
BioSequence{RNAAlphabet{4}} |
RNANucleotide |
RNASequence |
BioSequence{AminoAcidAlphabet} |
AminoAcid |
AminoAcidSequence |
BioSequence{CharAlphabet} |
Char |
CharSequence |
Parameterized definition of the BioSequence{A}
type is for the purpose of unifying the data structure and operations of any symbol type. In most cases, users don't have to care about it and can use type aliases listed above. However, the alphabet type fixes the internal memory encoding and plays an important role when optimizing performance of a program (see Compact representation section for low-memory encodings). It also enables a user to define their own alphabet only by defining few numbers of methods. This is described in Defining a new alphabet section.
Constructing sequences
Sequence types corresponding to these alphabets can be constructed a number of different ways. Most immediately, sequence literals can be constructed using the string macros dna
, rna
, aa
, and char
:
# String decorators are provided for common sequence types julia> dna"TACGTANNATC" 11nt DNA Sequence: TACGTANNATC julia> rna"AUUUGNCCANU" 11nt RNA Sequence: AUUUGNCCANU julia> aa"ARNDCQEGHILKMFPSTWYVX" 21aa Amino Acid Sequence: ARNDCQEGHILKMFPSTWYVX julia> char"αβγδϵ" 5char Char Sequence: αβγδϵ
However it should be noted that by default these sequence literals allocate the BioSequence
object before the code containing the sequence literal is run. This means there may be occasions where your program does not behave as you first expect, even though it is the intended behaviour. For example consider the following code:
julia> function foo() s = dna"CTT" push!(s, DNA_A) end foo (generic function with 1 method)
You might expect that every time you call foo
, that a DNA sequence CTTA
would be returned. You might expect that this is because every time foo
is called, a new DNA sequence variable CTT
is created, and and A
nucleotide is pushed to it, and the result, CTTA
is returned. In other words you might expect the following output:
julia> foo() 4nt DNA Sequence: CTTA julia> foo() 4nt DNA Sequence: CTTA julia> foo() 4nt DNA Sequence: CTTA
However, this is not what happens, instead the following happens:
julia> foo() 4nt DNA Sequence: CTTA julia> foo() 5nt DNA Sequence: CTTAA julia> foo() 6nt DNA Sequence: CTTAAA
The reason for this is because the sequence literal is allocated only once before the first time the function foo
is called and run. Therefore, s
in foo
is always a reference to that one sequence that was allocated. So one sequence is created before foo
is called, and then it is pushed to every time foo
is called. Thus, that one allocated sequence grows with every call of foo
.
If you wanted foo
to create a new sequence each time it is called, then you can add a flag to the end of the sequence literal to dictate behaviour: A flag of 's' means 'static': the sequence will be allocated before code is run, as is the default behaviour described above. However providing 'd' flag changes the behaviour: 'd' means 'dynamic': the sequence will be allocated at whilst the code is running, and not before. So to change foo
so as it creates a new sequence each time it is called, simply add the 'd' flag to the sequence literal:
julia> function foo() s = dna"CTT"d # 'd' flag appended to the string literal. push!(s, DNA_A) end foo (generic function with 1 method)
Now every time foo
is called, a new sequence CTT
is created, and an A
nucleotide is pushed to it:
julia> foo() 4nt DNA Sequence: CTTA julia> foo() 4nt DNA Sequence: CTTA julia> foo() 4nt DNA Sequence: CTTA
So the take come message of sequence literals is: Be careful when you are using sequence literals inside of functions, and inside the bodies of things like for loops. And if you use them and are unsure, use the 's' and 'd' flags to ensure the behaviour you get is the behaviour you intend.
Sequences can also be constructed from strings or arrays of nucleotide or amino acid symbols using constructors or the convert
function:
julia> DNASequence("TTANC") 5nt DNA Sequence: TTANC julia> DNASequence([DNA_T, DNA_T, DNA_A, DNA_N, DNA_C]) 5nt DNA Sequence: TTANC julia> convert(DNASequence, [DNA_T, DNA_T, DNA_A, DNA_N, DNA_C]) 5nt DNA Sequence: TTANC
Using convert
, these operations are reversible: sequences can be converted to strings or arrays:
julia> convert(String, dna"TTANGTA") "TTANGTA" julia> convert(Vector{DNANucleotide}, dna"TTANGTA") 7-element Array{Bio.Seq.DNANucleotide,1}: DNA_T DNA_T DNA_A DNA_N DNA_G DNA_T DNA_A
Sequences can also be concatenated into longer sequences:
julia> DNASequence(dna"ACGT", dna"NNNN", dna"TGCA") 12nt DNA Sequence: ACGTNNNNTGCA julia> dna"ACGT" * dna"TGCA" 8nt DNA Sequence: ACGTTGCA julia> repeat(dna"TA", 10) 20nt DNA Sequence: TATATATATATATATATATA julia> dna"TA" ^ 10 20nt DNA Sequence: TATATATATATATATATATA
Despite being separate types, DNASequence
and RNASequence
can freely be converted between efficiently without copying the underlying data:
julia> dna = dna"TTANGTAGACCG" 12nt DNA Sequence: TTANGTAGACCG julia> rna = convert(RNASequence, dna) 12nt RNA Sequence: UUANGUAGACCG julia> dna.data === rna.data # underlying data are same true
A random sequence can be obtained by the randdnaseq
, randrnaseq
and randaaseq
functions, which generate DNASequence
, RNASequence
and AminoAcidSequence
, respectively. Generated sequences are composed of the standard symbols without ambiguity and gap. For example, randdnaseq(6)
may generate dna"TCATAG"
but never generates dna"TNANAG"
or dna"T-ATAG"
.
A translatable RNASequence
can also be converted to an AminoAcidSequence
using the translate
function described below.
Indexing and modifying
Sequences for the most part behave like other vector or string types. They can be indexed using integers or ranges:
julia> seq = dna"ACGTTTANAGTNNAGTACC" 19nt DNA Sequence: ACGTTTANAGTNNAGTACC julia> seq[5] DNA_T julia> seq[6:end] 14nt DNA Sequence: TANAGTNNAGTACC
Indexing by range creates a subsequence of the original sequence. Unlike Vector
in the standard library, creating a subsequences is copy-free: a subsequence is just a reference to the original sequence with its range. You may think that this is unsafe because modifying subsequences propagates to the original sequence, but this doesn't happen actually:
julia> seq = dna"AAAA" # create a sequence 4nt DNA Sequence: AAAA julia> subseq = seq[1:2] # create a subsequence from `seq` 2nt DNA Sequence: AA julia> subseq[2] = DNA_T # modify the second element of it DNA_T julia> subseq # the subsequence is modified 2nt DNA Sequence: AT julia> seq # but the original sequence is not 4nt DNA Sequence: AAAA
This is because modifying a sequence checks whether its underlying data are shared with other sequences under the hood. If and only if the data are shared, the subsequence creates a copy of itself. Any modifying operation does this check. This is called copy-on-write strategy and users don't need to care about it because it is transparent from outward.
The following modifying operations are currently supported:
setindex!(seq, item, index) push!(seq, item) pop!(seq) shift!(seq) unshift!(seq, item) insert!(seq, index, item) deleteat!(seq, index) append!(seq, other_seq) copy!(dst_seq, dest_offset, src_seq, src_offset, len) reverse!(seq)
And these two are functions for nucleotide sequences:
complement!(seq) reverse_complement!(seq)
julia> seq = dna"ACG" 3nt DNA Sequence: ACG julia> push!(seq, DNA_T) 4nt DNA Sequence: ACGT julia> append!(seq, dna"AT") 6nt DNA Sequence: ACGTAT julia> reverse!(seq) 6nt DNA Sequence: TATGCA julia> complement!(seq) 6nt DNA Sequence: ATACGT julia> reverse_complement!(seq) 6nt DNA Sequence: ACGTAT
Sequences also work as iterators over symbols:
julia> n = 0 0 julia> for nt in dna"ATNGNNT" if nt == DNA_N n += 1 end end julia> n 3
Other operations on sequences
A number of common sequence operations are provided in the Bio.Seq
module:
#
Bio.Seq.complement
— Function.
complement(nt::Nucleotide)
Return the complementary nucleotide of nt
.
complement(seq)
Make a complement sequence of seq
.
complement(kmer::Kmer)
Return the complement of kmer
.
#
Bio.Seq.reverse_complement
— Function.
reverse_complement(seq)
Make a reversed complement sequence of seq
.
Ambiguous nucleotides are left as-is.
reverse_complement(kmer::Kmer)
Return the reverse complement of kmer
#
Bio.Seq.mismatches
— Function.
mismatches(seq1::BioSequence, seq2::BioSequence[, compatible=false])
Return the number of mismatches between seq1
and seq2
.
If seq1
and seq2
are of differing lengths, only the first min(length(seq1), length(seq2))
nucleotides are compared. When compatible
is true
, sequence symbols are comapred using iscompatible
; otherwise using ==
.
mismatches(a::Kmer, b::Kmer)
Return the number of mismatches between a
and b
.
#
Bio.Seq.composition
— Function.
composition(seq | kmer_iter)
Calculate composition of biological symbols in seq
or k-mers in kmer_iter
.
#
Bio.Seq.seqmatrix
— Function.
seqmatrix{A<:Alphabet}(vseq::Vector{BioSequence{A}}, major::Symbol)
Construct a matrix of nucleotides or amino acids from a vector of BioSequence
s.
If parameter major
is set to :site
, the matrix is created such that one nucleotide from each sequence is placed in each column i.e. the matrix is laid out in site-major order. This means that iteration over one position of many sequences is efficient, as julia arrays are laid out in column major order.
If the parameter major
is set to :seq
, the matrix is created such that each sequence is placed in one column i.e. the matrix is laid out in sequence-major order. This means that iteration across each sequence in turn is efficient, as julia arrays are laid out in column major order.
Examples
julia> seqs = [dna"AAA", dna"TTT", dna"CCC", dna"GGG"] 4-element Array{Bio.Seq.BioSequence{Bio.Seq.DNAAlphabet{4}},1}: 3nt DNA Sequence: AAA 3nt DNA Sequence: TTT 3nt DNA Sequence: CCC 3nt DNA Sequence: GGG julia> seqmatrix(seqs, :site) 4x3 Array{Bio.Seq.DNANucleotide,2}: DNA_A DNA_A DNA_A DNA_T DNA_T DNA_T DNA_C DNA_C DNA_C DNA_G DNA_G DNA_G julia> seqmatrix(seqs, :seq) 3x4 Array{Bio.Seq.DNANucleotide,2}: DNA_A DNA_T DNA_C DNA_G DNA_A DNA_T DNA_C DNA_G DNA_A DNA_T DNA_C DNA_G
seqmatrix{A<:Alphabet,T}(::Type{T}, vseq::Vector{BioSequence{A}}, major::Symbol)
Construct a matrix of T
from a vector of BioSequence
s.
If parameter major
is set to :site
, the matrix is created such that one nucleotide from each sequence is placed in each column i.e. the matrix is laid out in site-major order. This means that iteration over one position of many sequences is efficient, as julia arrays are laid out in column major order.
If the parameter major
is set to :seq
, the matrix is created such that each sequence is placed in one column i.e. the matrix is laid out in sequence-major order. This means that iteration across each sequence in turn is efficient, as julia arrays are laid out in column major order.
Examples
julia> seqs = [dna"AAA", dna"TTT", dna"CCC", dna"GGG"] 4-element Array{Bio.Seq.BioSequence{Bio.Seq.DNAAlphabet{4}},1}: 3nt DNA Sequence: AAA 3nt DNA Sequence: TTT 3nt DNA Sequence: CCC 3nt DNA Sequence: GGG julia> seqmatrix(seqs, :site, UInt8) 4×3 Array{UInt8,2}: 0x01 0x01 0x01 0x08 0x08 0x08 0x02 0x02 0x02 0x04 0x04 0x04 julia> seqmatrix(seqs, :seq, UInt8) 3×4 Array{UInt8,2}: 0x01 0x08 0x02 0x04 0x01 0x08 0x02 0x04 0x01 0x08 0x02 0x04
Translation
The translate
funtion translates a sequence of codons in a RNA sequence to a amino acid sequence besed on a genetic code mapping. The Bio.Seq
module contains all NCBI defined genetic codes and they are registered in ncbi_trans_table
.
#
Bio.Seq.translate
— Function.
translate(rna_seq, code=standard_genetic_code, allow_ambiguous_codons=true)
Translate an RNASequence
to an AminoAcidSequence
.
Translation uses genetic code code
to map codons to amino acids. See ncbi_trans_table
for available genetic codes. If codons in the given RNA sequence cannot determine a unique amino acid, they will be translated to AA_X
if allow_ambiguous_codons
is true
and otherwise result in an error.
#
Bio.Seq.ncbi_trans_table
— Constant.
Genetic code list of NCBI.
The standard genetic code is ncbi_trans_table[1]
and others can be shown by show(ncbi_trans_table)
. For more details, consult the next link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.
julia> ncbi_trans_table Translation Tables: 1. The Standard Code (standard_genetic_code) 2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code) 3. The Yeast Mitochondrial Code (yeast_mitochondrial_genetic_code) 4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (mold_mitochondrial_genetic_code) 5. The Invertebrate Mitochondrial Code (invertebrate_mitochondrial_genetic_code) 6. The Ciliate, Dasycladacean and Hexamita Nuclear Code (ciliate_nuclear_genetic_code) 9. The Echinoderm and Flatworm Mitochondrial Code (echinoderm_mitochondrial_genetic_code) 10. The Euplotid Nuclear Code (euplotid_nuclear_genetic_code) 11. The Bacterial, Archaeal and Plant Plastid Code (bacterial_plastid_genetic_code) 12. The Alternative Yeast Nuclear Code (alternative_yeast_nuclear_genetic_code) 13. The Ascidian Mitochondrial Code (ascidian_mitochondrial_genetic_code) 14. The Alternative Flatworm Mitochondrial Code (alternative_flatworm_mitochondrial_genetic_code) 16. Chlorophycean Mitochondrial Code (chlorophycean_mitochondrial_genetic_code) 21. Trematode Mitochondrial Code (trematode_mitochondrial_genetic_code) 22. Scenedesmus obliquus Mitochondrial Code (scenedesmus_obliquus_mitochondrial_genetic_code) 23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code) 24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code) 25. Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)
http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes
Compact representation
As we saw above, DNA and RNA sequences can store any ambiguous nucleotides like 'N'. If you are sure that nucleotide sequences store unambiguous nucleotides only, you can save the memory space of sequences. DNAAlphabet{2}
is an alphabet that uses two bits per base and limits to only unambiguous nucleotide symbols (ACGT in DNA and ACGU in RNA). To create a sequence of this alphabet, you need to explicitly pass DNAAlphabet{2}
to BioSequence
as its parametric type:
julia> seq = BioSequence{DNAAlphabet{2}}("ACGT") 4nt DNA Sequence: ACGT
Recall that DNASequence
is a type alias of BioSequence{DNAAlphabet{4}}
, which uses four bits per base. That is, BioSequence{DNAAlphabet{2}}
saves half of memory footprint compared to BioSequence{DNAAlphabet{4}}
. If you need to handle reference genomes that are composed of five nucleotides, ACGTN, consider to use the ReferenceSequence
type described in the Reference sequences section.
Defining a new alphabet
The alphabet type parameter A
of BioSequence{A}
enables a user to extend functionality of BioSequence
with mimimum effort. As an example, definition of a new alphabet type representing a sequence of boolean values is shown below:
using Bio.Seq immutable BoolAlphabet <: Alphabet end Seq.bitsof(::Type{BoolAlphabet}) = 1 Seq.eltype(::Type{BoolAlphabet}) = Bool Seq.alphabet(::Type{BoolAlphabet}) = false:true function Seq.encode(::Type{BoolAlphabet}, x::Bool) return UInt64(ifelse(x, 0x01, 0x00)) end function Seq.decode(::Type{BoolAlphabet}, x::UInt64) if x > 0x01 throw(Seq.DecodeError(BoolAlphabet, x)) end return ifelse(x == 0x00, false, true) end
Nucleotide k-mers
A common strategy to simplify the analysis of sequence data is to operate or short k-mers, for size fixed size k
. These can be packed into machine integers allowing extremely efficient code. The Bio.Seq
module has built in support for representing short sequences in 64-bit integers. Besides being fixed length, Kmer
types, unlike other sequence types cannot contain ambiguous symbols like 'N'.
The Kmer{T,k}
type parameterized on symbol type (T
, either DNANucleotide
, or RNANucleotide
) and size k
. For ease of writing code, two type aliases for each nucleotide type are defined and named as DNAKmer{k}
and RNAKmer{k}
:
julia> DNAKmer("ACGT") # create a DNA 4-mer from a string DNA 4-mer: ACGT julia> RNAKmer("ACGU") # create an RNA 4-mer from a string RNA 4-mer: ACGU julia> typeof(DNAKmer("ACGT")) Bio.Seq.Kmer{Bio.Seq.DNANucleotide,4}
#
Bio.Seq.each
— Function.
each(::Type{Kmer{T,k}}, seq::Sequence[, step=1])
Initialize an iterator over all k-mers in a sequence seq
skipping ambiguous nucleotides without changing the reading frame.
Arguments
Kmer{T,k}
: k-mer type to enumerate.seq
: a nucleotide sequence.step=1
: the number of positions between iterated k-mers
Examples
# iterate over DNA codons for (pos, codon) in each(DNAKmer{3}, dna"ATCCTANAGNTACT", 3) @show pos, codon end
#
Bio.Seq.canonical
— Function.
canonical(kmer::Kmer)
Return the canonical k-mer of x
.
A canonical k-mer is the numerical lesser of a k-mer and its reverse complement. This is useful in hashing/counting k-mers in data that is not strand specific, and thus observing k-mer is equivalent to observing its reverse complement.
#
Bio.Seq.neighbors
— Function.
neighbors(kmer::Kmer)
Return an iterator through k-mers neighboring kmer
on a de Bruijn graph.
Reference sequences
DNASequence
(alias of BioSequence{DNAAlphabet{4}}
) is a flexible data sturcture but always consumes 4 bits per base, which will waste a large part of the memory space when storing reference genome sequences. In such a case, ReferenceSequence
is helpful because it compresses positions of 'N' symbols so that long DNA sequences are stored with almost 2 bits per base. An important limitation is that the ReferenceSequence
type is immutable due to the compression. Other sequence-like operations are supported:
julia> seq = ReferenceSequence(dna"NNCGTATTTTCN") 12nt Reference Sequence: NNCGTATTTTCN julia> seq[1] DNA_N julia> seq[5] DNA_T julia> seq[2:6] 5nt Reference Sequence: NCGTA julia> ReferenceSequence(dna"ATGM") # DNA_M is not accepted ERROR: ArgumentError: invalid symbol M ∉ {A,C,G,T,N} at 4 in convert at /Users/kenta/.julia/v0.4/Bio/src/seq/refseq.jl:58 in call at essentials.jl:56
When reading reference sequences from a FASTA file, the following snippet will avoid allocating temporary sequences and conversion:
for record in open(FASTAReader{ReferenceSequence}, "hg38.fa") # do something end
Sequence search
Three kinds of on-line search functions are provided:
- Exact search
- Approximate search
- Regular expression search
These are all specialized for biological sequences and ambiguities of symbols are considered.
Exact search
Exact search functions search for an occurrence of the query symbol or sequence. Four functions, search
, searchindex
, rsearch
, and rsearchindex
are available:
julia> seq = dna"ACAGCGTAGCT"; julia> search(seq, DNA_G) # search a query symbol 4:4 julia> query = dna"AGC"; julia> search(seq, query) # search a query sequence 3:5 julia> searchindex(seq, query) 3 julia> rsearch(seq, query) # similar to `search` but in the reverse direction 8:10 julia> rsearchindex(seq, query) # similar to `searchindex` but in the reverse direction 8
These search functions take ambiguous symbols into account. That is, if two symbols are compatible (e.g. DNA_A
and DNA_N
), they match when searching an occurrence. In the following example, 'N' is a wild card that matches any symbols:
julia> search(dna"ACNT", DNA_N) # 'A' matches 'N' 1:1 julia> search(dna"ACNT", dna"CGT") # 'N' matches 'G' 2:4 julia> search(dna"ACGT", dna"CNT") # 'G' matches 'N' 2:4
The exact sequence search needs preprocessing phase of query sequence before searching phase. This would be enough fast for most search applications. But when searching a query sequence to large amounts of target sequences, caching the result of preprocessing may save time. The ExactSearchQuery
creates such a preprocessed query object and is applicable to the search functions:
julia> query = ExactSearchQuery(dna"ATT"); julia> search(dna"ATTTATT", query) 1:3 julia> rsearch(dna"ATTTATT", query) 5:7
Approximate search
The approximate search is similar to the exact search but allows a specific number of errors. That is, it tries to find a subsequence of the target sequence within a specific Levenshtein distance of the query sequence:
julia> seq = dna"ACAGCGTAGCT"; julia> approxsearch(seq, dna"AGGG", 0) # nothing matches with no errors 0:-1 julia> approxsearch(seq, dna"AGGG", 1) # seq[3:5] matches with one error 3:6 julia> approxsearch(seq, dna"AGGG", 2) # seq[1:4] matches with two errors 1:4
Like the exact search functions, four kinds of functions (approxsearch
, approxsearchindex
, approxrsearch
, and approxrsearchindex
) are available:
julia> seq = dna"ACAGCGTAGCT"; pat = dna"AGGG"; julia> approxsearch(seq, pat, 2) # return the range (forward) 1:4 julia> approxsearchindex(seq, pat, 2) # return the starting index (forward) 1 julia> approxrsearch(seq, pat, 2) # return the range (backward) 8:11 julia> approxrsearchindex(seq, pat, 2) # return the starting index (backward) 8
Preprocessing can be cached in an ApproximateSearchQuery
object:
julia> query = ApproximateSearchQuery(dna"AGGG"); julia> approxsearch(dna"AAGAGG", query, 1) 2:5 julia> approxsearch(dna"ACTACGT", query, 2) 4:6
Regular expression search
Query patterns can be described in regular expressions. The syntax supports a subset of Perl and PROSITE's notation.
The Perl-like syntax starts with biore
(biological regular expression) and ends with a symbol option: "dna", "rna" or "aa". For example, biore"A+"dna
is a regular expression for DNA sequences and biore"A+"aa
is for amino acid sequences. The symbol options can be abbreviated to its first character: "d", "r" or "a", respectively.
Here are examples of using the regular expression for BioSequence
s:
julia> match(biore"A+C*"dna, dna"AAAACC") Nullable(RegexMatch("AAAACC")) julia> match(biore"A+C*"d, dna"AAAACC") Nullable(RegexMatch("AAAACC")) julia> ismatch(biore"A+C*"dna, dna"AAC") true julia> ismatch(biore"A+C*"dna, dna"C") false
match
always returns a Nullable
object and it should be null if no match is found.
The table below summarizes available syntax elements.
Syntax | Description | Example |
---|---|---|
\| |
alternation | "A\|T" matches "A" and "T" |
* |
zero or more times repeat | "TA*" matches "T" , "TA" and "TAA" |
+ |
one or more times repeat | "TA+" matches "TA" and "TAA" |
? |
zero or one time | "TA?" matches "T" and "TA" |
{n,} |
n or more times repeat |
"A{3,}" matches "AAA" and "AAAA" |
{n,m} |
n -m times repeat |
"A{3,5}" matches "AAA" , "AAAA" and "AAAAA" |
^ |
the start of the sequence | "^TAN*" matches "TATGT" |
$ |
the end of the sequence | "N*TA$" matches "GCTA" |
(...) |
pattern grouping | "(TA)+" matches "TA" and "TATA" |
[...] |
one of symbols | "[ACG]+" matches "AGGC" |
eachmatch
, matchall
, and search
are also defined like usual strings:
julia> matchall(biore"TATA*?"d, dna"TATTATAATTA") # overlap (default) 4-element Array{Bio.Seq.BioSequence{Bio.Seq.DNAAlphabet{4}},1}: 3nt DNA Sequence: TAT 3nt DNA Sequence: TAT 4nt DNA Sequence: TATA 5nt DNA Sequence: TATAA julia> matchall(biore"TATA*"d, dna"TATTATAATTA", false) # no overlap 2-element Array{Bio.Seq.BioSequence{Bio.Seq.DNAAlphabet{4}},1}: 3nt DNA Sequence: TAT 5nt DNA Sequence: TATAA julia> search(dna"TATTATAATTA", biore"TATA*"d) 1:3 julia> search(dna"TATTATAATTA", biore"TATA*"d, 2) 4:8
Notewothy differences from strings are:
- Ambiguous characters match any compatible characters (e.g.
biore"N"d
is equivalent tobiore"[ACGT]"d
). - Whitespaces are ignored (e.g.
biore"A C G"d
is equivalent tobiore"ACG"d
).
The PROSITE notation is described in ScanProsite - user manual. The syntax supports almost all notations including the extended syntax. The PROSITE notation starts with prosite
prefix and no symbol option is needed because it always descirbe patterns of amino acid sequences:
julia> match(prosite"[AC]-x-V-x(4)-{ED}", aa"CPVPQARG") Nullable(RegexMatch("CPVPQARG")) julia> match(prosite"[AC]xVx(4){ED}", aa"CPVPQARG") Nullable(RegexMatch("CPVPQARG"))
Sequence composition
Sequence composition can be easily calculated using the composition
function:
julia> comp = composition(dna"ACGAG") DNA Nucleotide Composition: DNA_A => 2 DNA_C => 1 DNA_G => 2 DNA_T => 0 DNA_M => 0 DNA_R => 0 DNA_W => 0 DNA_S => 0 DNA_Y => 0 DNA_K => 0 DNA_V => 0 DNA_H => 0 DNA_D => 0 DNA_B => 0 DNA_N => 0 DNA_Gap => 0 julia> comp[DNA_A] 2 julia> comp[DNA_T] 0
To accumulate composition statistics of multiple sequences, merge!
can be used as follows:
# initiaize an empty composition counter comp = composition(dna"") # iterate over sequences and accumulate composition statistics into `comp` for seq in seqs merge!(comp, composition(seq)) end # or functional programming style in one line foldl((x, y) -> merge(x, composition(y)), composition(dna""), seqs)
composition
is also applicable to a k-mer iterator:
julia> comp = composition(each(DNAKmer{4}, dna"ACGT"^100)); julia> comp[DNAKmer("ACGT")] 100 julia> comp[DNAKmer("CGTA")] 99
Sequence records
The SeqRecord
type is used to represent a named sequence, optionally with accompanying metadata. It is defined as: sequence. It is as:
type SeqRecord{S, T} name::String seq::S metadata::T end
The type of the metadata
field depends on the source of the sequence record. For example, if a record is read from a FASTA file, metadata contains the description field. If from a FASTQ file, a quality scores assigned to base calls during sequencing.
The following accessors are defined for the SeqRecord
type:
#
Bio.Seq.seqname
— Function.
seqname(record)
Return the sequence name of record
.
#
Bio.Seq.sequence
— Function.
sequence(record)
Return the sequence of record
.
#
Bio.Seq.metadata
— Function.
metadata(record)
Return the metadata of record
.
Sequence demultiplexing
Multiplex sequencing is a technology to sequence mutliple samples at the same time on a high-throughput DNA sequencer. Samples are distinguished by the short prefix of a DNA sequence called DNA barcode. The Bio.Seq
offers the Demultiplexer
type and the demultiplex
function to identify the DNA barcode of a longer DNA sequence allowing small errors.
In the following example, four kinds of DNA sequences of length 4 are used as DNA barcodes. Demultiplexer
takes these barcodes as its first argument with a few options:
julia> barcodes = DNASequence["ATGG", "CAGA", "GGAA", "TACG"]; julia> dplxr = Demultiplexer(barcodes, n_max_errors=1, distance=:hamming) Bio.Seq.Demultiplexer{Bio.Seq.BioSequence{Bio.Seq.DNAAlphabet{4}}}: distance: hamming number of barcodes: 4 number of correctable errors: 1
n_max_errors
specifies the number of maximum correctable errors in a barcode. The type of correctable errors depends on the distance
parameter. When distance = :hamming
as shown above only substitutions are correctable. When distance = :levenshtein
substitutions, deletions, and insertions are correctable. The user is responsible for keeping enough distances among barcodes; Demultiplexer
will throw an exception if two barcodes are within n_max_errors * 2
.
The demultiplex
function takes a demultiplexer object and a DNA sequence, and returns a tuple of a barcode index and a distance between the original barcode sequence and the prefix sequence:
julia> demultiplex(dplxr, dna"ATGGCGNT") # 1st barcode with no errors (1,0) julia> demultiplex(dplxr, dna"CAGGCGNT") # 2nd barcode with one error (2,1) julia> demultiplex(dplxr, dna"GGAACGNT") # 3rd barcode with no errors (3,0) julia> demultiplex(dplxr, dna"TGACCGNT") # no matching barcode (0,-1)
The optional third argument controls the search strategy. demultiplex
uses an index to search the closest barcode within n_max_errors
in the barcode set and returns it if any by default. If the third argument is true
it falls back to a linear search after the index search and returns one of the closest barcodes at random. The next example shows the difference of these two strategies:
julia> demultiplex(dplxr, dna"TGACCGNT", false) # linear search off (default) (0,-1) julia> demultiplex(dplxr, dna"TGACCGNT", true) # linear search on (3,2)