Biological symbols
The BioSequences module reexports the biological symbol (character) types that are provided by BioSymbols.jl:
Type | Meaning |
---|---|
DNA | DNA nucleotide |
RNA | RNA nucleotide |
AminoAcid | Amino acid |
These symbols are the elements of biological sequences, just as characters are the elements of strings. See the sections starting from "Introduction to the sequence data-types" for details.
DNA and RNA nucleotides
The set of nucleotide symbols in BioSequences covers the IUPAC nucleotide base codes plus a gap symbol:
Symbol | Constant | Meaning |
---|---|---|
'A' | DNA_A / RNA_A | A; Adenine |
'C' | DNA_C / RNA_C | C; Cytosine |
'G' | DNA_G / RNA_G | G; Guanine |
'T' | DNA_T | T; Thymine (DNA only) |
'U' | RNA_U | U; Uracil (RNA only) |
'M' | DNA_M / RNA_M | A or C |
'R' | DNA_R / RNA_R | A or G |
'W' | DNA_W / RNA_W | A or T/U |
'S' | DNA_S / RNA_S | C or G |
'Y' | DNA_Y / RNA_Y | C or T/U |
'K' | DNA_K / RNA_K | G or T/U |
'V' | DNA_V / RNA_V | A or C or G; not T/U |
'H' | DNA_H / RNA_H | A or C or T/U; not G |
'D' | DNA_D / RNA_D | A or G or T/U; not C |
'B' | DNA_B / RNA_B | C or G or T/U; not A |
'N' | DNA_N / RNA_N | A or C or G or T/U |
'-' | DNA_Gap / RNA_Gap | Gap (none of the above) |
http://www.insdc.org/documents/feature_table.html#7.4.1
Symbols are accessible as constants with a DNA_ or RNA_ prefix:
julia> DNA_A
DNA_A
julia> DNA_T
DNA_T
julia> RNA_U
RNA_U
julia> DNA_Gap
DNA_Gap
julia> typeof(DNA_A)
DNA
julia> typeof(RNA_A)
RNA
Symbols can be constructed by converting regular characters:
julia> convert(DNA, 'C')
DNA_C
julia> convert(DNA, 'C') === DNA_C
true
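As a small sketch of the same conversion applied to more than one character, the comprehension below converts each character of an ordinary string. The variable name symbols is only illustrative, and the snippet assumes every character in the string appears in the symbol table above.

```julia
using BioSequences

# Convert each character of a plain string to a DNA symbol,
# using the same convert(DNA, ::Char) call shown above.
symbols = [convert(DNA, c) for c in "ACGT-"]

# Each element is a DNA symbol, not a Char.
eltype(symbols) == DNA
```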
Every nucleotide is encoded using the lower 4 bits of a byte. An unambiguous nucleotide has exactly one bit set. The table below lists the unambiguous nucleotides and their corresponding bits. An ambiguous nucleotide is the bitwise OR of the unambiguous nucleotides it can represent. For example, DNA_R (meaning the nucleotide is either DNA_A or DNA_G) is encoded as 0101 because 0101 is the bitwise OR of 0001 (DNA_A) and 0100 (DNA_G). The gap symbol is always 0000.
NucleicAcid | Bits |
---|---|
DNA_A, RNA_A | 0001 |
DNA_C, RNA_C | 0010 |
DNA_G, RNA_G | 0100 |
DNA_T, RNA_U | 1000 |
The following examples demonstrate bit operations on DNA symbols:
julia> bitstring(reinterpret(UInt8, DNA_A))
"00000001"
julia> bitstring(reinterpret(UInt8, DNA_G))
"00000100"
julia> bitstring(reinterpret(UInt8, DNA_R))
"00000101"
julia> bitstring(reinterpret(UInt8, DNA_B))
"00001110"
julia> ~DNA_A
DNA_B
julia> DNA_A | DNA_G
DNA_R
julia> DNA_R & DNA_B
DNA_G
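The arithmetic behind these results can be reproduced without the library. The sketch below models the 4-bit codes as plain UInt8 values; the constants and the helpers ambiguity_code, others, and nbases are hypothetical names chosen for illustration, not part of BioSymbols.jl. It assumes, as the ~DNA_A example suggests, that complementing a symbol only affects the lower 4 bits.

```julia
# Hypothetical stand-ins for the 4-bit codes in the table above.
# These are plain UInt8 values, not the library's DNA type.
const A = 0b0001
const C = 0b0010
const G = 0b0100
const T = 0b1000

# An ambiguity code is the bitwise OR of the bases it may stand for.
# With no bases at all, the result is 0000, the gap code.
ambiguity_code(bases::UInt8...) = reduce(|, bases; init = 0x00)

# All bases not covered by `code`, restricted to the lower 4 bits.
others(code::UInt8) = ~code & 0b1111

# Number of bases a code can stand for.
nbases(code::UInt8) = count_ones(code)

R = ambiguity_code(A, G)   # 0b0101, matching DNA_R
B = others(A)              # 0b1110, matching DNA_B
G_only = R & B             # 0b0100, matching DNA_R & DNA_B
```

Counting the set bits of a code with nbases gives the number of bases it may stand for: one for an unambiguous nucleotide, two for R, three for B.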
Amino acids
The set of amino acid symbols likewise covers the IUPAC amino acid codes plus a gap symbol:
Symbol | Constant | Meaning |
---|---|---|
'A' | AA_A | Alanine |
'R' | AA_R | Arginine |
'N' | AA_N | Asparagine |
'D' | AA_D | Aspartic acid (Aspartate) |
'C' | AA_C | Cysteine |
'Q' | AA_Q | Glutamine |
'E' | AA_E | Glutamic acid (Glutamate) |
'G' | AA_G | Glycine |
'H' | AA_H | Histidine |
'I' | AA_I | Isoleucine |
'L' | AA_L | Leucine |
'K' | AA_K | Lysine |
'M' | AA_M | Methionine |
'F' | AA_F | Phenylalanine |
'P' | AA_P | Proline |
'S' | AA_S | Serine |
'T' | AA_T | Threonine |
'W' | AA_W | Tryptophan |
'Y' | AA_Y | Tyrosine |
'V' | AA_V | Valine |
'O' | AA_O | Pyrrolysine |
'U' | AA_U | Selenocysteine |
'B' | AA_B | Aspartic acid or Asparagine |
'J' | AA_J | Leucine or Isoleucine |
'Z' | AA_Z | Glutamine or Glutamic acid |
'X' | AA_X | Any amino acid |
'*' | AA_Term | Termination codon |
'-' | AA_Gap | Gap (none of the above) |
http://www.insdc.org/documents/feature_table.html#7.4.3
Symbols are accessible as constants with an AA_ prefix:
julia> AA_A
AA_A
julia> AA_Q
AA_Q
julia> AA_Term
AA_Term
julia> typeof(AA_A)
AminoAcid
Symbols can be constructed by converting regular characters:
julia> convert(AminoAcid, 'A')
AA_A
julia> convert(AminoAcid, 'P') === AA_P
true
Other functions
BioSymbols.alphabet - Function.
alphabet(DNA)
Get all symbols of DNA in sorted order.
Examples
julia> alphabet(DNA)
(DNA_Gap, DNA_A, DNA_C, DNA_M, DNA_G, DNA_R, DNA_S, DNA_V, DNA_T, DNA_W, DNA_Y, DNA_H, DNA_K, DNA_D, DNA_B, DNA_N)
julia> issorted(alphabet(DNA))
true
alphabet(RNA)
Get all symbols of RNA in sorted order.
Examples
julia> alphabet(RNA)
(RNA_Gap, RNA_A, RNA_C, RNA_M, RNA_G, RNA_R, RNA_S, RNA_V, RNA_U, RNA_W, RNA_Y, RNA_H, RNA_K, RNA_D, RNA_B, RNA_N)
julia> issorted(alphabet(RNA))
true
alphabet(AminoAcid)
Get all symbols of AminoAcid in sorted order.
Examples
julia> alphabet(AminoAcid)
(AA_A, AA_R, AA_N, AA_D, AA_C, AA_Q, AA_E, AA_G, AA_H, AA_I, AA_L, AA_K, AA_M, AA_F, AA_P, AA_S, AA_T, AA_W, AA_Y, AA_V, AA_O, AA_U, AA_B, AA_J, AA_Z, AA_X, AA_Term, AA_Gap)
julia> issorted(alphabet(AminoAcid))
true
Gets the alphabet encoding of a given BioSequence.
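The sorted order of the DNA alphabet appears to follow the 4-bit encoding described earlier: the listing starts at the gap (0000) and ends at DNA_N (1111). The sketch below checks this using only alphabet and the reinterpret call from the bit-operation examples; the variable name codes is illustrative.

```julia
using BioSequences

# Raw byte value of each DNA symbol, in alphabet order.
codes = [reinterpret(UInt8, nt) for nt in alphabet(DNA)]

# If the order follows the encoding, the codes count up from 0 to 15.
codes == 0:15
```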
BioSymbols.gap - Function.
gap(DNA)
Return DNA_Gap.
gap(RNA)
Return RNA_Gap.
gap(AminoAcid)
Return AA_Gap.
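Because gap takes the symbol type as its argument, it can be used to write code that works for any of the three symbol types. A minimal sketch, where count_gaps is a hypothetical helper and not part of the library:

```julia
using BioSequences

# Hypothetical helper: count gap symbols in a vector of any one symbol type.
# gap(T) picks the matching gap constant for the element type T.
count_gaps(symbols::Vector{T}) where {T} = count(==(gap(T)), symbols)

dna_gaps = count_gaps([DNA_A, DNA_Gap, DNA_C, DNA_Gap])   # two gaps
aa_gaps = count_gaps([AA_A, AA_Gap])                      # one gap
```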
BioSymbols.iscompatible - Function.
iscompatible(x::T, y::T) where T <: NucleicAcid
Test if x and y are compatible with each other (i.e. x and y can be the same symbol).
x and y must be the same type.
Examples
julia> iscompatible(DNA_A, DNA_A)
true
julia> iscompatible(DNA_C, DNA_N) # DNA_N can be DNA_C
true
julia> iscompatible(DNA_C, DNA_R) # DNA_R (A or G) cannot be DNA_C
false
iscompatible(x::AminoAcid, y::AminoAcid)
Test if x and y are compatible with each other.
Examples
julia> iscompatible(AA_A, AA_R)
false
julia> iscompatible(AA_A, AA_X)
true
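For nucleotides, the documented examples are consistent with reading compatibility as "the two symbols share at least one possible base", which in the 4-bit encoding is a non-empty bitwise AND. The sketch below checks that reading against the three nucleotide examples. shares_base is a hypothetical helper, and whether the library implements iscompatible this way is an assumption.

```julia
using BioSequences

# Hypothetical helper: true when x and y have at least one base in common,
# i.e. their bitwise AND is not the all-zero gap symbol.
shares_base(x::DNA, y::DNA) = (x & y) !== DNA_Gap

# Compare against the documented iscompatible examples.
pairs_to_check = [(DNA_A, DNA_A), (DNA_C, DNA_N), (DNA_C, DNA_R)]
agrees = all(shares_base(x, y) == iscompatible(x, y) for (x, y) in pairs_to_check)
```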
BioSymbols.isambiguous - Function.
isambiguous(nt::NucleicAcid)
Test if nt is an ambiguous nucleotide.
isambiguous(aa::AminoAcid)
Test if aa is an ambiguous amino acid.
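A brief usage sketch, based on the symbol tables earlier in this section: symbols whose meaning lists more than one alternative are the ones expected to be reported as ambiguous.

```julia
using BioSequences

isambiguous(DNA_A)   # false: exactly one base
isambiguous(DNA_R)   # true: A or G
isambiguous(DNA_N)   # true: any base
isambiguous(AA_A)    # false: alanine only
isambiguous(AA_X)    # true: any amino acid
```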