Indexing & modifying sequences
Indexing
Most BioSequence
concrete subtypes for the most part behave like other vector or string types. They can be indexed using integers or ranges:
For example, with LongSequence
s:
julia> seq = dna"ACGTTTANAGTNNAGTACC"
19nt DNA Sequence:
ACGTTTANAGTNNAGTACC
julia> seq[5]
DNA_T
julia> seq[6:end]
14nt DNA Sequence:
TANAGTNNAGTACC
The biological symbol at a given locus in a biological sequence can be set using setindex:
julia> seq = dna"ACGTTTANAGTNNAGTACC"
19nt DNA Sequence:
ACGTTTANAGTNNAGTACC
julia> seq[5] = DNA_A
DNA_A
Some types such can be indexed using integers but not using ranges.
For LongSequence
types, indexing a sequence by range creates a copy of the original sequence, similar to Array
in Julia's Base
library. If you find yourself slowed down by the allocation of these subsequences, consider using a sequence view instead.
Modifying sequences
In addition to setindex
, many other modifying operations are possible for biological sequences such as push!
, pop!
, and insert!
, which should be familiar to anyone used to editing arrays.
Base.push!
— Methodpush!(seq::BioSequence, x)
Append a biological symbol x
to a biological sequence seq
.
Base.pop!
— Methodpop!(seq::BioSequence)
Remove the symbol from the end of a biological sequence seq
and return it. Returns a variable of eltype(seq)
.
Base.pushfirst!
— Methodpushfirst!(seq, x)
Insert a biological symbol x
at the beginning of a biological sequence seq
.
Base.popfirst!
— Methodpopfirst!(seq)
Remove the symbol from the beginning of a biological sequence seq
and return it. Returns a variable of eltype(seq)
.
Base.insert!
— Methodinsert!(seq::BioSequence, i, x)
Insert a biological symbol x
into a biological sequence seq
, at the given index i
.
Base.deleteat!
— Methoddeleteat!(seq::BioSequence, i::Integer)
Delete a biological symbol at a single position i
in a biological sequence seq
.
Modifies the input sequence.
Base.append!
— Methodappend!(seq, other)
Add a biological sequence other
onto the end of biological sequence seq
. Modifies and returns seq
.
Base.resize!
— Methodresize!(seq, size, [force::Bool])
Resize a biological sequence seq
, to a given size
. Does not resize the underlying data array unless the new size does not fit. If force
, always resize underlying data array.
Base.empty!
— Methodempty!(seq::BioSequence)
Completely empty a biological sequence seq
of nucleotides.
Here are some examples:
julia> seq = dna"ACG"
3nt DNA Sequence:
ACG
julia> push!(seq, DNA_T)
4nt DNA Sequence:
ACGT
julia> append!(seq, dna"AT")
6nt DNA Sequence:
ACGTAT
julia> deleteat!(seq, 2)
5nt DNA Sequence:
AGTAT
julia> deleteat!(seq, 2:3)
3nt DNA Sequence:
AAT
Additional transformations
In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.
Base.reverse!
— Methodreverse!(seq::LongSequence)
Reverse a biological sequence seq
in place.
Base.reverse
— Methodreverse(seq::BioSequence)
Create reversed copy of a biological sequence.
reverse(seq::LongSequence)
Create reversed copy of a biological sequence.
BioSequences.complement!
— Functioncomplement!(seq)
Make a complement sequence of seq
in place.
BioSymbols.complement
— Functioncomplement(nt::NucleicAcid)
Return the complementary nucleotide of nt
.
This function returns the union of all possible complementary nucleotides.
Examples
julia> complement(DNA_A)
DNA_T
julia> complement(DNA_N)
DNA_N
julia> complement(RNA_U)
RNA_A
complement(seq)
Make a complement sequence of seq
.
BioSequences.reverse_complement!
— Functionreverse_complement!(seq)
Make a reversed complement sequence of seq
in place.
BioSequences.reverse_complement
— Functionreverse_complement(seq)
Make a reversed complement sequence of seq
.
BioSequences.ungap!
— FunctionRemove gap characters from an input sequence.
BioSequences.ungap
— FunctionCreate a copy of a sequence with gap characters removed.
BioSequences.canonical!
— Functioncanonical!(seq::NucleotideSeq)
Transforms the seq
into its canonical form, if it is not already canonical. Modifies the input sequence inplace.
For any sequence, there is a reverse complement, which is the same sequence, but on the complimentary strand of DNA:
------->
ATCGATCG
CGATCGAT
<-------
Using the reverse_complement
of a DNA sequence will give give this reverse complement.
Of the two sequences, the canonical of the two sequences is the lesser of the two i.e. canonical_seq < other_seq
.
Using this function on a seq
will ensure it is the canonical version.
BioSequences.canonical
— Functioncanonical(seq::NucleotideSeq)
Create the canonical sequence of seq
.
Some examples:
julia> seq = dna"ACGTAT"
6nt DNA Sequence:
ACGTAT
julia> reverse!(seq)
6nt DNA Sequence:
TATGCA
julia> complement!(seq)
6nt DNA Sequence:
ATACGT
julia> reverse_complement!(seq)
6nt DNA Sequence:
ACGTAT
Many of these methods also have a version which makes a copy of the input sequence, so you get a modified copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse
instead of reverse!
, and ungap
instead of ungap!
.
Translation
Translation is a slightly more complex transformation for RNA Sequences and so we describe it here in more detail.
The translate
function translates a sequence of codons in a RNA sequence to a amino acid sequence based on a genetic code. The BioSequences
package provides all NCBI defined genetic codes and they are registered in ncbi_trans_table
.
BioSequences.translate
— Functiontranslate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, alternative_start=false)
Translate an LongRNA
or a LongDNA
to an LongAA
.
Translation uses genetic code code
to map codons to amino acids. See ncbi_trans_table
for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X
if allow_ambiguous_codons
is true
and otherwise result in an error. For organisms that utilize alternative start codons, one can set alternative_start=true
, in which case the first codon will always be converted to a methionine.
BioSequences.ncbi_trans_table
— ConstantGenetic code list of NCBI.
The standard genetic code is ncbi_trans_table[1]
and others can be shown by show(ncbi_trans_table)
. For more details, consult the next link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.
julia> ncbi_trans_table
Translation Tables:
1. The Standard Code (standard_genetic_code)
2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code)
3. The Yeast Mitochondrial Code (yeast_mitochondrial_genetic_code)
4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (mold_mitochondrial_genetic_code)
5. The Invertebrate Mitochondrial Code (invertebrate_mitochondrial_genetic_code)
6. The Ciliate, Dasycladacean and Hexamita Nuclear Code (ciliate_nuclear_genetic_code)
9. The Echinoderm and Flatworm Mitochondrial Code (echinoderm_mitochondrial_genetic_code)
10. The Euplotid Nuclear Code (euplotid_nuclear_genetic_code)
11. The Bacterial, Archaeal and Plant Plastid Code (bacterial_plastid_genetic_code)
12. The Alternative Yeast Nuclear Code (alternative_yeast_nuclear_genetic_code)
13. The Ascidian Mitochondrial Code (ascidian_mitochondrial_genetic_code)
14. The Alternative Flatworm Mitochondrial Code (alternative_flatworm_mitochondrial_genetic_code)
15. Blepharisma Macronuclear Code (blepharisma_macronuclear_genetic_code)
16. Chlorophycean Mitochondrial Code (chlorophycean_mitochondrial_genetic_code)
21. Trematode Mitochondrial Code (trematode_mitochondrial_genetic_code)
22. Scenedesmus obliquus Mitochondrial Code (scenedesmus_obliquus_mitochondrial_genetic_code)
23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code)
24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code)
25. Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)
https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes