Indexing & modifying sequences
Indexing
Most BioSequence concrete subtypes for the most part behave like other vector or string types. They can be indexed using integers or ranges:
For example, with LongSequences:
julia> seq = dna"ACGTTTANAGTNNAGTACC"
19nt DNA Sequence:
ACGTTTANAGTNNAGTACC
julia> seq[5]
DNA_T
julia> seq[6:end]
14nt DNA Sequence:
TANAGTNNAGTACC
The biological symbol at a given locus in a biological sequence can be set using setindex:
julia> seq = dna"ACGTTTANAGTNNAGTACC"
19nt DNA Sequence:
ACGTTTANAGTNNAGTACC
julia> seq[5] = DNA_A
DNA_A
Some types such can be indexed using integers but not using ranges.
For LongSequence types, indexing a sequence by range creates a copy of the original sequence, similar to Array in Julia's Base library. If you find yourself slowed down by the allocation of these subsequences, consider using a sequence view instead.
Modifying sequences
In addition to setindex, many other modifying operations are possible for biological sequences such as push!, pop!, and insert!, which should be familiar to anyone used to editing arrays.
Base.push! — Methodpush!(seq::BioSequence, x)Append a biological symbol x to a biological sequence seq.
Base.pop! — Methodpop!(seq::BioSequence)Remove the symbol from the end of a biological sequence seq and return it. Returns a variable of eltype(seq).
Base.pushfirst! — Methodpushfirst!(seq, x)Insert a biological symbol x at the beginning of a biological sequence seq.
Base.popfirst! — Methodpopfirst!(seq)Remove the symbol from the beginning of a biological sequence seq and return it. Returns a variable of eltype(seq).
Base.insert! — Methodinsert!(seq::BioSequence, i, x)Insert a biological symbol x into a biological sequence seq, at the given index i.
Base.deleteat! — Methoddeleteat!(seq::BioSequence, i::Integer)Delete a biological symbol at a single position i in a biological sequence seq.
Modifies the input sequence.
Base.append! — Methodappend!(seq, other)Add a biological sequence other onto the end of biological sequence seq. Modifies and returns seq.
Base.resize! — Methodresize!(seq, size, [force::Bool])Resize a biological sequence seq, to a given size. Does not resize the underlying data array unless the new size does not fit. If force, always resize underlying data array.
Base.empty! — Methodempty!(seq::BioSequence)Completely empty a biological sequence seq of nucleotides.
Here are some examples:
julia> seq = dna"ACG"
3nt DNA Sequence:
ACG
julia> push!(seq, DNA_T)
4nt DNA Sequence:
ACGT
julia> append!(seq, dna"AT")
6nt DNA Sequence:
ACGTAT
julia> deleteat!(seq, 2)
5nt DNA Sequence:
AGTAT
julia> deleteat!(seq, 2:3)
3nt DNA Sequence:
AAT
Additional transformations
In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.
Base.reverse! — Methodreverse!(seq::LongSequence)Reverse a biological sequence seq in place.
Base.reverse — Methodreverse(seq::BioSequence)Create reversed copy of a biological sequence.
reverse(seq::LongSequence)Create reversed copy of a biological sequence.
BioSequences.complement! — Functioncomplement!(seq)Make a complement sequence of seq in place.
BioSymbols.complement — Functioncomplement(nt::NucleicAcid)Return the complementary nucleotide of nt.
This function returns the union of all possible complementary nucleotides.
Examples
julia> complement(DNA_A)
DNA_T
julia> complement(DNA_N)
DNA_N
julia> complement(RNA_U)
RNA_A
complement(seq)Make a complement sequence of seq.
BioSequences.reverse_complement! — Functionreverse_complement!(seq)Make a reversed complement sequence of seq in place.
BioSequences.reverse_complement — Functionreverse_complement(seq)Make a reversed complement sequence of seq.
BioSequences.ungap! — FunctionRemove gap characters from an input sequence.
BioSequences.ungap — FunctionCreate a copy of a sequence with gap characters removed.
BioSequences.canonical! — Functioncanonical!(seq::NucleotideSeq)Transforms the seq into its canonical form, if it is not already canonical. Modifies the input sequence inplace.
For any sequence, there is a reverse complement, which is the same sequence, but on the complimentary strand of DNA:
------->
ATCGATCG
CGATCGAT
<-------Using the reverse_complement of a DNA sequence will give give this reverse complement.
Of the two sequences, the canonical of the two sequences is the lesser of the two i.e. canonical_seq < other_seq.
Using this function on a seq will ensure it is the canonical version.
BioSequences.canonical — Functioncanonical(seq::NucleotideSeq)Create the canonical sequence of seq.
Some examples:
julia> seq = dna"ACGTAT"
6nt DNA Sequence:
ACGTAT
julia> reverse!(seq)
6nt DNA Sequence:
TATGCA
julia> complement!(seq)
6nt DNA Sequence:
ATACGT
julia> reverse_complement!(seq)
6nt DNA Sequence:
ACGTAT
Many of these methods also have a version which makes a copy of the input sequence, so you get a modified copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse instead of reverse!, and ungap instead of ungap!.
Translation
Translation is a slightly more complex transformation for RNA Sequences and so we describe it here in more detail.
The translate function translates a sequence of codons in a RNA sequence to a amino acid sequence based on a genetic code. The BioSequences package provides all NCBI defined genetic codes and they are registered in ncbi_trans_table.
BioSequences.translate — Functiontranslate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, alternative_start=false)Translate an LongRNA or a LongDNA to an LongAA.
Translation uses genetic code code to map codons to amino acids. See ncbi_trans_table for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X if allow_ambiguous_codons is true and otherwise result in an error. For organisms that utilize alternative start codons, one can set alternative_start=true, in which case the first codon will always be converted to a methionine.
BioSequences.ncbi_trans_table — ConstantGenetic code list of NCBI.
The standard genetic code is ncbi_trans_table[1] and others can be shown by show(ncbi_trans_table). For more details, consult the next link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.
julia> ncbi_trans_table
Translation Tables:
1. The Standard Code (standard_genetic_code)
2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code)
3. The Yeast Mitochondrial Code (yeast_mitochondrial_genetic_code)
4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (mold_mitochondrial_genetic_code)
5. The Invertebrate Mitochondrial Code (invertebrate_mitochondrial_genetic_code)
6. The Ciliate, Dasycladacean and Hexamita Nuclear Code (ciliate_nuclear_genetic_code)
9. The Echinoderm and Flatworm Mitochondrial Code (echinoderm_mitochondrial_genetic_code)
10. The Euplotid Nuclear Code (euplotid_nuclear_genetic_code)
11. The Bacterial, Archaeal and Plant Plastid Code (bacterial_plastid_genetic_code)
12. The Alternative Yeast Nuclear Code (alternative_yeast_nuclear_genetic_code)
13. The Ascidian Mitochondrial Code (ascidian_mitochondrial_genetic_code)
14. The Alternative Flatworm Mitochondrial Code (alternative_flatworm_mitochondrial_genetic_code)
16. Chlorophycean Mitochondrial Code (chlorophycean_mitochondrial_genetic_code)
21. Trematode Mitochondrial Code (trematode_mitochondrial_genetic_code)
22. Scenedesmus obliquus Mitochondrial Code (scenedesmus_obliquus_mitochondrial_genetic_code)
23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code)
24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code)
25. Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)
https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes