Indexing & modifying sequences

Indexing

Most BioSequence concrete subtypes for the most part behave like other vector or string types. They can be indexed using integers or ranges:

For example, with LongSequences:

julia> seq = dna"ACGTTTANAGTNNAGTACC"
19nt DNA Sequence:
ACGTTTANAGTNNAGTACC

julia> seq[5]
DNA_T

julia> seq[6:end]
14nt DNA Sequence:
TANAGTNNAGTACC

The biological symbol at a given locus in a biological sequence can be set using setindex:

julia> seq = dna"ACGTTTANAGTNNAGTACC"
19nt DNA Sequence:
ACGTTTANAGTNNAGTACC

julia> seq[5] = DNA_A
DNA_A

Note

Some types such can be indexed using integers but not using ranges.

For LongSequence types, indexing a sequence by range creates a copy of the original sequence, similar to Array in Julia's Base library. If you find yourself slowed down by the allocation of these subsequences, consider using a sequence view instead.

Modifying sequences

In addition to setindex, many other modifying operations are possible for biological sequences such as push!, pop!, and insert!, which should be familiar to anyone used to editing arrays.

Base.push! — Method

push!(seq::BioSequence, x)

Append a biological symbol x to a biological sequence seq.

Base.pop! — Method

pop!(seq::BioSequence)

Remove the symbol from the end of a biological sequence seq and return it. Returns a variable of eltype(seq).

Base.pushfirst! — Method

pushfirst!(seq, x)

Insert a biological symbol x at the beginning of a biological sequence seq.

Base.popfirst! — Method

popfirst!(seq)

Remove the symbol from the beginning of a biological sequence seq and return it. Returns a variable of eltype(seq).

Base.insert! — Method

insert!(seq::BioSequence, i, x)

Insert a biological symbol x into a biological sequence seq, at the given index i. Returns the mutated seq.

Examples

julia> seq = dna"ATGCA"
5nt DNA Sequence:
ATGCA

julia> insert!(seq, 3, 'A')
6nt DNA Sequence:
ATAGCA

Base.deleteat! — Method

deleteat!(seq::BioSequence, i::Integer)

Delete a biological symbol at a single position i in a biological sequence seq.

Modifies the input sequence.

Base.append! — Method

append!(seq, other)

Add a biological sequence other onto the end of biological sequence seq. Modifies and returns seq.

Base.resize! — Method

resize!(seq, size, [force::Bool=false])

Resize a biological sequence seq, to a given size. Does not resize the underlying data array unless the new size does not fit. If force, always resize underlying data array.

Note that resizing to a larger size, and then loading from uninitialized positions is not allowed and may cause undefined behaviour. Make sure to always fill any uninitialized biosymbols after resizing.

Base.empty! — Method

empty!(seq::BioSequence)

Completely empty a biological sequence seq of nucleotides.

BioSequences.spliceinto! — Method

spliceinto!(seq::BioSequence, i::Integer, x)

Insert the sequence x into a biological sequence seq, at the given index i. After splicing, the seq's symbols at indices i:i+length(x)-1 are equal to x, and the the symbols that were previously there are moved to the right.

Examples

julia> seq = dna"TAGTGCA";

julia> spliceinto!(seq, 3, "CAGGA")
12nt DNA sequence:
TACAGGAGTGCA

BioSequences.spliceinto! — Method

spliceinto!(seq::BioSequence, span::UnitRange, x)

Delete the symbols at indices span in seq, and then copy x into the first deleted position, then return seq.

This is equivalent to deleteat!(seq, span); spliceinto!(seq, first(span), x), but is more efficient. span must be nonempty, or this function will throw an ArgumentError. To handle potentially empty spans, check if the span is empty, and if so use spliceinto(seq, first(span), x).

Examples

julia> seq = dna"TAGTGCA";

julia> spliceinto!(seq, 3:5, "CAGGA")
9nt DNA sequence:
TACAGGACA

Here are some examples:

julia> seq = dna"ACG"
3nt DNA Sequence:
ACG

julia> push!(seq, DNA_T)
4nt DNA Sequence:
ACGT

julia> append!(seq, dna"AT")
6nt DNA Sequence:
ACGTAT

julia> deleteat!(seq, 2)
5nt DNA Sequence:
AGTAT

julia> deleteat!(seq, 2:3)
3nt DNA Sequence:
AAT

Additional transformations

In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.

Base.reverse! — Method

reverse!(seq::LongSequence)

Reverse a biological sequence seq in place.

Base.reverse — Method

reverse(seq::BioSequence)

Create reversed copy of a biological sequence.

reverse(seq::LongSequence)

Create reversed copy of a biological sequence.

BioSequences.complement! — Function

complement!(seq)

Make a complement sequence of seq in place.

BioSymbols.complement — Function

complement(nt::NucleicAcid)

Return the complementary nucleotide of nt.

This function returns the union of all possible complementary nucleotides.

Examples

julia> complement(DNA_A)
DNA_T

julia> complement(DNA_N)
DNA_N

julia> complement(RNA_U)
RNA_A

complement(seq)

Make a complement sequence of seq.

BioSequences.reverse_complement! — Function

reverse_complement!(seq)

Make a reversed complement sequence of seq in place.

BioSequences.reverse_complement — Function

reverse_complement(seq)

Make a reversed complement sequence of seq.

BioSequences.ungap! — Function

Remove gap characters from an input sequence.

BioSequences.ungap — Function

Create a copy of a sequence with gap characters removed.

BioSequences.canonical! — Function

canonical!(seq::NucleotideSeq)

Transforms the seq into its canonical form, if it is not already canonical. Modifies the input sequence inplace.

For any sequence, there is a reverse complement, which is the same sequence, but on the complimentary strand of DNA:

------->
ATCGATCG
CGATCGAT
<-------

Note

Using the reverse_complement of a DNA sequence will give give this reverse complement.

Of the two sequences, the canonical of the two sequences is the lesser of the two i.e. canonical_seq < other_seq.

Using this function on a seq will ensure it is the canonical version.

BioSequences.canonical — Function

canonical(seq::NucleotideSeq)

Create the canonical sequence of seq.

Some examples:

julia> seq = dna"ACGTAT"
6nt DNA Sequence:
ACGTAT

julia> reverse!(seq)
6nt DNA Sequence:
TATGCA

julia> complement!(seq)
6nt DNA Sequence:
ATACGT

julia> reverse_complement!(seq)
6nt DNA Sequence:
ACGTAT

Many of these methods also have a version which makes a copy of the input sequence, so you get a modified copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse instead of reverse!, and ungap instead of ungap!.

Translation

Translation is a slightly more complex transformation for RNA Sequences and so we describe it here in more detail.

The translate function translates a sequence of codons in a RNA sequence to a amino acid sequence based on a genetic code. The BioSequences package provides all NCBI defined genetic codes and they are registered in ncbi_trans_table.

BioSequences.translate — Function

translate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, alternative_start=false)

Translate an LongRNA or a LongDNA to an LongAA.

Translation uses genetic code code to map codons to amino acids. See ncbi_trans_table for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X if allow_ambiguous_codons is true and otherwise result in an error. For organisms that utilize alternative start codons, one can set alternative_start=true, in which case the first codon will always be converted to a methionine.

BioSequences.ncbi_trans_table — Constant

Genetic code list of NCBI.

The standard genetic code is ncbi_trans_table[1] and others can be shown by show(ncbi_trans_table). For more details, consult the next link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.

julia> ncbi_trans_table
Translation Tables:
  1. The Standard Code (standard_genetic_code)
  2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code)
  3. The Yeast Mitochondrial Code (yeast_mitochondrial_genetic_code)
  4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (mold_mitochondrial_genetic_code)
  5. The Invertebrate Mitochondrial Code (invertebrate_mitochondrial_genetic_code)
  6. The Ciliate, Dasycladacean and Hexamita Nuclear Code (ciliate_nuclear_genetic_code)
  9. The Echinoderm and Flatworm Mitochondrial Code (echinoderm_mitochondrial_genetic_code)
 10. The Euplotid Nuclear Code (euplotid_nuclear_genetic_code)
 11. The Bacterial, Archaeal and Plant Plastid Code (bacterial_plastid_genetic_code)
 12. The Alternative Yeast Nuclear Code (alternative_yeast_nuclear_genetic_code)
 13. The Ascidian Mitochondrial Code (ascidian_mitochondrial_genetic_code)
 14. The Alternative Flatworm Mitochondrial Code (alternative_flatworm_mitochondrial_genetic_code)
 15. Blepharisma Macronuclear Code (blepharisma_macronuclear_genetic_code)
 16. Chlorophycean Mitochondrial Code (chlorophycean_mitochondrial_genetic_code)
 21. Trematode Mitochondrial Code (trematode_mitochondrial_genetic_code)
 22. Scenedesmus obliquus Mitochondrial Code (scenedesmus_obliquus_mitochondrial_genetic_code)
 23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code)
 24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code)
 25. Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)

https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes