Indexing & modifying sequences
Indexing
Most BioSequence
concrete subtypes for the most part behave like other vector or string types. They can be indexed using integers or ranges:
For example, with LongSequence
s:
julia> seq = dna"ACGTTTANAGTNNAGTACC"
19nt DNA Sequence:
ACGTTTANAGTNNAGTACC
julia> seq[5]
DNA_T
julia> seq[6:end]
14nt DNA Sequence:
TANAGTNNAGTACC
The biological symbol at a given locus in a biological sequence can be set using setindex:
julia> seq = dna"ACGTTTANAGTNNAGTACC"
19nt DNA Sequence:
ACGTTTANAGTNNAGTACC
julia> seq[5] = DNA_A
DNA_A
Some types such as Kmer
can be indexed using integers but not using ranges.
For LongSequence
types, indexing a sequence by range creates a subsequence of the original sequence. Unlike Arrays
in the standard library, creating this subsequence is copy-free: the subsequence simply points to the original LongSequence
data with its range. You may think that this is unsafe because modifying subsequences propagates to the original sequence, but this doesn't happen actually:
julia> seq = dna"AAAA" # create a sequence
4nt DNA Sequence:
AAAA
julia> subseq = seq[1:2] # create a subsequence from `seq`
2nt DNA Sequence:
AA
julia> subseq[2] = DNA_T # modify the second element of it
DNA_T
julia> subseq # the subsequence is modified
2nt DNA Sequence:
AT
julia> seq # but the original sequence is not
4nt DNA Sequence:
AAAA
This is because modifying a sequence checks whether its underlying data are shared with other sequences under the hood. If and only if the data are shared, the subsequence creates a copy of itself. Any modifying operation does this check. This is called copy-on-write strategy and users don't need to care about it because it is transparent: If the user modifies a sequence with or subsequence, the job of managing and protecting the underlying data of sequences is handled for them.
Modifying sequences
In addition to setindex
, many other modifying operations are possible for biological sequences such as push!
, pop!
, and insert!
, which should be familiar to anyone used to editing arrays.
Base.push!
— Functionpush!(collection, items...) -> collection
Insert one or more items
in collection
. If collection
is an ordered container, the items are inserted at the end (in the given order).
Examples
julia> push!([1, 2, 3], 4, 5, 6)
6-element Vector{Int64}:
1
2
3
4
5
6
If collection
is ordered, use append!
to add all the elements of another collection to it. The result of the preceding example is equivalent to append!([1, 2, 3], [4, 5, 6])
. For AbstractSet
objects, union!
can be used instead.
See sizehint!
for notes about the performance model.
See also pushfirst!
.
Base.pop!
— Functionpop!(collection) -> item
Remove an item in collection
and return it. If collection
is an ordered container, the last item is returned; for unordered containers, an arbitrary element is returned.
See also: popfirst!
, popat!
, delete!
, deleteat!
, splice!
, and push!
.
Examples
julia> A=[1, 2, 3]
3-element Vector{Int64}:
1
2
3
julia> pop!(A)
3
julia> A
2-element Vector{Int64}:
1
2
julia> S = Set([1, 2])
Set{Int64} with 2 elements:
2
1
julia> pop!(S)
2
julia> S
Set{Int64} with 1 element:
1
julia> pop!(Dict(1=>2))
1 => 2
pop!(collection, key[, default])
Delete and return the mapping for key
if it exists in collection
, otherwise return default
, or throw an error if default
is not specified.
Examples
julia> d = Dict("a"=>1, "b"=>2, "c"=>3);
julia> pop!(d, "a")
1
julia> pop!(d, "d")
ERROR: KeyError: key "d" not found
Stacktrace:
[...]
julia> pop!(d, "e", 4)
4
pop!(seq::BioSequence)
Remove the symbol from the end of a biological sequence seq
and return it. Returns a variable of eltype(seq)
.
Base.pushfirst!
— Functionpushfirst!(collection, items...) -> collection
Insert one or more items
at the beginning of collection
.
This function is called unshift
in many other programming languages.
Examples
julia> pushfirst!([1, 2, 3, 4], 5, 6)
6-element Vector{Int64}:
5
6
1
2
3
4
pushfirst!(seq, x)
Insert a biological symbol x
at the beginning of a biological sequence seq
.
Base.popfirst!
— Functionpopfirst!(collection) -> item
Remove the first item
from collection
.
This function is called shift
in many other programming languages.
See also: pop!
, popat!
, delete!
.
Examples
julia> A = [1, 2, 3, 4, 5, 6]
6-element Vector{Int64}:
1
2
3
4
5
6
julia> popfirst!(A)
1
julia> A
5-element Vector{Int64}:
2
3
4
5
6
popfirst!(seq)
Remove the symbol from the beginning of a biological sequence seq
and return it. Returns a variable of eltype(seq)
.
Base.insert!
— Functioninsert!(a::Vector, index::Integer, item)
Insert an item
into a
at the given index
. index
is the index of item
in the resulting a
.
See also: push!
, replace
, popat!
, splice!
.
Examples
julia> insert!(Any[1:6;], 3, "here")
7-element Vector{Any}:
1
2
"here"
3
4
5
6
insert!(seq::BioSequence, i, x)
Insert a biological symbol x
into a biological sequence seq
, at the given index i
.
Base.deleteat!
— Methoddeleteat!(seq::BioSequence, i::Integer)
Delete a biological symbol at a single position i
in a biological sequence seq
.
Modifies the input sequence.
Base.append!
— Functionappend!(collection, collections...) -> collection.
For an ordered container collection
, add the elements of each collections
to the end of it.
Specifying multiple collections to be appended requires at least Julia 1.6.
Examples
julia> append!([1], [2, 3])
3-element Vector{Int64}:
1
2
3
julia> append!([1, 2, 3], [4, 5], [6])
6-element Vector{Int64}:
1
2
3
4
5
6
Use push!
to add individual items to collection
which are not already themselves in another collection. The result of the preceding example is equivalent to push!([1, 2, 3], 4, 5, 6)
.
See sizehint!
for notes about the performance model.
See also vcat
for vectors, union!
for sets, and prepend!
and pushfirst!
for the opposite order.
append!(seq, other)
Add a biological sequence other
onto the end of biological sequence seq
. Modifies and returns seq
.
Base.resize!
— Functionresize!(a::Vector, n::Integer) -> Vector
Resize a
to contain n
elements. If n
is smaller than the current collection length, the first n
elements will be retained. If n
is larger, the new elements are not guaranteed to be initialized.
Examples
julia> resize!([6, 5, 4, 3, 2, 1], 3)
3-element Vector{Int64}:
6
5
4
julia> a = resize!([6, 5, 4, 3, 2, 1], 8);
julia> length(a)
8
julia> a[1:6]
6-element Vector{Int64}:
6
5
4
3
2
1
resize!(seq, size, [force::Bool])
Resize a biological sequence seq
, to a given size
. Does not resize the underlying data array unless the new size does not fit. If force
, always resize underlying data array.
Base.empty!
— Functionempty!(collection) -> collection
Remove all elements from a collection
.
Examples
julia> A = Dict("a" => 1, "b" => 2)
Dict{String, Int64} with 2 entries:
"b" => 2
"a" => 1
julia> empty!(A);
julia> A
Dict{String, Int64}()
empty!(seq::BioSequence)
Completely empty a biological sequence seq
of nucleotides.
Here are some examples:
julia> seq = dna"ACG"
3nt DNA Sequence:
ACG
julia> push!(seq, DNA_T)
4nt DNA Sequence:
ACGT
julia> append!(seq, dna"AT")
6nt DNA Sequence:
ACGTAT
julia> deleteat!(seq, 2)
5nt DNA Sequence:
AGTAT
julia> deleteat!(seq, 2:3)
3nt DNA Sequence:
AAT
Additional transformations
In addition to these basic modifying functions, other sequence transformations that are common in bioinformatics are also provided.
Base.reverse!
— Methodreverse!(seq::LongSequence)
Reverse a biological sequence seq
in place.
Base.reverse
— Methodreverse(seq::LongSequence)
Create reversed copy of a biological sequence.
BioSequences.complement!
— Functioncomplement!(seq)
Make a complement sequence of seq
in place.
BioSymbols.complement
— Functioncomplement(nt::NucleicAcid)
Return the complementary nucleotide of nt
.
This function returns the union of all possible complementary nucleotides.
Examples
julia> complement(DNA_A)
DNA_T
julia> complement(DNA_N)
DNA_N
julia> complement(RNA_U)
RNA_A
complement(seq)
Make a complement sequence of seq
.
complement(x::T) where {T <: Skipmer}
Return the complement of a short sequence type x
.
BioSequences.reverse_complement!
— Functionreverse_complement!(seq)
Make a reversed complement sequence of seq
in place.
BioSequences.reverse_complement
— Functionreverse_complement(seq)
Make a reversed complement sequence of seq
.
reverse_complement(x::Skipmer)
Return the reverse complement of x
.
BioSequences.ungap!
— FunctionRemove gap characters from an input sequence.
BioSequences.ungap
— FunctionCreate a copy of a sequence with gap characters removed.
BioSequences.canonical!
— Functioncanonical!(seq::NucleotideSeq)
Transforms the seq
into its canonical form, if it is not already canonical. Modifies the input sequence inplace.
For any sequence, there is a reverse complement, which is the same sequence, but on the complimentary strand of DNA:
------->
ATCGATCG
CGATCGAT
<-------
Using the reverse_complement
of a DNA sequence will give give this reverse complement.
Of the two sequences, the canonical of the two sequences is the lesser of the two i.e. canonical_seq < other_seq
.
Using this function on a seq
will ensure it is the canonical version.
BioSequences.canonical
— Functioncanonical(seq::NucleotideSeq)
Create the canonical sequence of seq
.
canonical(kmer::Skipmer)
Return the canonical sequence of x
.
A canonical sequence is the numerical lesser of a k-mer and its reverse complement. This is useful in hashing/counting sequences in data that is not strand specific, and thus observing the short sequence is equivalent to observing its reverse complement.
Some examples:
julia> seq = dna"ACGTAT"
6nt DNA Sequence:
ACGTAT
julia> reverse!(seq)
6nt DNA Sequence:
TATGCA
julia> complement!(seq)
6nt DNA Sequence:
ATACGT
julia> reverse_complement!(seq)
6nt DNA Sequence:
ACGTAT
Many of these methods also have a version which makes a copy of the input sequence, so you get a modified copy, and don't alter the original sequence. Such methods are named the same, but without the exclamation mark. E.g. reverse
instead of reverse!
, and ungap
instead of ungap!
.
Translation
Translation is a slightly more complex transformation for RNA Sequences and so we describe it here in more detail.
The translate
function translates a sequence of codons in a RNA sequence to a amino acid sequence based on a genetic code. The BioSequences
package provides all NCBI defined genetic codes and they are registered in ncbi_trans_table
.
BioSequences.translate
— Functiontranslate(seq, code=standard_genetic_code, allow_ambiguous_codons=true, convert_start_codon=false)
Translate an LongRNASeq
or a LongDNASeq
to an LongAminoAcidSeq
.
Translation uses genetic code code
to map codons to amino acids. See ncbi_trans_table
for available genetic codes. If codons in the given sequence cannot determine a unique amino acid, they will be translated to AA_X
if allow_ambiguous_codons
is true
and otherwise result in an error. For organisms that utilize alternative start codons, one can set alternative_start=true
, in which case the first codon will always be converted to a methionine.
BioSequences.ncbi_trans_table
— ConstantGenetic code list of NCBI.
The standard genetic code is ncbi_trans_table[1]
and others can be shown by show(ncbi_trans_table)
. For more details, consult the next link: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes.
julia> ncbi_trans_table
Translation Tables:
1. The Standard Code (standard_genetic_code)
2. The Vertebrate Mitochondrial Code (vertebrate_mitochondrial_genetic_code)
3. The Yeast Mitochondrial Code (yeast_mitochondrial_genetic_code)
4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code (mold_mitochondrial_genetic_code)
5. The Invertebrate Mitochondrial Code (invertebrate_mitochondrial_genetic_code)
6. The Ciliate, Dasycladacean and Hexamita Nuclear Code (ciliate_nuclear_genetic_code)
9. The Echinoderm and Flatworm Mitochondrial Code (echinoderm_mitochondrial_genetic_code)
10. The Euplotid Nuclear Code (euplotid_nuclear_genetic_code)
11. The Bacterial, Archaeal and Plant Plastid Code (bacterial_plastid_genetic_code)
12. The Alternative Yeast Nuclear Code (alternative_yeast_nuclear_genetic_code)
13. The Ascidian Mitochondrial Code (ascidian_mitochondrial_genetic_code)
14. The Alternative Flatworm Mitochondrial Code (alternative_flatworm_mitochondrial_genetic_code)
16. Chlorophycean Mitochondrial Code (chlorophycean_mitochondrial_genetic_code)
21. Trematode Mitochondrial Code (trematode_mitochondrial_genetic_code)
22. Scenedesmus obliquus Mitochondrial Code (scenedesmus_obliquus_mitochondrial_genetic_code)
23. Thraustochytrium Mitochondrial Code (thraustochytrium_mitochondrial_genetic_code)
24. Pterobranchia Mitochondrial Code (pterobrachia_mitochondrial_genetic_code)
25. Candidate Division SR1 and Gracilibacteria Code (candidate_division_sr1_genetic_code)
https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=cgencodes