General-purpose sequences

BioSequence{A} is a generic sequence type parameterized by an alphabet type A that defines the domain (or set) of biological symbols, and each alphabet has an associated symbol type. For example, AminoAcidAlphabet is associated with AminoAcid and hence an object of the BioSequence{AminoAcidAlphabet} type represents a sequence of amino acids. Symbols from multiple alphabets can't be intermixed in one sequence type.

The following table summarizes common sequence types that are defined in the BioSequences module:

Type	Symbol type	Type alias
`BioSequence{DNAAlphabet{4}}`	`DNA`	`DNASequence`
`BioSequence{RNAAlphabet{4}}`	`RNA`	`RNASequence`
`BioSequence{AminoAcidAlphabet}`	`AminoAcid`	`AminoAcidSequence`
`BioSequence{CharAlphabet}`	`Char`	`CharSequence`

Parameterized definition of the BioSequence{A} type is for the purpose of unifying the data structure and operations of any symbol type. In most cases, users don't have to care about it and can use type aliases listed above. However, the alphabet type fixes the internal memory encoding and plays an important role when optimizing performance of a program (see Using a more compact sequence representation section for low-memory encodings). It also enables a user to define their own alphabet only by defining few numbers of methods. This is described in Defining a new alphabet section.

Constructing sequences

Using string literals

Most immediately, sequence literals can be constructed using the string macros dna, rna, aa, and char:

julia> dna"TACGTANNATC"
11nt DNA Sequence:
TACGTANNATC

julia> rna"AUUUGNCCANU"
11nt RNA Sequence:
AUUUGNCCANU

julia> aa"ARNDCQEGHILKMFPSTWYVX"
21aa Amino Acid Sequence:
ARNDCQEGHILKMFPSTWYVX

julia> char"αβγδϵ"
5char Char Sequence:
αβγδϵ

However it should be noted that by default these sequence literals allocate the BioSequence object before the code containing the sequence literal is run. This means there may be occasions where your program does not behave as you first expect. For example consider the following code:

julia> function foo()
           s = dna"CTT"
           push!(s, DNA_A)
       end
foo (generic function with 1 method)

You might expect that every time you call foo, that a DNA sequence CTTA would be returned. You might expect that this is because every time foo is called, a new DNA sequence variable CTT is created, and and A nucleotide is pushed to it, and the result, CTTA is returned. In other words you might expect the following output:

julia> foo()
4nt DNA Sequence:
CTTA

julia> foo()
4nt DNA Sequence:
CTTA

julia> foo()
4nt DNA Sequence:
CTTA

However, this is not what happens, instead the following happens:

julia> foo()
4nt DNA Sequence:
CTTA

julia> foo()
5nt DNA Sequence:
CTTAA

julia> foo()
6nt DNA Sequence:
CTTAAA

The reason for this is because the sequence literal is allocated only once before the first time the function foo is called and run. Therefore, s in foo is always a reference to that one sequence that was allocated. So one sequence is created before foo is called, and then it is pushed to every time foo is called. Thus, that one allocated sequence grows with every call of foo.

If you wanted foo to create a new sequence each time it is called, then you can add a flag to the end of the sequence literal to dictate behaviour: A flag of 's' means 'static': the sequence will be allocated before code is run, as is the default behaviour described above. However providing 'd' flag changes the behaviour: 'd' means 'dynamic': the sequence will be allocated at whilst the code is running, and not before. So to change foo so as it creates a new sequence each time it is called, simply add the 'd' flag to the sequence literal:

julia> function foo()
           s = dna"CTT"d     # 'd' flag appended to the string literal.
           push!(s, DNA_A)
       end
foo (generic function with 1 method)

Now every time foo is called, a new sequence CTT is created, and an A nucleotide is pushed to it:

julia> foo()
4nt DNA Sequence:
CTTA

julia> foo()
4nt DNA Sequence:
CTTA

julia> foo()
4nt DNA Sequence:
CTTA

So the take home message of sequence literals is this:

Be careful when you are using sequence literals inside of functions, and inside the bodies of things like for loops. And if you use them and are unsure, use the 's' and 'd' flags to ensure the behaviour you get is the behaviour you intend.

Other constructors and conversion

Sequences can also be constructed from strings or arrays of nucleotide or amino acid symbols using constructors or the convert function:

julia> DNASequence("TTANC")
5nt DNA Sequence:
TTANC

julia> DNASequence([DNA_T, DNA_T, DNA_A, DNA_N, DNA_C])
5nt DNA Sequence:
TTANC

julia> convert(DNASequence, [DNA_T, DNA_T, DNA_A, DNA_N, DNA_C])
5nt DNA Sequence:
TTANC

Using convert, these operations are reversible: sequences can be converted to strings or arrays:

julia> convert(String, dna"TTANGTA")
"TTANGTA"

julia> convert(Vector{DNA}, dna"TTANGTA")
7-element Array{DNA,1}:
 DNA_T
 DNA_T
 DNA_A
 DNA_N
 DNA_G
 DNA_T
 DNA_A

Sequences can also be concatenated into longer sequences:

julia> DNASequence(dna"ACGT", dna"NNNN", dna"TGCA")
12nt DNA Sequence:
ACGTNNNNTGCA

julia> dna"ACGT" * dna"TGCA"
8nt DNA Sequence:
ACGTTGCA

julia> repeat(dna"TA", 10)
20nt DNA Sequence:
TATATATATATATATATATA

julia> dna"TA" ^ 10
20nt DNA Sequence:
TATATATATATATATATATA

Despite being separate types, DNASequence and RNASequence can freely be converted between efficiently without copying the underlying data:

julia> dna = dna"TTANGTAGACCG"
12nt DNA Sequence:
TTANGTAGACCG

julia> rna = convert(RNASequence, dna)
12nt RNA Sequence:
UUANGUAGACCG

julia> dna.data === rna.data  # underlying data are same
true

A random sequence can be obtained by the randdnaseq, randrnaseq and randaaseq functions, which generate DNASequence, RNASequence and AminoAcidSequence, respectively. Generated sequences are composed of the standard symbols without ambiguity and gap. For example, randdnaseq(6) may generate dna"TCATAG" but never generates dna"TNANAG" or dna"T-ATAG".

A translatable RNASequence can also be converted to an AminoAcidSequence using the translate function.

Indexing, modifying and transformations

Getindex

Sequences for the most part behave like other vector or string types. They can be indexed using integers or ranges:

julia> seq = dna"ACGTTTANAGTNNAGTACC"
19nt DNA Sequence:
ACGTTTANAGTNNAGTACC

julia> seq[5]
DNA_T

julia> seq[6:end]
14nt DNA Sequence:
TANAGTNNAGTACC

Note that, indexing a biological sequence by range creates a subsequence of the original sequence. Unlike Arrays in the standard library, creating a subsequence is copy-free: a subsequence simply points to the original sequence data with its range. You may think that this is unsafe because modifying subsequences propagates to the original sequence, but this doesn't happen actually:

julia> seq = dna"AAAA"    # create a sequence
4nt DNA Sequence:
AAAA

julia> subseq = seq[1:2]  # create a subsequence from `seq`
2nt DNA Sequence:
AA

julia> subseq[2] = DNA_T  # modify the second element of it
DNA_T

julia> subseq             # the subsequence is modified
2nt DNA Sequence:
AT

julia> seq                # but the original sequence is not
4nt DNA Sequence:
AAAA

This is because modifying a sequence checks whether its underlying data are shared with other sequences under the hood. If and only if the data are shared, the subsequence creates a copy of itself. Any modifying operation does this check. This is called copy-on-write strategy and users don't need to care about it because it is transparent: If the user modifies a sequence with or subsequence, the job of managing and protecting the underlying data of sequences is handled for them.

Setindex and modifying DNA sequences

The biological symbol at a given locus in a biological sequence can be set using setindex:

julia> seq = dna"ACGTTTANAGTNNAGTACC"
19nt DNA Sequence:
ACGTTTANAGTNNAGTACC

julia> seq[5] = DNA_A
DNA_A

In addition, many other modifying operations are possible for biological sequences such as push!, pop!, and insert!, which should be familiar to people used to editing arrays.

Base.push! — Function.

push!(collection, items...) -> collection

Insert one or more items at the end of collection.

Examples

julia> push!([1, 2, 3], 4, 5, 6)
6-element Array{Int64,1}:
 1
 2
 3
 4
 5
 6

Use append! to add all the elements of another collection to collection. The result of the preceding example is equivalent to append!([1, 2, 3], [4, 5, 6]).