Construction & conversion
Here we will showcase the various ways you can construct the various sequence types in BioSequences.
Constructing sequences
From strings
Sequences can be constructed from strings using their constructors:
julia> LongDNA{4}("TTANC")
5nt DNA Sequence:
TTANC
julia> LongSequence{DNAAlphabet{2}}("TTAGC")
5nt DNA Sequence:
TTAGC
julia> LongRNA{4}("UUANC")
5nt RNA Sequence:
UUANC
julia> LongSequence{RNAAlphabet{2}}("UUAGC")
5nt RNA Sequence:
UUAGC
Type alias' can also be used for brevity.
julia> LongDNA{4}("TTANC")
5nt DNA Sequence:
TTANC
julia> LongDNA{2}("TTAGC")
5nt DNA Sequence:
TTAGC
julia> LongRNA{4}("UUANC")
5nt RNA Sequence:
UUANC
julia> LongRNA{2}("UUAGC")
5nt RNA Sequence:
UUAGC
Constructing sequences from arrays of BioSymbols
Sequences can be constructed using vectors or arrays of a BioSymbol
type:
julia> LongDNA{4}([DNA_T, DNA_T, DNA_A, DNA_N, DNA_C])
5nt DNA Sequence:
TTANC
julia> LongSequence{DNAAlphabet{2}}([DNA_T, DNA_T, DNA_A, DNA_G, DNA_C])
5nt DNA Sequence:
TTAGC
Constructing sequences from other sequences
You can create sequences, by concatenating other sequences together:
julia> LongDNA{2}("ACGT") * LongDNA{2}("TGCA")
8nt DNA Sequence:
ACGTTGCA
julia> repeat(LongDNA{4}("TA"), 10)
20nt DNA Sequence:
TATATATATATATATATATA
julia> LongDNA{4}("TA") ^ 10
20nt DNA Sequence:
TATATATATATATATATATA
Sequence views (LongSubSeq
s) are special, in that they do not own their own data, and must be constructed from a LongSequence
or another LongSubSeq
:
julia> seq = LongDNA{4}("TACGGACATTA")
11nt DNA Sequence:
TACGGACATTA
julia> seqview = LongSubSeq(seq, 3:7)
5nt DNA Sequence:
CGGAC
julia> seqview2 = @view seq[1:3]
3nt DNA Sequence:
TAC
julia> typeof(seqview) == typeof(seqview2) && typeof(seqview) <: LongSubSeq
true
Conversion of sequence types
You can convert between sequence types, if the sequences are compatible - that is, if the source sequence does not contain symbols that are un-encodable by the destination type.
julia> dna = dna"TTACGTAGACCG"
12nt DNA Sequence:
TTACGTAGACCG
julia> dna2 = convert(LongDNA{2}, dna)
12nt DNA Sequence:
TTACGTAGACCG
DNA/RNA are special in that they can be converted to each other, despite containing distinct symbols. When doing so, DNA_T
is converted to RNA_U
and vice versa.
julia> convert(LongRNA{2}, dna"TAGCTAGG")
8nt RNA Sequence:
UAGCUAGG
String literals
BioSequences provides several string literal macros for creating sequences.
When you use literals you may mix the case of characters.
Long sequence literals
julia> dna"TACGTANNATC"
11nt DNA Sequence:
TACGTANNATC
julia> rna"AUUUGNCCANU"
11nt RNA Sequence:
AUUUGNCCANU
julia> aa"ARNDCQEGHILKMFPSTWYVX"
21aa Amino Acid Sequence:
ARNDCQEGHILKMFPSTWYVX
However, it should be noted that by default these sequence literals allocate the LongSequence
object before the code containing the sequence literal is run. This means there may be occasions where your program does not behave as you first expect. For example consider the following code:
julia> function foo()
s = dna"CTT"
push!(s, DNA_A)
end
foo (generic function with 1 method)
You might expect that every time you call foo
, that a DNA sequence CTTA
would be returned. You might expect that this is because every time foo
is called, a new DNA sequence variable CTT
is created, and the A
nucleotide is pushed to it, and the result, CTTA
is returned. In other words you might expect the following output:
julia> foo()
4nt DNA Sequence:
CTTA
julia> foo()
4nt DNA Sequence:
CTTA
julia> foo()
4nt DNA Sequence:
CTTA
However, this is not what happens, instead the following happens:
julia> foo()
4nt DNA Sequence:
CTTA
julia> foo()
5nt DNA Sequence:
CTTAA
julia> foo()
6nt DNA Sequence:
CTTAAA
The reason for this is because the sequence literal is allocated only once before the first time the function foo
is called and run. Therefore, s
in foo
is always a reference to that one sequence that was allocated. So one sequence is created before foo
is called, and then it is pushed to every time foo
is called. Thus, that one allocated sequence grows with every call of foo
.
If you wanted foo
to create a new sequence each time it is called, then you can add a flag to the end of the sequence literal to dictate behaviour: A flag of 's' means 'static': the sequence will be allocated before code is run, as is the default behaviour described above. However providing 'd' flag changes the behaviour: 'd' means 'dynamic': the sequence will be allocated whilst the code is running, and not before. So to change foo
so as it creates a new sequence each time it is called, simply add the 'd' flag to the sequence literal:
julia> function foo()
s = dna"CTT"d # 'd' flag appended to the string literal.
push!(s, DNA_A)
end
foo (generic function with 1 method)
Now every time foo
is called, a new sequence CTT
is created, and an A
nucleotide is pushed to it:
julia> foo()
4nt DNA Sequence:
CTTA
julia> foo()
4nt DNA Sequence:
CTTA
julia> foo()
4nt DNA Sequence:
CTTA
So the take home message of sequence literals is this:
Be careful when you are using sequence literals inside of functions, and inside the bodies of things like for loops. And if you use them and are unsure, use the 's' and 'd' flags to ensure the behaviour you get is the behaviour you intend.
BioSequences.@dna_str
— Macro@dna_str(seq, flag="s") -> LongDNA{4}
Create a LongDNA{4}
sequence at parse time from string seq
. If flag
is "s"
('static', the default), the sequence is created at parse time, and inserted directly into the returned expression. A static string ought not to be mutated Alternatively, if flag
is "d"
(dynamic), a new sequence is parsed and created whenever the code where is macro is placed is run.
Examples
In the example below, the static sequence is created once, at parse time, NOT when the function f
is run. This means it is the same sequence that is pushed to repeatedly.
julia> f() = dna"TAG";
julia> string(push!(f(), DNA_A)) # NB: Mutates static string!
"TAGA"
julia> string(push!(f(), DNA_A))
"TAGAA"
julia> f() = dna"TAG"d; # dynamically make seq
julia> string(push!(f(), DNA_A))
"TAGA"
julia> string(push!(f(), DNA_A))
"TAGA"
BioSequences.@rna_str
— MacroThe LongRNA{4}
equivalent to @dna_str
Examples
julia> rna"UCGUGAUGC"
9nt RNA Sequence:
UCGUGAUGC
BioSequences.@aa_str
— MacroThe AminoAcidAlphabet
equivalent to @dna_str
Examples
julia> aa"PKLEQC"
6aa Amino Acid Sequence:
PKLEQC
Loose parsing
As of version 3.2.0, BioSequences.jl provide the bioseq
function, which can be used to build a LongSequence
from a string (or an AbstractVector{UInt8}
) without knowing the correct Alphabet
.
julia> bioseq("ATGTGCTGA")
9nt DNA Sequence:
ATGTGCTGA
The function will prioritise 2-bit alphabets over 4-bit alphabets, and prefer smaller alphabets (like DNAAlphabet{4}
) over larger (like AminoAcidAlphabet
). If the input cannot be encoded by any of the built-in alphabets, an error is thrown:
julia> bioseq("0!(CC!;#&&%")
ERROR: cannot encode 0x30 in AminoAcidAlphabet
[...]
Note that this function is only intended to be used for interactive, ephemeral work. The function is necessarily type unstable, and the precise returned alphabet for a given input is a heuristic which is subject to change.
BioSequences.bioseq
— Functionbioseq(s::Union{AbstractString, AbstractVector{UInt8}}) -> LongSequence
Parse s
into a LongSequence
with an appropriate Alphabet
, or throw an exception if no alphabet matches. See guess_alphabet
for the available alphabets and the alphabet priority.
The functions bioseq
and guess_alphabet
are intended for use in interactive sessions, and are not suitable for use in packages or non-ephemeral work. They are type unstable, and their heuristics are subject to change in minor versions.
Examples
julia> bioseq("QMKLPEEFW")
9aa Amino Acid Sequence:
QMKLPEEFW
julia> bioseq("UAUGCUGUAGG")
11nt RNA Sequence:
UAUGCUGUAGG
julia> bioseq("PKMW#3>>0;kL")
ERROR: cannot encode 0x23 in AminoAcidAlphabet
[...]
BioSequences.guess_alphabet
— Functionguess_alphabet(s::Union{AbstractString, AbstractVector{UInt8}}) -> Union{Integer, Alphabet}
Pick an Alphabet
that can encode input s
. If no Alphabet
can, return the index of the first byte of the input which is not encodable in any alphabet. This function only knows about the alphabets listed below. If multiple alphabets are possible, pick the first from the order below (i.e. DNAAlphabet{2}()
if possible, otherwise RNAAlphabet{2}()
etc).
DNAAlphabet{2}()
RNAAlphabet{2}()
DNAAlphabet{4}()
RNAAlphabet{4}()
AminoAcidAlphabet()
The functions bioseq
and guess_alphabet
are intended for use in interactive sessions, and are not suitable for use in packages or non-ephemeral work. They are type unstable, and their heuristics are subject to change in minor versions.
Examples
julia> guess_alphabet("AGGCA")
DNAAlphabet{2}()
julia> guess_alphabet("WKLQSTV")
AminoAcidAlphabet()
julia> guess_alphabet("QAWT+!")
5
julia> guess_alphabet("UAGCSKMU")
RNAAlphabet{4}()
Comparison to other sequence types
Following Base standards, BioSequences do not compare equal to other containers even if they have the same elements. To e.g. compare a BioSequence with a vector of DNA, compare the elements themselves:
julia> seq = dna"GAGCTGA"; vec = collect(seq);
julia> seq == vec, isequal(seq, vec)
(false, false)
julia> length(seq) == length(vec) && all(i == j for (i, j) in zip(seq, vec))
true