Reference sequences
DNASequence
(alias of BioSequence{DNAAlphabet{4}}
) is a flexible data structure but always consumes 4 bits per base, which will waste a large part of the memory space when storing reference genome sequences. In such a case, ReferenceSequence
is helpful because it compresses positions of 'N' symbols so that long DNA sequences are stored with almost 2 bits per base. An important limitation is that the ReferenceSequence
type is immutable due to the compression. Other sequence-like operations are supported:
julia> seq = ReferenceSequence(dna"NNCGTATTTTCN")
12nt Reference Sequence:
NNCGTATTTTCN
julia> seq[1]
DNA_N
julia> seq[5]
DNA_T
julia> seq[2:6]
5nt Reference Sequence:
NCGTA
julia> ReferenceSequence(dna"ATGM") # DNA_M is not accepted
ERROR: ArgumentError: invalid symbol M ∉ {A,C,G,T,N} at 4
in convert at /Users/kenta/.julia/v0.4/Bio/src/seq/refseq.jl:58
in call at essentials.jl:56