Custom BioSequences types
If you're a developing your own Bioinformatics package or method, you may find that the reference implementation of concrete LongSequence
types provided in this package are not optimal for your purposes.
This page describes the interfaces for BioSequences' core types for developers or other packages implementing their own sequence types or extending BioSequences functionality.
Implementing custom Alphabets
Recall the required methods that define the Alphabet
interface.
To create an example custom alphabet, we need to create a singleton type, that implements a few methods in order to conform to the interface as described in the Alphabet
documentation.
Let's do that for a restricted Amino Acid alphabet. We can test that it conforms to the interface with the BioSequences.has_interface
function.
julia> struct ReducedAAAlphabet <: Alphabet end
julia> Base.eltype(::Type{ReducedAAAlphabet}) = AminoAcid
julia> BioSequences.BitsPerSymbol(::ReducedAAAlphabet) = BioSequences.BitsPerSymbol{4}()
julia> function BioSequences.symbols(::ReducedAAAlphabet)
(AA_L, AA_C, AA_A, AA_G, AA_S, AA_T, AA_P, AA_F,
AA_W, AA_E, AA_D, AA_N, AA_Q, AA_K, AA_H, AA_M)
end
julia> const (ENC_LUT, DEC_LUT) = let
enc_lut = fill(0xff, length(alphabet(AminoAcid)))
dec_lut = fill(AA_A, length(symbols(ReducedAAAlphabet())))
for (i, aa) in enumerate(symbols(ReducedAAAlphabet()))
enc_lut[reinterpret(UInt8, aa) + 0x01] = i - 1
dec_lut[i] = aa
end
(Tuple(enc_lut), Tuple(dec_lut))
end
((0x02, 0xff, 0x0b, 0x0a, 0x01, 0x0c, 0x09, 0x03, 0x0e, 0xff, 0x00, 0x0d, 0x0f, 0x07, 0x06, 0x04, 0x05, 0x08, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff), (AA_L, AA_C, AA_A, AA_G, AA_S, AA_T, AA_P, AA_F, AA_W, AA_E, AA_D, AA_N, AA_Q, AA_K, AA_H, AA_M))
julia> function BioSequences.encode(::ReducedAAAlphabet, aa::AminoAcid)
i = reinterpret(UInt8, aa) + 0x01
(i ≥ length(ENC_LUT) || @inbounds ENC_LUT[i] === 0xff) && throw(DomainError(aa))
(@inbounds ENC_LUT[i]) % UInt
end
julia> function BioSequences.decode(::ReducedAAAlphabet, x::UInt)
x ≥ length(DEC_LUT) && throw(DomainError(aa))
@inbounds DEC_LUT[x + UInt(1)]
end
julia> BioSequences.has_interface(Alphabet, ReducedAAAlphabet())
true
Implementing custom BioSequences
Recall the required methods that define the BioSequence
interface.
To create an example custom alphabet, we need to create a singleton type, that implements a few methods in order to conform to the interface as described in the BioSequence
documentation.
Let's do that for a custom sequence type that is optimised to represent a small sequence: A Codon. We can test that it conforms to the interface with the BioSequences.has_interface
function.
julia> struct Codon <: BioSequence{RNAAlphabet{2}}
x::UInt8
end
julia> function Codon(iterable)
length(iterable) == 3 || error("Must have length 3")
x = zero(UInt)
for (i, nt) in enumerate(iterable)
x |= BioSequences.encode(Alphabet(Codon), convert(RNA, nt)) << (6-2i)
end
Codon(x % UInt8)
end
Codon
julia> Base.length(::Codon) = 3
julia> BioSequences.encoded_data_eltype(::Type{Codon}) = UInt
julia> function BioSequences.extract_encoded_element(x::Codon, i::Int)
((x.x >>> (6-2i)) & 3) % UInt
end
julia> Base.copy(seq::Codon) = Codon(seq.x)
julia> BioSequences.has_interface(BioSequence, Codon, [RNA_C, RNA_U, RNA_A], false)
true
Interface checking functions
BioSequences.has_interface
— Functionfunction has_interface(::Type{Alphabet}, A::Alphabet)
Returns whether A
conforms to the Alphabet
interface.
has_interface(::Type{BioSequence}, ::T, syms::Vector, mutable::Bool, compat::Bool=true)
Check if type T
conforms to the BioSequence
interface. A T
is constructed from the vector of element types syms
which must not be empty. If the mutable
flag is set, also check the mutable interface. If the compat
flag is set, check for compatibility with existing alphabets.