Recipes

This page provides tested example code to solve various common problems using BioSequences.

One-hot encoding biosequences

The types DNA, RNA and AminoAcid expose a binary representation through the exported function BioSymbols.compatbits, which returns a one-hot style encoding of the symbol as an unsigned integer:

julia> using BioSymbols

julia> compatbits(DNA_W)
0x09

julia> compatbits(AA_J)
0x00000600

Each set bit in the encoding corresponds to a compatible unambiguous symbol. For example, for RNA, the four lower bits encode A, C, G, and U, in order. Hence, the symbol D, which is short for A, G or U, is encoded as 0x01 | 0x04 | 0x08 == 0x0d:

julia> compatbits(RNA_D)
0x0d

julia> compatbits(RNA_A) | compatbits(RNA_G) | compatbits(RNA_U)
0x0d
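To make the bit layout concrete, the following minimal sketch goes the other way and lists the unambiguous RNA letters whose bits are set in a 4-bit mask. It uses only Base Julia. The helper name compatible_letters is hypothetical and not part of BioSymbols, and the letter order A, C, G, U is taken from the description above.

```julia
# Hypothetical helper (not part of BioSymbols): list the unambiguous RNA
# letters encoded in the four lower bits of a mask, assuming bit order A, C, G, U.
function compatible_letters(bits::UInt8)
    bits <= 0x0f || throw(ArgumentError("expected a 4-bit mask"))
    letters = ('A', 'C', 'G', 'U')
    out = Char[]
    while !iszero(bits)
        # trailing_zeros gives the zero-based position of the lowest set bit
        push!(out, letters[trailing_zeros(bits) + 1])
        bits &= bits - one(bits) # clear lowest set bit
    end
    return out
end

letters_D = compatible_letters(0x0d) # the mask shown for RNA_D above
```

Here letters_D holds 'A', 'G' and 'U', matching the definition of D.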

Using this, we can construct a function to one-hot encode sequences. This example handles nucleic acid sequences, producing one row per unambiguous nucleotide and one column per sequence position:

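using BioSequences # provides NucSeq and the dna"..." literal

function one_hot(seq::NucSeq)
    # Rows are A, C, G, T/U; columns are sequence positions
    M = falses(4, length(seq))
    for (i, symbol) in enumerate(seq)
        bits = compatbits(symbol)
        # Visit each set bit, from lowest to highest
        while !iszero(bits)
            M[trailing_zeros(bits) + 1, i] = true
            bits &= bits - one(bits) # clear lowest set bit
        end
    end
    return M
end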

one_hot(dna"TGNTKCTW-T")

# output

4×10 BitMatrix:
 0  0  1  0  0  0  0  1  0  0
 0  0  1  0  0  1  0  0  0  0
 0  1  1  0  1  0  0  0  0  0
 1  0  1  1  1  0  1  1  0  1
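
As a sanity check on the matrix above, the encoding can be reversed. The sketch below uses only Base Julia and rebuilds the IUPAC characters from a 4-row Boolean matrix. The helper decode_one_hot and its lookup string are illustrative assumptions, not part of BioSequences. The lookup string follows the standard IUPAC ambiguity codes, with bits ordered A, C, G, T as described earlier.

```julia
# Hypothetical helper (not part of BioSequences): turn a 4×n one-hot matrix
# with rows A, C, G, T back into a string of IUPAC nucleotide codes.
function decode_one_hot(M::AbstractMatrix{Bool})
    size(M, 1) == 4 || throw(ArgumentError("expected 4 rows"))
    # Character at index mask + 1, where mask = A*1 + C*2 + G*4 + T*8
    iupac = "-ACMGRSVTWYHKDBN"
    chars = Char[]
    for j in axes(M, 2)
        mask = 0
        for i in 1:4
            if M[i, j]
                mask |= 1 << (i - 1)
            end
        end
        push!(chars, iupac[mask + 1])
    end
    return String(chars)
end

# The matrix printed above, written out by hand
M = Bool[0 0 1 0 0 0 0 1 0 0
         0 0 1 0 0 1 0 0 0 0
         0 1 1 0 1 0 0 0 0 0
         1 0 1 1 1 0 1 1 0 1]

decoded = decode_one_hot(M)
```

Under these assumptions decoded is "TGNTKCTW-T", the input sequence. The all-zero ninth column maps back to the gap symbol.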