CIGARStrings.jl reference

CIGARStrings.CIGARType
CIGAR <: AbstractCIGAR

A CIGAR string represents the sequence of insertions, matches and deletions that comprise a pairwise alignment. Construct a CIGAR from any object x where MemoryView(x) returns a MemoryView{UInt8}, i.e. any memory-backed bytearray, or string.

Use CIGARStrings.try_parse to attempt to parse a CIGAR string without throwing an exception if the data is invalid.

See also: CIGARElement

Extended help

CIGAR strings are sequences of CIGARElement, from the 5' to the 3' of the query (or N- to C-terminal for amino acids). CIGAR strings comprise the entire query, i.e. the sum of lengths of elements with the XMI=S operations equals the length of the query.

For example, the query AGCGTAGCACACC that aligns from query base 5 and ref base 1002, like this:

Q: 5    TAG--CACACC   13
R: 1002 TAGGACAC-CC 1011

Is summarized by the CIGAR 4S3M2D3M1I2M. The operations HX=PN are more rarely used, see CIGAROp for a description of the operations.

CIGARStrings.BAMCIGARType
BAMCIGAR <: AbstractCIGAR

A BAMCIGAR is an alternative representation of a CIGAR, stored compactly in 32-bit integers. Semantically, a BAMCIGAR behaves much similar to a CIGAR.

Construct a BAMCIGAR either from a CIGAR, or using encode! or encode_append!

Examples

julia> c = CIGAR("9S123M1=3I15M2H");

julia> b = BAMCIGAR(c);

julia> c == b
true

julia> CIGAR(b)
CIGAR("9S123M1=3I15M2H")
CIGARStrings.CIGARElementType
CIGARElement(op::CIGAROp, len::Integer)

Type representing a single element in a CIGAR string, consisting of a CIGAROp and a length. For example, in 15S198M1D15M, the four elements are 15S, 198M, 1D, and 15M. Access the operation and the length with the properties .op and .len. Note that currently, the largest supported length is 268435455. Operations cannot have length zero.

See also: CIGAR, CIGAROp

Examples

julia> e = CIGARElement(OP_X, 3)
CIGARElement(OP_X, 3)

julia> e.len
3

julia> e.op
OP_X
CIGARStrings.CIGAROpType
CIGAROp

1-byte primitive type representing a CIGAR operation. Module-level constants for each value of this type are exported, e.g. OP_M.

A CIGAROp is guaranteed to be a 1-byte primitive with the same bitwise representation as a UInt8 with the value given as N in the table below, e.g. reinterpret(UInt8, OP_I) === 0x01.

The Consumes entry below signifies if the operation advances the position in the query (Q), the reference (R) and the alignment (A). E.g. if the CIGARElement 9D begins at query, ref, aln positions (A, B, C), then the positions are (A, B+9, C+9) after.

NConsumesVariableCharDescription
0Q R AOP_MMAlignment match or mismatch
1Q AOP_IIInsertion relative to the reference
2R AOP_DDDeletion relative to the reference
3ROP_NNReference skipped from the alignment (usually an intron)
4QOP_SSSoft clip (consumes query)
5OP_HHHard clip (does not consume query)
6OP_PPPadding, position not present in query or reference
7Q R AOP_Eq=Alignment match, not mismatch
8Q R AOP_XXAlignment mismatch

See also: CIGARElement

Extended help

  • M means a match or a mismatch. By default, most programs will emit M instead of X or =, since the important part of the CIGAR is where the insertions and deletions are placed. To determine which bases in an M are matches and mismatches, the two sequences can be compared, base-wise.
  • N means that the region spans an intron, which means the query sequence is deleted, but not due to an actual deletion (which would be a D operation). It can also be used for other uses where the reference bases are missing for another reason than a deletion, if such a use case is found.
  • S and H both represent that ends of query does not align. They differ only in whether the clipped bases, i.e. the ends of the query sequence not part of the alignment, are present in the query sequence. For soft clip, the sequence is present in the query, but not for hard clip. Soft clip is preferred, and hard clip is only used in rare cases where removing part of the query sequence can lead to memory savings.
  • P is only used for a multiple sequence alignment. If a third sequence contains an insertion relative to both the query and the reference, the query has a P at this position, indicating it's present in neither the query nor reference.
  • Most alignments only contain the operations M, I, D, S.
CIGARStrings.TranslationType
Translation(kind::TranslationKind, pos::Int)

The result of translating from a position in the coordinate system of the query / reference / alignment to a position in one of the others. This type contains two documented properties: kind::TranslationKind and pos::Int.

  • If the resulting position is outside the target coordinate, .kind == outside and pos == 0

  • If the position maps to a non-gap symbol in the other coordinate system, .kind == pos, and the position is the pos'd symbol in the target coordinate system.

  • If the position maps to a gap, pos is the position of the symbol before the gap, and kind == gap. When translating to the aln coordinate system, the kind is never gap.

See also: CIGARStrings.TranslationKind

Examples:

julia> c = CIGAR("2M3D2M2I1M");

julia> for i in 0:8
           refpos = query_to_ref(c, i)
           println(refpos.pos, ' ', refpos.kind)
       end
0 outside
1 pos
2 pos
6 pos
7 pos
7 gap
7 gap
8 pos
0 outside
CIGARStrings.TranslationKindType
TranslationKind

This enum has values outside, pos and gap. It represents the result of translating a position between query, reference and alignment.

  • If outside, the input position translates to a position outside the alignment

  • If pos, the input position corresponds to a non-gap symbol in the alignment

  • If gap, the input position maps to a gap.

See also: Translation

CIGARStrings.CIGARErrorType
CIGARError

Exception kind thrown by the package CIGARStrings. CIGARErrors contain two properties: .kind, returning a CIGARErrorType, and .index, returning an Int, pointing to the approximate byte index where the exception was encountered.

The kind and index are stable API, in the sense that e.g. an integer overflow at position 55 will always be represented by a CIGARError(55, CIGARStrings.Errors.IntegerOverflow). However, the same invalid CIGAR string may produce multiple different errors, and which error is produced in that case is NOT stable API, because that depends on specifics parsing internals.

julia> ce = CIGARStrings.try_parse(CIGAR, "15M9");

julia> ce.index
4

julia> ce.kind
Truncated::CIGARErrorType = 0x05
CIGARStrings.Errors.CIGARErrorTypeType
CIGARErrorType

Single-byte enum representing the different kind of errors returned or thrown by the package CIGARStrings. This currently implemented list of error kinds is not exhaustive, in that more could be added in minor versions of this package.

See also: CIGARError

CIGARStrings.try_parseFunction
try_parse(::Type{CIGAR}, x)::Union{CIGAR, CIGARError}

Cast x to a MemoryView{UInt8}, and try parsing a CIGAR from it. If the parsing is unsuccessful, return a CIGARError

Examples

julia> c = CIGARStrings.try_parse(CIGAR, "2S1M9I");

julia> c isa CIGAR # success
true

julia> c = CIGARStrings.try_parse(CIGAR, "1S7H9M1S");

julia> c.kind
InvalidHardClip::CIGARErrorType = 0x03
CIGARStrings.unsafe_switch_memoryFunction
unsafe_switch_memory(cigar::T, mem::ImmutableMemoryView{UInt8})::T where {T <: AbstractCIGAR}

Create a new instance of typeof(cigar) equal to cigar, but using the new memory mem which must be equal to the existing memory backing cigar. This operation does not do any validation.

This function is unsafe, because it assumes that mem == MemoryView(cigar). If this assumption is violated, any subsequent operation on the resulting AbstractCIGAR may cause undefined behaviour.

Examples

julia> mem = MemoryView("5S12M1X8M10S");

julia> cigar_1 = CIGAR(mem);

julia> cigar_2 = unsafe_switch_memory(cigar_1, copy(mem));

julia> cigar_1 == cigar_2
true

julia> MemoryView(cigar_1) === MemoryView(cigar_2)
false
CIGARStrings.encode!Function
encode!(
        mem::MutableMemoryView{UInt8},
        ::Type{T <: AbstractCIGAR},
        cigar::AbstractCIGAR
    )::T

Encode cigar as type T into the beginning of mem, and return the result as a new T backed by mem.

Throw a LightBoundsError if mem is too short to contain the whole encoding.

Warning

Since the first bytes of mem are used to back the newly created cigar, mutating mem after this function is being called will invalidate the created cigar.

Examples

julia> c = CIGAR("3S5M1D19X2I6M3H");

julia> mem = MemoryView(zeros(UInt8, 60));

julia> bc = encode!(mem, BAMCIGAR, c)
BAMCIGAR(CIGAR("3S5M1D19X2I6M3H"))

julia> n_bytes = length(MemoryView(bc));

julia> ImmutableMemoryView(mem)[1:n_bytes] === MemoryView(bc)
true
CIGARStrings.encode_append!Function
encode_append!(
        v::Vector{UInt8},
        ::Type{T <: AbstractCIGAR},
        cigar::AbstractCIGAR
    )::T

Encode cigar as type T by appending its representation to v, and return the result as a new T backed by v's memory.

Warning

Since the returned cigar will use v as backing memory, mutating v after this function has been called will invalidate the created cigar.

Examples

julia> c = CIGAR("3S5M1D19X2I6M3H");

julia> v = UInt8[1, 2, 3, 4];

julia> bc = encode_append!(v, BAMCIGAR, c)
BAMCIGAR(CIGAR("3S5M1D19X2I6M3H"))

julia> v[1:4] == 1:4 # unchanged
true

julia> MemoryView(bc) === ImmutableMemoryView(v)[5:end]
true
CIGARStrings.OP_NConstant

'N': Region skipped from the reference (usually an intron)

CIGARStrings.PositionMapperType
PositionMapper{Params...}

Public struct returned by pos_to_pos(..., pos) when !isa(pos, Integer). This type may be referred to in stable code, but its API is defined by pos_to_pos

See also: pos_to_pos

CIGARStrings.queryFunction

Coordinate system representing the 1-based index in the query sequence

CIGARStrings.refFunction

Coordinate system representing the 1-based index in the reference sequence

CIGARStrings.alnFunction

Coordinate system representing the 1-based index in the alignment (i.e. query/ref including gaps)

CIGARStrings.ref_lengthFunction
ref_length(::AbstractCIGAR)::Int

Get the number of biosymbols in the reference of the CIGAR. This is the same as the lengths of all CIGARElements of type M, D, N, = and X.

See also: query_length, aln_length

Example

julia> ref_length(CIGAR("1S5M2D6M2I5M"))
18
CIGARStrings.query_lengthFunction
query_length(::AbstractCIGAR)::Int

Get the number of biosymbols in the query of the CIGAR. This is the same as the lengths of all CIGARElements of type M, I, S, = and X.

See also: ref_length, aln_length

Example

julia> query_length(CIGAR("1S5M2D6M2I5M"))
19
CIGARStrings.aln_lengthFunction
aln_length(::AbstractCIGAR)::Int

Get the number of biosymbols spanned by the alignment of the CIGAR. Clips, padding and skips are not part of the alignment, but still part of the CIGAR. Therefore, the alignment length is the same as the lengths of all CIGARElements of type M, I, D, = and X.

See also: query_length, ref_length

Example

julia> aln_length(CIGAR("1S5M2D6M2I5M"))
20
CIGARStrings.aln_identityFunction
aln_identity(::AbstractCIGAR, mismatches::Int)::Float64

Compute the alignment identity of the AbstractCIGAR, computed as the number of matches (not mismatches) divided by alignment length. Since the CIGAR itself may not provide information about the precise number of mismatches, the amount of mismatches is an argument. The result is always in [0.0, 1.0].

Example

julia> aln_identity(CIGAR("3M1D3M1I2M"), 2)
0.6
CIGARStrings.count_matchesFunction
count_matches(::AbstractCIGAR, mismatches::Integer)::Int

Count the number of matches in the AbstractCIGAR, given the input number of mismatches. The mismatches argument is necessary, because OP_M is ambiguous, and can encode any combination of matches and mismatches.

If mismatches is not in X:(M+X) where M is the number of symbols marked OP_M, and X the number of OP_X, throw a DomainError.

Examples

julia> c = CIGAR("3S9M2I15=6X7=10M3S"); # 19M, 22=, 6X

julia> count_matches(c, 8)
39

julia> count_matches(c, 25)
22

julia> count_matches(c, 5) # lower bound is 6
ERROR: DomainError
[...]

julia> count_matches(c, 26) # upper bound is 25
ERROR: DomainError
[...]
CIGARStrings.query_to_refFunction
query_to_ref(x::AbstractCIGAR, pos::Integer)::Translation

Get the 1-based reference position aligning to query position pos. See Translation for more details.

Examples

julia> c = CIGAR("3M2D1M4I2M");

julia> query_to_ref(c, 4)
Translation(pos, 6)
CIGARStrings.query_to_alnFunction
query_to_aln(x::AbstractCIGAR, pos::Integer)::Translation

Get the 1-based alignment position aligning to query position pos. See Translation for more details.

Examples

julia> c = CIGAR("3M2D1M4I2M");

julia> query_to_aln(c, 8)
Translation(pos, 10)
CIGARStrings.ref_to_queryFunction
ref_to_query(x::AbstractCIGAR, pos::Integer)::Translation

Get the 1-based query position aligning to reference position pos. See Translation for more details.

Examples

julia> c = CIGAR("3M2D1M4I2M");

julia> ref_to_query(c, 7)
Translation(pos, 9)
CIGARStrings.ref_to_alnFunction
ref_to_aln(x::AbstractCIGAR, pos::Integer)::Translation

Get the 1-based alignment position aligning to reference position pos. See Translation for more details.

Examples

julia> c = CIGAR("3M2D1M4I2M");

julia> ref_to_aln(c, 7)
Translation(pos, 11)
CIGARStrings.aln_to_queryFunction
aln_to_query(x::AbstractCIGAR, pos::Integer)::Translation

Get the 1-based query position aligning to alignment position pos. See Translation for more details.

Examples

julia> c = CIGAR("3M2D1M4I2M");

julia> aln_to_query(c, 10)
Translation(pos, 8)
CIGARStrings.aln_to_refFunction
aln_to_ref(x::AbstractCIGAR, pos::Integer)::Translation

Get the 1-based reference position aligning to alignment position pos. See Translation for more details.

Examples

julia> c = CIGAR("3M2D1M4I2M");

julia> aln_to_ref(c, 9)
Translation(gap, 6)
CIGARStrings.normalizeFunction
normalize(cigar::T)::T where {T <: AbstractCIGAR}

Create a new AbstractCIGAR, where the same alignment as cigar is written in its canonical form.

The normalization process makes the following changes:

  • OP_Eq and OP_X are converted to OP_M

  • OP_P and OP_H operations are removed

  • Consecutive elements with the same operations are merged

The resulting cigar therefore only contain the operations: M, I, D, S, N.

If is_compatible(a, b), and a and b are of the same type, then normalize(a) == normalize(b) is guaranteed, but not the other way around.

Since a normalized cigar represents the same alignment, normalization will not affect position translation e.g. using pos_to_pos.

If the merging of two consecutive elements would result in a CIGARElement with a length of more than 268435455, throw a CIGARError with kind Errors.IntegerOverflow.

See also: normalize!, unsafe_normalize, is_compatible

Examples

julia> normalize(CIGAR("2H2S3=1X2=4P3=1=1I2X4H"))
CIGAR("2S10M1I2M")

julia> normalize(CIGAR("5M1D8M1I7M"))
CIGAR("5M1D8M1I7M")
CIGARStrings.normalize!Function
normalize!(cigar::T, mem::MutableMemoryView{UInt8})::T where {T <: AbstractCIGAR}

Same as normalize, but uses allocation of mem instead of allocating a new array.

Throws a LightBoundsError if mem cannot hold the normalized cigar. A cigar after normalization is guaranteed to use at most as much memory as cigar before normalization, so if length(mem) ≥ length(MemoryView(cigar)), a bounds error cannot happen.

Warning

If mem aliases MemoryView(cigar), this function may return garbage results. Use unsafe_normalize to re-use the memory of a cigar for the normalization result.

See also: normalize, unsafe_normalize, is_compatible

Examples

julia> c = CIGAR("1M2M1=2=3M");

julia> mem = MemoryView([0x00, 0x00]);

julia> c2 = normalize!(c, mem)
CIGAR("9M")

julia> MemoryView(c2) === ImmutableMemoryView(mem)
true

julia> c2 = normalize!(c, MemoryView([0x00]))
ERROR: LightBoundsErrors.LightBoundsError: out-of-bounds indexing: `collection[2]`, where:
[...]
CIGARStrings.unsafe_normalizeFunction
unsafe_normalize(cigar::T)::T where {T <: AbstractCIGAR}

Create a new normalized cigar, re-using the memory of the input cigar. The normalization process is the same as normalize.

Warning

Calling unsafe_normalize on a cigar will mutate its memory, even though a cigar is constructed from an ImmutableMemoryView. Callers must ensure that there are no other references to the memory of cigar which assumes that the memory is immutable. For example, calling this function on a CIGAR constructed from a String risks undefined behaviour, since observing Strings being mutated is undefined behaviour in Julia.

See also: normalize, normalize!, is_compatible

Examples

julia> c = CIGAR(collect(b"2H1X1=2H"));

julia> mem = MemoryView(c);

julia> c2 = unsafe_normalize(c)
CIGAR("2S2M2S")

julia> MemoryView(c2).ref === mem.ref
true