CIGARStrings.jl

CIGARStrings.jl provides functionality for parsing and working with Concise Idiosyncratic Gapped Alignment Report (CIGAR) strings. CIGARs were popularized by the SAM format, and are a compact run-length encoding notation used to represent pairwise alignments. They can be found in the SAM, BAM, PAF, and GFA formats.

For example, the following pairwise alignment of a query to a reference:

    Q: TAGAT---TAGCTAC
       ||||    ||  | |
    R: TAGAACCATA--TGC

can be represented by the CIGAR 5M3D2M2I3M, representing:

  1. 5 matches/mismatches

  2. Then, 3 deletions

  3. Then, 2 matches/mismatches

  4. Then, 2 insertions

  5. Finally, 3 matches/mismatches.

A CIGAR string is always written in terms of the query, not the reference.

Individual alignment operations

One run of identical alignment operations, e.g. "5 matches/mismatches," is represented by a single CIGARElement. Conceptually, a CIGARElement is an alignment operation (represented by a CIGAROp) and a length:

CIGARStrings.CIGARElementType
CIGARElement(op::CIGAROp, len::Integer)

Type representing a single element in a CIGAR string, consisting of a CIGAROp and a length. For example, in 15S198M1D15M, the four elements are 15S, 198M, 1D, and 15M. Access the operation and the length with the properties .op and .len. Note that currently, the largest supported length is 268435455. Operations cannot have length zero.

See also: CIGAR, CIGAROp

Examples

julia> e = CIGARElement(OP_X, 3)
CIGARElement(OP_X, 3)

julia> e.len
3

julia> e.op
OP_X
CIGARStrings.CIGAROpType
CIGAROp

1-byte primitive type representing a CIGAR operation. Module-level constants for each value of this type are exported, e.g. OP_M.

A CIGAROp is guaranteed to be a 1-byte primitive with the same bitwise representation as a UInt8 with the value given as N in the table below, e.g. reinterpret(UInt8, OP_I) === 0x01.

The Consumes entry below signifies if the operation advances the position in the query (Q), the reference (R) and the alignment (A). E.g. if the CIGARElement 9D begins at query, ref, aln positions (A, B, C), then the positions are (A, B+9, C+9) after.

NConsumesVariableCharDescription
0Q R AOP_MMAlignment match or mismatch
1Q AOP_IIInsertion relative to the reference
2R AOP_DDDeletion relative to the reference
3ROP_NNReference skipped from the alignment (usually an intron)
4QOP_SSSoft clip (consumes query)
5OP_HHHard clip (does not consume query)
6OP_PPPadding, position not present in query or reference
7Q R AOP_Eq=Alignment match, not mismatch
8Q R AOP_XXAlignment mismatch

See also: CIGARElement

Extended help

  • M means a match or a mismatch. By default, most programs will emit M instead of X or =, since the important part of the CIGAR is where the insertions and deletions are placed. To determine which bases in an M are matches and mismatches, the two sequences can be compared, base-wise.
  • N means that the region spans an intron, which means the query sequence is deleted, but not due to an actual deletion (which would be a D operation). It can also be used for other uses where the reference bases are missing for another reason than a deletion, if such a use case is found.
  • S and H both represent that ends of query does not align. They differ only in whether the clipped bases, i.e. the ends of the query sequence not part of the alignment, are present in the query sequence. For soft clip, the sequence is present in the query, but not for hard clip. Soft clip is preferred, and hard clip is only used in rare cases where removing part of the query sequence can lead to memory savings.
  • P is only used for a multiple sequence alignment. If a third sequence contains an insertion relative to both the query and the reference, the query has a P at this position, indicating it's present in neither the query nor reference.
  • Most alignments only contain the operations M, I, D, S.

CIGARs

A CIGAR string is represented by an AbstractCIGAR, which currently has two subtypes: CIGAR and BAMCIGAR. These types differ in their memory layout: The former stores the CIGAR as its ASCII representation (as used in the SAM format), and the latter stores it in a binary format (as used in the BAM format). Both types store their underlying data as an ImmutableMemoryView{UInt8}.

CIGARStrings.AbstractCIGARType
abstract type AbstractCIGAR

This abstract type is (not yet) a defined interface. Its concrete subtypes are CIGAR and BAMCIGAR.

The API for these two types is almost interchangeable, so examples below use CIGAR, since its plaintext representation makes examples easier to read. See BAMCIGAR section for a list of all differences between the two types.

CIGAR strings are validated upon construction.

julia> CIGAR("2M1D3M")
CIGAR("2M1D3M")

julia> CIGAR("1M1W1S")
ERROR: Error around byte 4: Invalid operation. Possible values are "MIDNSHP=X".
[...]

Since CIGAR strings occur in various bioinformatics file formats, it is expected that users of CIGARStrings.jl will construct CIGARs from a view into a buffer storing a chunk of the file.

This is zero-copy, and does not allocate on Julia 1.14 and later. For example:

julia> data = b"some format with CIGAR: 15M9D18M etc";

julia> buffer = collect(data);

julia> c = CIGAR(view(buffer, 25:32))
CIGAR("15M9D18M")
CIGARStrings.CIGARType
CIGAR <: AbstractCIGAR

A CIGAR string represents the sequence of insertions, matches and deletions that comprise a pairwise alignment. Construct a CIGAR from any object x where MemoryView(x) returns a MemoryView{UInt8}, i.e. any memory-backed bytearray, or string.

Use CIGARStrings.try_parse to attempt to parse a CIGAR string without throwing an exception if the data is invalid.

See also: CIGARElement

Extended help

CIGAR strings are sequences of CIGARElement, from the 5' to the 3' of the query (or N- to C-terminal for amino acids). CIGAR strings comprise the entire query, i.e. the sum of lengths of elements with the XMI=S operations equals the length of the query.

For example, the query AGCGTAGCACACC that aligns from query base 5 and ref base 1002, like this:

Q: 5    TAG--CACACC   13
R: 1002 TAGGACAC-CC 1011

Is summarized by the CIGAR 4S3M2D3M1I2M. The operations HX=PN are more rarely used, see CIGAROp for a description of the operations.

CIGARs are iterable, and return their CIGARElements in order:

julia> collect(CIGAR("2M1D3M"))
3-element Vector{CIGARElement}:
 CIGARElement(OP_M, 2)
 CIGARElement(OP_D, 1)
 CIGARElement(OP_M, 3)

They can be converted back to strings using string(::CIGAR), or printed into an IO, in which case their normal string representation is recovered:

julia> c = CIGAR("2M1D3M");

julia> string(c)
"2M1D3M"

julia> io = IOBuffer(); print(io, c); String(take!(io))
"2M1D3M"

The memory underlying the CIGAR types can be obtained with MemoryView(x) using the MemoryViews.jl package:

julia> c = CIGAR("19M");

julia> println(MemoryView(c))
UInt8[0x31, 0x39, 0x4d]

julia> println(MemoryView(BAMCIGAR(c)))
UInt8[0x30, 0x01, 0x00, 0x00]

Basic information about a CIGAR

The reference, query and alignment length can be obtained with the functions ref_length, query_length and aln_length.

In the alignment below, represented as 5M3D2M2I3M, the query length is 12, since there are 12 query symbols, the reference length is 13, and the alignment length is 15.

    Q: TAGAT---TAGCTAC
       ||||    ||  | |
    R: TAGAACCATA--TGC
julia> c = CIGAR("5M3D2M2I3M");

julia> ref_length(c)
13

julia> query_length(c)
12

julia> aln_length(c)
15

Since the CIGAR operation M (OP_M) is ambiguous about whether it represents matches, mismatches, or a combination of these, the function count_matches can be used to count the number of matches in a CIGAR given the number of mismatches.

Mismatch counts are typically output by mappers, making this information readily accessible.

Alignment identity (number of matches, excluding mismatches, divided by alignment length) can be obtained with aln_identity. Like count_matches, this takes the number of mismatches as an argument.

Comparing CIGARs

When comparing CIGARs using ==, Julia checks whether the CIGARs are literally identical, in the sense that they are composed of the same bytes:

julia> c1 = CIGAR("10M");

julia> c2 = CIGAR("4=1X5=");

julia> c3 = CIGAR("10M");

julia> c1 == c2
false

julia> c1 == c3
true

However, in the above example, since the CIGAR operation M signifies a match or a mismatch, all three CIGARs are indeed compatible, since 10M is also a valid CIGAR annotation for the same alignment as 4=1X5=.

This notion of compatibility can be tested with is_compatible:

CIGARStrings.is_compatibleFunction
is_compatible(a::AbstractCIGAR, b::AbstractCIGAR)::Bool

Check if a and b may refer to the same alignment.

Two distinct CIGARs may refer to the same alignment, because there are multiple ways to annotate the same alignment. In particular, the following rules are used when determining if distinct CIGAR operations are equivalent:

  • Two consecutive of the same operations may be collapsed, ex: 1M1M and 2M

  • An OP_M can encompass both OP_X and OP_Eq, ex: 3M and 1X1M1=

  • OP_P and OP_H have no semantic meaning and is skipped: 1D1P2D1M3P and 3D1M

See also: normalize

Examples:

julia> is_compatible(CIGAR("1="), CIGAR("1X"))
false

julia> is_compatible(CIGAR("1D1D1D"), CIGAR("3D"))
true

julia> is_compatible(CIGAR("1D2M3I"), CIGAR("1D1X1M3I"))
true

Position translation

Sometimes it may be necessary to answer questions of the form "which reference position does query position 8 align to?".

As an example, consider the alignment below. The query position (QP), reference position (RP) and alignment position (AP) are also written in this alignment.

    QP:12345   6789012

    Q: TAGAT---TAGCTAC
       ||||    ||  | |
    R: TAGAACCATA--TGC

    RP:1234567890  123
    AP:123456789012345

We can see that query position 6 aligns to reference position 9, which is also alignment position 9.

These position translations can be obtained using the function pos_to_pos, specifying the source and destination coordinate systems query, ref or aln. When passed an integer, this function returns a Translation object with two properties: .pos and .kind.

When a position translation has a straightforward answer, the .kind property is CIGARStrings.pos, and the .pos field is the corresponding position:

julia> c = CIGAR("4M3D2M2I3M"); # alignment above

julia> pos_to_pos(query, ref, c, 6)
Translation(pos, 9)

julia> pos_to_pos(aln, query, c, 9)
Translation(pos, 6)

Note that these operations run in linear time, as they scan the CIGAR string from the beginning.

To efficiently query multiple translations in the same scan of the CIGAR string, you can pass a sorted (ascending) iterator of integers. In this case, pos_to_pos returns a lazy iterator of Pair{Int, Translation}, representing source_position => mapped_translation:

julia> c = CIGAR("4M3D2M2I3M"); # alignment above

julia> it = pos_to_pos(query, ref, c, [1, 5, 6, 11]);

julia> length(it)
4

julia> collect(it)
4-element Vector{Pair{Int64, Translation}}:
  1 => Translation(pos, 1)
  5 => Translation(pos, 8)
  6 => Translation(pos, 9)
 11 => Translation(pos, 12)
CIGARStrings.pos_to_posFunction
pos_to_pos(from::Coordinate, to::Coordinate, cigar::AbstractCIGAR, pos::Integer)::Translation

Similar to the generic pos_to_pos, but returns the resulting integer immediately instead of returning a lazy iterator.

pos_to_pos(
    from::Coordinate,
    to::Coordinate,
    cigar::AbstractCIGAR,
    pos
)::PositionMapper

Map positions from one coordinate system in the alignment given by cigar to another.

Given pos, an iterable of sorted (ascending) integers in the coordinate system of from, returns a PositionMapper - an iterable of Pair{Int, Translation} mapping each input integer, in order, to the corresponding coordinate system in to, given as a Translation.

The coordinates may be query, ref or aln:

  • query represents the 1-based index into the query sequence

  • ref represents the 1-based index into the reference sequence

  • aln is the index of the alignment itself, i.e. equivalent to the sequence of either the query or the ref when gaps according to indel operations of c.

For example, given the query positions p = [4, 9, 11] and c::AbstractCIGAR, the corresponding positions in the reference can be obtained by [i.second.pos for i in pos_to_pos(query, ref, c, p)]

See also: Translation, ref_to_query, aln_to_ref etc.

julia> iter = pos_to_pos(query, ref, CIGAR("5M2D10M"), [0, 4, 9, 16]);

julia> iter isa CIGARStrings.PositionMapper
true

julia> collect(iter)
4-element Vector{Pair{Int64, Translation}}:
  0 => Translation(outside, 0)
  4 => Translation(pos, 4)
  9 => Translation(pos, 11)
 16 => Translation(outside, 0)
CIGARStrings.TranslationType
Translation(kind::TranslationKind, pos::Int)

The result of translating from a position in the coordinate system of the query / reference / alignment to a position in one of the others. This type contains two documented properties: kind::TranslationKind and pos::Int.

  • If the resulting position is outside the target coordinate, .kind == outside and pos == 0

  • If the position maps to a non-gap symbol in the other coordinate system, .kind == pos, and the position is the pos'd symbol in the target coordinate system.

  • If the position maps to a gap, pos is the position of the symbol before the gap, and kind == gap. When translating to the aln coordinate system, the kind is never gap.

See also: CIGARStrings.TranslationKind

Examples:

julia> c = CIGAR("2M3D2M2I1M");

julia> for i in 0:8
           refpos = query_to_ref(c, i)
           println(refpos.pos, ' ', refpos.kind)
       end
0 outside
1 pos
2 pos
6 pos
7 pos
7 gap
7 gap
8 pos
0 outside
CIGARStrings.TranslationKindType
TranslationKind

This enum has values outside, pos and gap. It represents the result of translating a position between query, reference and alignment.

  • If outside, the input position translates to a position outside the alignment

  • If pos, the input position corresponds to a non-gap symbol in the alignment

  • If gap, the input position maps to a gap.

See also: Translation

Normalization

The CIGAR format is redundant, in that the same alignment can be written in multiple different ways. In particular:

  • The P and H operations mean nothing with respect to the query and reference. P is only used to pad with respect to a third sequence, and H signifies that part of the true query is missing from the input query sequence.

  • The = and X operations are usually redundant with M, since the information of matches/mismatches is not given by the alignment itself, but can be determined from the input sequences given the alignment.

  • Consecutive runs of the same operation are allowed, such as 1M1M, but are better written as 2M.

This package provides the functions normalize, normalize!, and unsafe_normalize, which create new CIGARs written in canonical form. In canonical form, each of the points above is addressed: H and P is removed, = and X are converted to M, and consecutive identical operations are merged.

Note that the normalized form of a CIGAR corresponds to the same pairwise alignment. Therefore, it is guaranteed that if is_compatible(a, b), then normalize(a) == normalize(b) (though not the other way around). It is also guaranteed that the result of position translation is identical for a CIGAR and its normalized version.

Errors and error recovery

CIGARStrings.jl allows you to parse a potential CIGAR string without throwing an exception if the data is invalid, using the function CIGARStrings.try_parse.

CIGARStrings.CIGARErrorType
CIGARError

Exception kind thrown by the package CIGARStrings. CIGARErrors contain two properties: .kind, returning a CIGARErrorType, and .index, returning an Int, pointing to the approximate byte index where the exception was encountered.

The kind and index are stable API, in the sense that e.g. an integer overflow at position 55 will always be represented by a CIGARError(55, CIGARStrings.Errors.IntegerOverflow). However, the same invalid CIGAR string may produce multiple different errors, and which error is produced in that case is NOT stable API, because that depends on specifics parsing internals.

julia> ce = CIGARStrings.try_parse(CIGAR, "15M9");

julia> ce.index
4

julia> ce.kind
Truncated::CIGARErrorType = 0x05
CIGARStrings.Errors.CIGARErrorTypeType
CIGARErrorType

Single-byte enum representing the different kind of errors returned or thrown by the package CIGARStrings. This currently implemented list of error kinds is not exhaustive, in that more could be added in minor versions of this package.

See also: CIGARError

CIGARStrings.try_parseFunction
try_parse(::Type{CIGAR}, x)::Union{CIGAR, CIGARError}

Cast x to a MemoryView{UInt8}, and try parsing a CIGAR from it. If the parsing is unsuccessful, return a CIGARError

Examples

julia> c = CIGARStrings.try_parse(CIGAR, "2S1M9I");

julia> c isa CIGAR # success
true

julia> c = CIGARStrings.try_parse(CIGAR, "1S7H9M1S");

julia> c.kind
InvalidHardClip::CIGARErrorType = 0x03

The BAMCIGAR type

A CIGAR in the BAM format is stored in an array of 32-bit integers. However, in order to make zero-copy CIGARs possible, the BAMCIGAR type is backed by an ImmutableMemoryView{UInt8} instead, with the same memory layout as its equivalent Memory{UInt32}.

CIGARStrings.BAMCIGARType
BAMCIGAR <: AbstractCIGAR

A BAMCIGAR is an alternative representation of a CIGAR, stored compactly in 32-bit integers. Semantically, a BAMCIGAR behaves much similar to a CIGAR.

Construct a BAMCIGAR either from a CIGAR, or using encode! or encode_append!

Examples

julia> c = CIGAR("9S123M1=3I15M2H");

julia> b = BAMCIGAR(c);

julia> c == b
true

julia> CIGAR(b)
CIGAR("9S123M1=3I15M2H")

A BAMCIGAR can be constructed from its binary representation using any type that implements MemoryViews.MemoryView:

julia> BAMCIGAR("\x54\4\0\0\x70\4\0\0")
BAMCIGAR(CIGAR("69S71M"))

This is not zero-cost: like CIGAR, the type contains some metadata and is validated upon construction.

Like CIGAR, the try_parse function can be used:

julia> CIGARStrings.try_parse(BAMCIGAR, "\x5f\4\0\0\x70\4\0\0")
CIGARStrings.CIGARError(1, CIGARStrings.Errors.InvalidOperation)

CIGAR and BAMCIGAR can be converted infallably to each other:

julia> c = CIGAR("6H19S18M1I22=8I2S");

julia> b = BAMCIGAR(c);

julia> b == c
true

julia> CIGAR(b) == c
true

Note that printing a BAMCIGAR allocates, because it needs to allocate a new piece of memory to store its ASCII representation.