CIGARStrings.jl
CIGARStrings.jl provides functionality for parsing and working with Concise Idiosyncratic Gapped Alignment Report (CIGAR) strings. CIGARs were popularized by the SAM format, and are a compact run-length encoding notation used to represent pairwise alignments. They can be found in the SAM, BAM, PAF, and GFA formats.
For example, the following pairwise alignment of a query to a reference:
Q: TAGAT---TAGCTAC
|||| || | |
R: TAGAACCATA--TGCcan be represented by the CIGAR 5M3D2M2I3M, representing:
5 matches/mismatches
Then, 3 deletions
Then, 2 matches/mismatches
Then, 2 insertions
Finally, 3 matches/mismatches.
A CIGAR string is always written in terms of the query, not the reference.
Individual alignment operations
One run of identical alignment operations, e.g. "5 matches/mismatches," is represented
by a single CIGARElement.
Conceptually, a CIGARElement is an alignment operation (represented by a CIGAROp) and a length:
CIGARStrings.CIGARElement — Type
CIGARElement(op::CIGAROp, len::Integer)Type representing a single element in a CIGAR string, consisting of a
CIGAROp and a length.
For example, in 15S198M1D15M, the four elements are 15S, 198M, 1D,
and 15M.
Access the operation and the length with the properties .op and .len.
Note that currently, the largest supported length is 268435455.
Operations cannot have length zero.
Examples
julia> e = CIGARElement(OP_X, 3)
CIGARElement(OP_X, 3)
julia> e.len
3
julia> e.op
OP_XCIGARStrings.CIGAROp — Type
CIGAROp1-byte primitive type representing a CIGAR operation.
Module-level constants for each value of this type are exported, e.g. OP_M.
A CIGAROp is guaranteed to be a 1-byte primitive with the same bitwise
representation as a UInt8 with the value given as N in the table below,
e.g. reinterpret(UInt8, OP_I) === 0x01.
The Consumes entry below signifies if the operation advances the position
in the query (Q), the reference (R) and the alignment (A).
E.g. if the CIGARElement 9D begins at query, ref, aln positions (A, B, C),
then the positions are (A, B+9, C+9) after.
| N | Consumes | Variable | Char | Description |
|---|---|---|---|---|
| 0 | Q R A | OP_M | M | Alignment match or mismatch |
| 1 | Q A | OP_I | I | Insertion relative to the reference |
| 2 | R A | OP_D | D | Deletion relative to the reference |
| 3 | R | OP_N | N | Reference skipped from the alignment (usually an intron) |
| 4 | Q | OP_S | S | Soft clip (consumes query) |
| 5 | | OP_H | H | Hard clip (does not consume query) |
| 6 | | OP_P | P | Padding, position not present in query or reference |
| 7 | Q R A | OP_Eq | = | Alignment match, not mismatch |
| 8 | Q R A | OP_X | X | Alignment mismatch |
See also: CIGARElement
Extended help
Mmeans a match or a mismatch. By default, most programs will emitMinstead ofXor=, since the important part of the CIGAR is where the insertions and deletions are placed. To determine which bases in anMare matches and mismatches, the two sequences can be compared, base-wise.Nmeans that the region spans an intron, which means the query sequence is deleted, but not due to an actual deletion (which would be aDoperation). It can also be used for other uses where the reference bases are missing for another reason than a deletion, if such a use case is found.SandHboth represent that ends of query does not align. They differ only in whether the clipped bases, i.e. the ends of the query sequence not part of the alignment, are present in the query sequence. For soft clip, the sequence is present in the query, but not for hard clip. Soft clip is preferred, and hard clip is only used in rare cases where removing part of the query sequence can lead to memory savings.Pis only used for a multiple sequence alignment. If a third sequence contains an insertion relative to both the query and the reference, the query has aPat this position, indicating it's present in neither the query nor reference.- Most alignments only contain the operations
M,I,D,S.
CIGARs
A CIGAR string is represented by an AbstractCIGAR, which currently has two subtypes: CIGAR and BAMCIGAR.
These types differ in their memory layout: The former stores the CIGAR as its ASCII representation (as used in the SAM format), and the latter stores it in a binary format (as used in the BAM format).
Both types store their underlying data as an ImmutableMemoryView{UInt8}.
CIGARStrings.AbstractCIGAR — Type
abstract type AbstractCIGARThis abstract type is (not yet) a defined interface.
Its concrete subtypes are CIGAR and BAMCIGAR.
The API for these two types is almost interchangeable, so examples below use CIGAR, since its plaintext representation makes examples easier to read.
See BAMCIGAR section for a list of all differences between the two types.
CIGAR strings are validated upon construction.
julia> CIGAR("2M1D3M")
CIGAR("2M1D3M")
julia> CIGAR("1M1W1S")
ERROR: Error around byte 4: Invalid operation. Possible values are "MIDNSHP=X".
[...]Since CIGAR strings occur in various bioinformatics file formats, it is expected
that users of CIGARStrings.jl will construct CIGARs from a view into a buffer storing a chunk of the file.
This is zero-copy, and does not allocate on Julia 1.14 and later. For example:
julia> data = b"some format with CIGAR: 15M9D18M etc";
julia> buffer = collect(data);
julia> c = CIGAR(view(buffer, 25:32))
CIGAR("15M9D18M")CIGARStrings.CIGAR — Type
CIGAR <: AbstractCIGARA CIGAR string represents the sequence of insertions, matches and deletions
that comprise a pairwise alignment.
Construct a CIGAR from any object x where MemoryView(x) returns a
MemoryView{UInt8}, i.e. any memory-backed bytearray, or string.
Use CIGARStrings.try_parse to attempt to parse a CIGAR string
without throwing an exception if the data is invalid.
See also: CIGARElement
Extended help
CIGAR strings are sequences of CIGARElement, from the 5' to the 3' of the
query (or N- to C-terminal for amino acids).
CIGAR strings comprise the entire query, i.e. the sum of lengths of elements
with the XMI=S operations equals the length of the query.
For example, the query AGCGTAGCACACC that aligns from query base 5 and ref
base 1002, like this:
Q: 5 TAG--CACACC 13
R: 1002 TAGGACAC-CC 1011Is summarized by the CIGAR 4S3M2D3M1I2M. The operations HX=PN are more rarely
used, see CIGAROp for a description of the operations.
CIGARs are iterable, and return their CIGARElements in order:
julia> collect(CIGAR("2M1D3M"))
3-element Vector{CIGARElement}:
CIGARElement(OP_M, 2)
CIGARElement(OP_D, 1)
CIGARElement(OP_M, 3)They can be converted back to strings using string(::CIGAR), or printed into
an IO, in which case their normal string representation is recovered:
julia> c = CIGAR("2M1D3M");
julia> string(c)
"2M1D3M"
julia> io = IOBuffer(); print(io, c); String(take!(io))
"2M1D3M"The memory underlying the CIGAR types can be obtained with MemoryView(x) using the MemoryViews.jl package:
julia> c = CIGAR("19M");
julia> println(MemoryView(c))
UInt8[0x31, 0x39, 0x4d]
julia> println(MemoryView(BAMCIGAR(c)))
UInt8[0x30, 0x01, 0x00, 0x00]Basic information about a CIGAR
The reference, query and alignment length can be obtained with the functions
ref_length, query_length and aln_length.
In the alignment below, represented as 5M3D2M2I3M, the query length is
12, since there are 12 query symbols, the reference length is 13, and the
alignment length is 15.
Q: TAGAT---TAGCTAC
|||| || | |
R: TAGAACCATA--TGCjulia> c = CIGAR("5M3D2M2I3M");
julia> ref_length(c)
13
julia> query_length(c)
12
julia> aln_length(c)
15Since the CIGAR operation M (OP_M) is ambiguous about whether it represents matches,
mismatches, or a combination of these, the function count_matches can be used to
count the number of matches in a CIGAR given the number of mismatches.
Mismatch counts are typically output by mappers, making this information readily accessible.
Alignment identity (number of matches, excluding mismatches, divided by alignment length)
can be obtained with aln_identity.
Like count_matches, this takes the number of mismatches as an argument.
Comparing CIGARs
When comparing CIGARs using ==, Julia checks whether the CIGARs are literally identical, in the
sense that they are composed of the same bytes:
julia> c1 = CIGAR("10M");
julia> c2 = CIGAR("4=1X5=");
julia> c3 = CIGAR("10M");
julia> c1 == c2
false
julia> c1 == c3
trueHowever, in the above example, since the CIGAR operation M signifies a match or a mismatch, all three
CIGARs are indeed compatible, since 10M is also a valid CIGAR annotation for the same alignment
as 4=1X5=.
This notion of compatibility can be tested with is_compatible:
CIGARStrings.is_compatible — Function
is_compatible(a::AbstractCIGAR, b::AbstractCIGAR)::BoolCheck if a and b may refer to the same alignment.
Two distinct CIGARs may refer to the same alignment, because there are multiple ways to annotate the same alignment. In particular, the following rules are used when determining if distinct CIGAR operations are equivalent:
Two consecutive of the same operations may be collapsed, ex:
1M1Mand2MAn
OP_Mcan encompass bothOP_XandOP_Eq, ex:3Mand1X1M1=OP_PandOP_Hhave no semantic meaning and is skipped:1D1P2D1M3Pand3D1M
See also: normalize
Examples:
julia> is_compatible(CIGAR("1="), CIGAR("1X"))
false
julia> is_compatible(CIGAR("1D1D1D"), CIGAR("3D"))
true
julia> is_compatible(CIGAR("1D2M3I"), CIGAR("1D1X1M3I"))
truePosition translation
Sometimes it may be necessary to answer questions of the form "which reference position does query position 8 align to?".
As an example, consider the alignment below. The query position (QP), reference position (RP) and alignment position (AP) are also written in this alignment.
QP:12345 6789012
Q: TAGAT---TAGCTAC
|||| || | |
R: TAGAACCATA--TGC
RP:1234567890 123
AP:123456789012345We can see that query position 6 aligns to reference position 9, which is also alignment position 9.
These position translations can be obtained using the function pos_to_pos,
specifying the source and destination coordinate systems query, ref
or aln.
When passed an integer, this function returns a Translation object with two properties: .pos and .kind.
When a position translation has a straightforward answer, the .kind property is
CIGARStrings.pos, and the .pos field is the corresponding position:
julia> c = CIGAR("4M3D2M2I3M"); # alignment above
julia> pos_to_pos(query, ref, c, 6)
Translation(pos, 9)
julia> pos_to_pos(aln, query, c, 9)
Translation(pos, 6)Note that these operations run in linear time, as they scan the CIGAR string from the beginning.
To efficiently query multiple translations in the same scan of the CIGAR string, you can pass a sorted (ascending) iterator of integers.
In this case, pos_to_pos returns a lazy iterator of Pair{Int, Translation}, representing source_position => mapped_translation:
julia> c = CIGAR("4M3D2M2I3M"); # alignment above
julia> it = pos_to_pos(query, ref, c, [1, 5, 6, 11]);
julia> length(it)
4
julia> collect(it)
4-element Vector{Pair{Int64, Translation}}:
1 => Translation(pos, 1)
5 => Translation(pos, 8)
6 => Translation(pos, 9)
11 => Translation(pos, 12)CIGARStrings.pos_to_pos — Function
pos_to_pos(from::Coordinate, to::Coordinate, cigar::AbstractCIGAR, pos::Integer)::TranslationSimilar to the generic pos_to_pos, but returns the resulting integer immediately instead of returning
a lazy iterator.
pos_to_pos(
from::Coordinate,
to::Coordinate,
cigar::AbstractCIGAR,
pos
)::PositionMapperMap positions from one coordinate system in the alignment given by cigar to another.
Given pos, an iterable of sorted (ascending) integers in the coordinate system of from,
returns a PositionMapper - an iterable of Pair{Int, Translation} mapping each input integer,
in order, to the corresponding coordinate system in to, given as a Translation.
The coordinates may be query, ref or aln:
queryrepresents the 1-based index into the query sequencerefrepresents the 1-based index into the reference sequencealnis the index of the alignment itself, i.e. equivalent to the sequence of either the query or the ref when gaps according to indel operations ofc.
For example, given the query positions p = [4, 9, 11] and c::AbstractCIGAR, the corresponding
positions in the reference can be obtained by [i.second.pos for i in pos_to_pos(query, ref, c, p)]
See also: Translation, ref_to_query, aln_to_ref etc.
julia> iter = pos_to_pos(query, ref, CIGAR("5M2D10M"), [0, 4, 9, 16]);
julia> iter isa CIGARStrings.PositionMapper
true
julia> collect(iter)
4-element Vector{Pair{Int64, Translation}}:
0 => Translation(outside, 0)
4 => Translation(pos, 4)
9 => Translation(pos, 11)
16 => Translation(outside, 0)CIGARStrings.Translation — Type
Translation(kind::TranslationKind, pos::Int)The result of translating from a position in the coordinate system of the
query / reference / alignment to a position in one of the others.
This type contains two documented properties: kind::TranslationKind and pos::Int.
If the resulting position is outside the target coordinate,
.kind == outsideandpos == 0If the position maps to a non-gap symbol in the other coordinate system,
.kind == pos, and the position is thepos'd symbol in the target coordinate system.If the position maps to a gap,
posis the position of the symbol before the gap, andkind == gap. When translating to thealncoordinate system, the kind is nevergap.
See also: CIGARStrings.TranslationKind
Examples:
julia> c = CIGAR("2M3D2M2I1M");
julia> for i in 0:8
refpos = query_to_ref(c, i)
println(refpos.pos, ' ', refpos.kind)
end
0 outside
1 pos
2 pos
6 pos
7 pos
7 gap
7 gap
8 pos
0 outsideCIGARStrings.TranslationKind — Type
TranslationKindThis enum has values outside, pos and gap. It represents the result of
translating a position between query, reference and alignment.
If
outside, the input position translates to a position outside the alignmentIf
pos, the input position corresponds to a non-gap symbol in the alignmentIf
gap, the input position maps to a gap.
See also: Translation
Normalization
The CIGAR format is redundant, in that the same alignment can be written in multiple different ways. In particular:
The
PandHoperations mean nothing with respect to the query and reference.Pis only used to pad with respect to a third sequence, andHsignifies that part of the true query is missing from the input query sequence.The
=andXoperations are usually redundant withM, since the information of matches/mismatches is not given by the alignment itself, but can be determined from the input sequences given the alignment.Consecutive runs of the same operation are allowed, such as
1M1M, but are better written as2M.
This package provides the functions normalize, normalize!, and unsafe_normalize, which create new CIGARs written in canonical form.
In canonical form, each of the points above is addressed: H and P is removed, = and X are converted to M, and consecutive identical operations are merged.
Note that the normalized form of a CIGAR corresponds to the same pairwise alignment.
Therefore, it is guaranteed that if is_compatible(a, b), then normalize(a) == normalize(b) (though not the other way around).
It is also guaranteed that the result of position translation is identical for a CIGAR and its normalized version.
Errors and error recovery
CIGARStrings.jl allows you to parse a potential CIGAR string without throwing an exception if the data is invalid, using the function CIGARStrings.try_parse.
CIGARStrings.CIGARError — Type
CIGARErrorException kind thrown by the package CIGARStrings.
CIGARErrors contain two properties: .kind, returning a CIGARErrorType,
and .index, returning an Int, pointing to the approximate byte index where the
exception was encountered.
The kind and index are stable API, in the sense that e.g. an integer overflow
at position 55 will always be represented by a
CIGARError(55, CIGARStrings.Errors.IntegerOverflow).
However, the same invalid CIGAR string may produce multiple different errors, and which
error is produced in that case is NOT stable API, because that depends on specifics
parsing internals.
julia> ce = CIGARStrings.try_parse(CIGAR, "15M9");
julia> ce.index
4
julia> ce.kind
Truncated::CIGARErrorType = 0x05CIGARStrings.Errors.CIGARErrorType — Type
CIGARErrorTypeSingle-byte enum representing the different kind of errors returned or thrown by the package CIGARStrings. This currently implemented list of error kinds is not exhaustive, in that more could be added in minor versions of this package.
See also: CIGARError
CIGARStrings.try_parse — Function
try_parse(::Type{CIGAR}, x)::Union{CIGAR, CIGARError}Cast x to a MemoryView{UInt8}, and try parsing a CIGAR from it.
If the parsing is unsuccessful, return a CIGARError
Examples
julia> c = CIGARStrings.try_parse(CIGAR, "2S1M9I");
julia> c isa CIGAR # success
true
julia> c = CIGARStrings.try_parse(CIGAR, "1S7H9M1S");
julia> c.kind
InvalidHardClip::CIGARErrorType = 0x03The BAMCIGAR type
A CIGAR in the BAM format is stored in an array of 32-bit integers.
However, in order to make zero-copy CIGARs possible, the BAMCIGAR type is backed by an ImmutableMemoryView{UInt8} instead, with the same memory layout as its equivalent Memory{UInt32}.
CIGARStrings.BAMCIGAR — Type
BAMCIGAR <: AbstractCIGARA BAMCIGAR is an alternative representation of a CIGAR,
stored compactly in 32-bit integers.
Semantically, a BAMCIGAR behaves much similar to a CIGAR.
Construct a BAMCIGAR either from a CIGAR, or using encode!
or encode_append!
Examples
julia> c = CIGAR("9S123M1=3I15M2H");
julia> b = BAMCIGAR(c);
julia> c == b
true
julia> CIGAR(b)
CIGAR("9S123M1=3I15M2H")A BAMCIGAR can be constructed from its binary representation using any type that implements MemoryViews.MemoryView:
julia> BAMCIGAR("\x54\4\0\0\x70\4\0\0")
BAMCIGAR(CIGAR("69S71M"))This is not zero-cost: like CIGAR, the type contains some metadata and is validated upon construction.
Like CIGAR, the try_parse function can be used:
julia> CIGARStrings.try_parse(BAMCIGAR, "\x5f\4\0\0\x70\4\0\0")
CIGARStrings.CIGARError(1, CIGARStrings.Errors.InvalidOperation)CIGAR and BAMCIGAR can be converted infallably to each other:
julia> c = CIGAR("6H19S18M1I22=8I2S");
julia> b = BAMCIGAR(c);
julia> b == c
true
julia> CIGAR(b) == c
trueNote that printing a BAMCIGAR allocates, because it needs to allocate a new piece of memory to store its ASCII representation.