CIGARStrings.jl reference
CIGARStrings.CIGAR — Type
CIGAR <: AbstractCIGARA CIGAR string represents the sequence of insertions, matches and deletions
that comprise a pairwise alignment.
Construct a CIGAR from any object x where MemoryView(x) returns a
MemoryView{UInt8}, i.e. any memory-backed bytearray, or string.
Use CIGARStrings.try_parse to attempt to parse a CIGAR string
without throwing an exception if the data is invalid.
See also: CIGARElement
Extended help
CIGAR strings are sequences of CIGARElement, from the 5' to the 3' of the
query (or N- to C-terminal for amino acids).
CIGAR strings comprise the entire query, i.e. the sum of lengths of elements
with the XMI=S operations equals the length of the query.
For example, the query AGCGTAGCACACC that aligns from query base 5 and ref
base 1002, like this:
Q: 5 TAG--CACACC 13
R: 1002 TAGGACAC-CC 1011Is summarized by the CIGAR 4S3M2D3M1I2M. The operations HX=PN are more rarely
used, see CIGAROp for a description of the operations.
CIGARStrings.BAMCIGAR — Type
BAMCIGAR <: AbstractCIGARA BAMCIGAR is an alternative representation of a CIGAR,
stored compactly in 32-bit integers.
Semantically, a BAMCIGAR behaves much similar to a CIGAR.
Construct a BAMCIGAR either from a CIGAR, or using encode!
or encode_append!
Examples
julia> c = CIGAR("9S123M1=3I15M2H");
julia> b = BAMCIGAR(c);
julia> c == b
true
julia> CIGAR(b)
CIGAR("9S123M1=3I15M2H")CIGARStrings.CIGARElement — Type
CIGARElement(op::CIGAROp, len::Integer)Type representing a single element in a CIGAR string, consisting of a
CIGAROp and a length.
For example, in 15S198M1D15M, the four elements are 15S, 198M, 1D,
and 15M.
Access the operation and the length with the properties .op and .len.
Note that currently, the largest supported length is 268435455.
Operations cannot have length zero.
Examples
julia> e = CIGARElement(OP_X, 3)
CIGARElement(OP_X, 3)
julia> e.len
3
julia> e.op
OP_XCIGARStrings.CIGAROp — Type
CIGAROp1-byte primitive type representing a CIGAR operation.
Module-level constants for each value of this type are exported, e.g. OP_M.
A CIGAROp is guaranteed to be a 1-byte primitive with the same bitwise
representation as a UInt8 with the value given as N in the table below,
e.g. reinterpret(UInt8, OP_I) === 0x01.
The Consumes entry below signifies if the operation advances the position
in the query (Q), the reference (R) and the alignment (A).
E.g. if the CIGARElement 9D begins at query, ref, aln positions (A, B, C),
then the positions are (A, B+9, C+9) after.
| N | Consumes | Variable | Char | Description |
|---|---|---|---|---|
| 0 | Q R A | OP_M | M | Alignment match or mismatch |
| 1 | Q A | OP_I | I | Insertion relative to the reference |
| 2 | R A | OP_D | D | Deletion relative to the reference |
| 3 | R | OP_N | N | Reference skipped from the alignment (usually an intron) |
| 4 | Q | OP_S | S | Soft clip (consumes query) |
| 5 | | OP_H | H | Hard clip (does not consume query) |
| 6 | | OP_P | P | Padding, position not present in query or reference |
| 7 | Q R A | OP_Eq | = | Alignment match, not mismatch |
| 8 | Q R A | OP_X | X | Alignment mismatch |
See also: CIGARElement
Extended help
Mmeans a match or a mismatch. By default, most programs will emitMinstead ofXor=, since the important part of the CIGAR is where the insertions and deletions are placed. To determine which bases in anMare matches and mismatches, the two sequences can be compared, base-wise.Nmeans that the region spans an intron, which means the query sequence is deleted, but not due to an actual deletion (which would be aDoperation). It can also be used for other uses where the reference bases are missing for another reason than a deletion, if such a use case is found.SandHboth represent that ends of query does not align. They differ only in whether the clipped bases, i.e. the ends of the query sequence not part of the alignment, are present in the query sequence. For soft clip, the sequence is present in the query, but not for hard clip. Soft clip is preferred, and hard clip is only used in rare cases where removing part of the query sequence can lead to memory savings.Pis only used for a multiple sequence alignment. If a third sequence contains an insertion relative to both the query and the reference, the query has aPat this position, indicating it's present in neither the query nor reference.- Most alignments only contain the operations
M,I,D,S.
CIGARStrings.Translation — Type
Translation(kind::TranslationKind, pos::Int)The result of translating from a position in the coordinate system of the
query / reference / alignment to a position in one of the others.
This type contains two documented properties: kind::TranslationKind and pos::Int.
If the resulting position is outside the target coordinate,
.kind == outsideandpos == 0If the position maps to a non-gap symbol in the other coordinate system,
.kind == pos, and the position is thepos'd symbol in the target coordinate system.If the position maps to a gap,
posis the position of the symbol before the gap, andkind == gap. When translating to thealncoordinate system, the kind is nevergap.
See also: CIGARStrings.TranslationKind
Examples:
julia> c = CIGAR("2M3D2M2I1M");
julia> for i in 0:8
refpos = query_to_ref(c, i)
println(refpos.pos, ' ', refpos.kind)
end
0 outside
1 pos
2 pos
6 pos
7 pos
7 gap
7 gap
8 pos
0 outsideCIGARStrings.TranslationKind — Type
TranslationKindThis enum has values outside, pos and gap. It represents the result of
translating a position between query, reference and alignment.
If
outside, the input position translates to a position outside the alignmentIf
pos, the input position corresponds to a non-gap symbol in the alignmentIf
gap, the input position maps to a gap.
See also: Translation
CIGARStrings.CIGARError — Type
CIGARErrorException kind thrown by the package CIGARStrings.
CIGARErrors contain two properties: .kind, returning a CIGARErrorType,
and .index, returning an Int, pointing to the approximate byte index where the
exception was encountered.
The kind and index are stable API, in the sense that e.g. an integer overflow
at position 55 will always be represented by a
CIGARError(55, CIGARStrings.Errors.IntegerOverflow).
However, the same invalid CIGAR string may produce multiple different errors, and which
error is produced in that case is NOT stable API, because that depends on specifics
parsing internals.
julia> ce = CIGARStrings.try_parse(CIGAR, "15M9");
julia> ce.index
4
julia> ce.kind
Truncated::CIGARErrorType = 0x05CIGARStrings.Errors.CIGARErrorType — Type
CIGARErrorTypeSingle-byte enum representing the different kind of errors returned or thrown by the package CIGARStrings. This currently implemented list of error kinds is not exhaustive, in that more could be added in minor versions of this package.
See also: CIGARError
CIGARStrings.try_parse — Function
try_parse(::Type{CIGAR}, x)::Union{CIGAR, CIGARError}Cast x to a MemoryView{UInt8}, and try parsing a CIGAR from it.
If the parsing is unsuccessful, return a CIGARError
Examples
julia> c = CIGARStrings.try_parse(CIGAR, "2S1M9I");
julia> c isa CIGAR # success
true
julia> c = CIGARStrings.try_parse(CIGAR, "1S7H9M1S");
julia> c.kind
InvalidHardClip::CIGARErrorType = 0x03CIGARStrings.unsafe_switch_memory — Function
unsafe_switch_memory(cigar::T, mem::ImmutableMemoryView{UInt8})::T where {T <: AbstractCIGAR}Create a new instance of typeof(cigar) equal to cigar, but using the new memory
mem which must be equal to the existing memory backing cigar.
This operation does not do any validation.
This function is unsafe, because it assumes that mem == MemoryView(cigar).
If this assumption is violated, any subsequent operation on the resulting AbstractCIGAR
may cause undefined behaviour.
Examples
julia> mem = MemoryView("5S12M1X8M10S");
julia> cigar_1 = CIGAR(mem);
julia> cigar_2 = unsafe_switch_memory(cigar_1, copy(mem));
julia> cigar_1 == cigar_2
true
julia> MemoryView(cigar_1) === MemoryView(cigar_2)
falseCIGARStrings.encode! — Function
encode!(
mem::MutableMemoryView{UInt8},
::Type{T <: AbstractCIGAR},
cigar::AbstractCIGAR
)::TEncode cigar as type T into the beginning of mem,
and return the result as a new T backed by mem.
Throw a LightBoundsError if mem is too short to contain the whole encoding.
Since the first bytes of mem are used to back the newly created cigar,
mutating mem after this function is being called will invalidate
the created cigar.
Examples
julia> c = CIGAR("3S5M1D19X2I6M3H");
julia> mem = MemoryView(zeros(UInt8, 60));
julia> bc = encode!(mem, BAMCIGAR, c)
BAMCIGAR(CIGAR("3S5M1D19X2I6M3H"))
julia> n_bytes = length(MemoryView(bc));
julia> ImmutableMemoryView(mem)[1:n_bytes] === MemoryView(bc)
trueCIGARStrings.encode_append! — Function
encode_append!(
v::Vector{UInt8},
::Type{T <: AbstractCIGAR},
cigar::AbstractCIGAR
)::TEncode cigar as type T by appending its representation to v,
and return the result as a new T backed by v's memory.
Since the returned cigar will use v as backing memory,
mutating v after this function has been called will invalidate
the created cigar.
Examples
julia> c = CIGAR("3S5M1D19X2I6M3H");
julia> v = UInt8[1, 2, 3, 4];
julia> bc = encode_append!(v, BAMCIGAR, c)
BAMCIGAR(CIGAR("3S5M1D19X2I6M3H"))
julia> v[1:4] == 1:4 # unchanged
true
julia> MemoryView(bc) === ImmutableMemoryView(v)[5:end]
trueCIGARStrings.OP_M — Constant
'M': Alignment match or mismatch
CIGARStrings.OP_I — Constant
'I': Insertion relative to the reference
CIGARStrings.OP_D — Constant
'D': Deletion relative to the reference
CIGARStrings.OP_S — Constant
'S': Soft clip (clipped sequence present in query)
CIGARStrings.OP_H — Constant
'H': Hard clip (clipped sequence not present in query)
CIGARStrings.OP_N — Constant
'N': Region skipped from the reference (usually an intron)
CIGARStrings.OP_P — Constant
'P': Padding
CIGARStrings.OP_X — Constant
'X': Alignment mismatch
CIGARStrings.OP_Eq — Constant
'=': Alignment match, not mismatch
CIGARStrings.PositionMapper — Type
PositionMapper{Params...}Public struct returned by pos_to_pos(..., pos) when !isa(pos, Integer).
This type may be referred to in stable code, but its API is defined by
pos_to_pos
See also: pos_to_pos
CIGARStrings.query — Function
Coordinate system representing the 1-based index in the query sequence
CIGARStrings.ref — Function
Coordinate system representing the 1-based index in the reference sequence
CIGARStrings.aln — Function
Coordinate system representing the 1-based index in the alignment (i.e. query/ref including gaps)
CIGARStrings.ref_length — Function
ref_length(::AbstractCIGAR)::IntGet the number of biosymbols in the reference of the CIGAR. This is the same
as the lengths of all CIGARElements of type M, D, N, = and X.
See also: query_length, aln_length
Example
julia> ref_length(CIGAR("1S5M2D6M2I5M"))
18CIGARStrings.query_length — Function
query_length(::AbstractCIGAR)::IntGet the number of biosymbols in the query of the CIGAR. This is the same
as the lengths of all CIGARElements of type M, I, S, = and X.
See also: ref_length, aln_length
Example
julia> query_length(CIGAR("1S5M2D6M2I5M"))
19CIGARStrings.aln_length — Function
aln_length(::AbstractCIGAR)::IntGet the number of biosymbols spanned by the alignment of the CIGAR. Clips,
padding and skips are not part of the alignment, but still part of the CIGAR.
Therefore, the alignment length is the same as the lengths of all CIGARElements
of type M, I, D, = and X.
See also: query_length, ref_length
Example
julia> aln_length(CIGAR("1S5M2D6M2I5M"))
20CIGARStrings.aln_identity — Function
aln_identity(::AbstractCIGAR, mismatches::Int)::Float64Compute the alignment identity of the AbstractCIGAR,
computed as the number of matches
(not mismatches) divided by alignment length.
Since the CIGAR itself may not provide information about the precise number of
mismatches, the amount of mismatches is an argument.
The result is always in [0.0, 1.0].
Example
julia> aln_identity(CIGAR("3M1D3M1I2M"), 2)
0.6CIGARStrings.count_matches — Function
count_matches(::AbstractCIGAR, mismatches::Integer)::IntCount the number of matches in the AbstractCIGAR, given the input number
of mismatches.
The mismatches argument is necessary, because OP_M is ambiguous, and
can encode any combination of matches and mismatches.
If mismatches is not in X:(M+X) where M is the number of symbols
marked OP_M, and X the number of OP_X, throw a DomainError.
Examples
julia> c = CIGAR("3S9M2I15=6X7=10M3S"); # 19M, 22=, 6X
julia> count_matches(c, 8)
39
julia> count_matches(c, 25)
22
julia> count_matches(c, 5) # lower bound is 6
ERROR: DomainError
[...]
julia> count_matches(c, 26) # upper bound is 25
ERROR: DomainError
[...]CIGARStrings.query_to_ref — Function
query_to_ref(x::AbstractCIGAR, pos::Integer)::TranslationGet the 1-based reference position aligning to query position pos.
See Translation for more details.
Examples
julia> c = CIGAR("3M2D1M4I2M");
julia> query_to_ref(c, 4)
Translation(pos, 6)CIGARStrings.query_to_aln — Function
query_to_aln(x::AbstractCIGAR, pos::Integer)::TranslationGet the 1-based alignment position aligning to query position pos.
See Translation for more details.
Examples
julia> c = CIGAR("3M2D1M4I2M");
julia> query_to_aln(c, 8)
Translation(pos, 10)CIGARStrings.ref_to_query — Function
ref_to_query(x::AbstractCIGAR, pos::Integer)::TranslationGet the 1-based query position aligning to reference position pos.
See Translation for more details.
Examples
julia> c = CIGAR("3M2D1M4I2M");
julia> ref_to_query(c, 7)
Translation(pos, 9)CIGARStrings.ref_to_aln — Function
ref_to_aln(x::AbstractCIGAR, pos::Integer)::TranslationGet the 1-based alignment position aligning to reference position pos.
See Translation for more details.
Examples
julia> c = CIGAR("3M2D1M4I2M");
julia> ref_to_aln(c, 7)
Translation(pos, 11)CIGARStrings.aln_to_query — Function
aln_to_query(x::AbstractCIGAR, pos::Integer)::TranslationGet the 1-based query position aligning to alignment position pos.
See Translation for more details.
Examples
julia> c = CIGAR("3M2D1M4I2M");
julia> aln_to_query(c, 10)
Translation(pos, 8)CIGARStrings.aln_to_ref — Function
aln_to_ref(x::AbstractCIGAR, pos::Integer)::TranslationGet the 1-based reference position aligning to alignment position pos.
See Translation for more details.
Examples
julia> c = CIGAR("3M2D1M4I2M");
julia> aln_to_ref(c, 9)
Translation(gap, 6)CIGARStrings.normalize — Function
normalize(cigar::T)::T where {T <: AbstractCIGAR}Create a new AbstractCIGAR, where the same alignment as cigar
is written in its canonical form.
The normalization process makes the following changes:
OP_EqandOP_Xare converted toOP_MOP_PandOP_Hoperations are removedConsecutive elements with the same operations are merged
The resulting cigar therefore only contain the operations:
M, I, D, S, N.
If is_compatible(a, b), and a and b are of the same type,
then normalize(a) == normalize(b) is guaranteed, but not the other
way around.
Since a normalized cigar represents the same alignment, normalization
will not affect position translation e.g. using pos_to_pos.
If the merging of two consecutive elements would result in a CIGARElement
with a length of more than 268435455, throw a CIGARError with kind
Errors.IntegerOverflow.
See also: normalize!, unsafe_normalize, is_compatible
Examples
julia> normalize(CIGAR("2H2S3=1X2=4P3=1=1I2X4H"))
CIGAR("2S10M1I2M")
julia> normalize(CIGAR("5M1D8M1I7M"))
CIGAR("5M1D8M1I7M")CIGARStrings.normalize! — Function
normalize!(cigar::T, mem::MutableMemoryView{UInt8})::T where {T <: AbstractCIGAR}Same as normalize, but uses allocation of mem instead of allocating
a new array.
Throws a LightBoundsError if mem cannot hold the normalized cigar.
A cigar after normalization is guaranteed to use at most as much memory
as cigar before normalization, so if length(mem) ≥ length(MemoryView(cigar)),
a bounds error cannot happen.
If mem aliases MemoryView(cigar),
this function may return garbage results.
Use unsafe_normalize to re-use the memory of a cigar
for the normalization result.
See also: normalize, unsafe_normalize, is_compatible
Examples
julia> c = CIGAR("1M2M1=2=3M");
julia> mem = MemoryView([0x00, 0x00]);
julia> c2 = normalize!(c, mem)
CIGAR("9M")
julia> MemoryView(c2) === ImmutableMemoryView(mem)
true
julia> c2 = normalize!(c, MemoryView([0x00]))
ERROR: LightBoundsErrors.LightBoundsError: out-of-bounds indexing: `collection[2]`, where:
[...]CIGARStrings.unsafe_normalize — Function
unsafe_normalize(cigar::T)::T where {T <: AbstractCIGAR}Create a new normalized cigar, re-using the memory of the input cigar.
The normalization process is the same as normalize.
Calling unsafe_normalize on a cigar will mutate its memory, even
though a cigar is constructed from an ImmutableMemoryView.
Callers must ensure that there are no other references to the memory
of cigar which assumes that the memory is immutable.
For example, calling this function on a CIGAR constructed from a
String risks undefined behaviour, since observing Strings being
mutated is undefined behaviour in Julia.
See also: normalize, normalize!, is_compatible
Examples
julia> c = CIGAR(collect(b"2H1X1=2H"));
julia> mem = MemoryView(c);
julia> c2 = unsafe_normalize(c)
CIGAR("2S2M2S")
julia> MemoryView(c2).ref === mem.ref
true