API Reference
Operations
BioAlignments.Operation
— Type.Alignment operation.
BioAlignments.OP_MATCH
— Constant.'M'
: non-specific match
BioAlignments.OP_INSERT
— Constant.'I'
: insertion into reference sequence
BioAlignments.OP_DELETE
— Constant.'D'
: deletion from reference sequence
BioAlignments.OP_SKIP
— Constant.'N'
: (typically long) deletion from the reference, e.g. due to RNA splicing
BioAlignments.OP_SOFT_CLIP
— Constant.'S'
: sequence removed from the beginning or end of the query sequence but stored
BioAlignments.OP_HARD_CLIP
— Constant.'H'
: sequence removed from the beginning or end of the query sequence and not stored
BioAlignments.OP_PAD
— Constant.'P'
: not currently supported, but present for SAM/BAM compatibility
BioAlignments.OP_SEQ_MATCH
— Constant.'='
: match operation with matching sequence positions
BioAlignments.OP_SEQ_MISMATCH
— Constant.'X'
: match operation with mismatching sequence positions
BioAlignments.OP_BACK
— Constant.'B'
: not currently supported, but present for SAM/BAM compatibility
BioAlignments.OP_START
— Constant.'0'
: indicate the start of an alignment within the reference and query sequence
BioAlignments.ismatchop
— Function.ismatchop(op::Operation)
Test if op
is a match operation (i.e. op ∈ (OP_MATCH, OP_SEQ_MATCH, OP_SEQ_MISMATCH)
).
BioAlignments.isinsertop
— Function.isinsertop(op::Operation)
Test if op
is a insertion operation (i.e. op ∈ (OP_INSERT, OP_SOFT_CLIP, OP_HARD_CLIP)
).
BioAlignments.isdeleteop
— Function.isdeleteop(op::Operation)
Test if op
is a deletion operation (i.e. op ∈ (OP_DELETE, OP_SKIP)
).
Alignments
BioAlignments.AlignmentAnchor
— Type.Alignment operation with anchoring positions.
BioAlignments.Alignment
— Type.Alignment of two sequences.
BioAlignments.Alignment
— Method.Alignment(anchors::Vector{AlignmentAnchor}, check=true)
Create an alignment object from a sequence of alignment anchors.
BioAlignments.Alignment
— Method.Alignment(cigar::AbstractString, seqpos=1, refpos=1)
Make an alignment object from a CIGAR string.
seqpos
and refpos
specify the starting positions of two sequences.
BioAlignments.seq2ref
— Method.seq2ref(aln::Alignment, i::Integer)::Tuple{Int,Operation}
Map a position i
from sequence to reference.
BioAlignments.ref2seq
— Method.ref2seq(aln::Alignment, i::Integer)::Tuple{Int,Operation}
Map a position i
from reference to sequence.
BioAlignments.cigar
— Method.cigar(aln::Alignment)
Make a CIGAR string encoding of aln
.
This is not entirely lossless as it discards the alignments start positions.
Substitution matrices
Supertype of substitution matrix.
The required method:
Base.getindex(submat, x, y)
: substitution score/cost fromx
toy
BioAlignments.SubstitutionMatrix
— Type.Substitution matrix.
Dichotomous substitution matrix.
BioAlignments.EDNAFULL
— Constant.EDNAFULL (or NUC4.4) substitution matrix
BioAlignments.PAM30
— Constant.PAM30 substitution matrix
BioAlignments.PAM70
— Constant.PAM70 substitution matrix
BioAlignments.PAM250
— Constant.PAM250 substitution matrix
BioAlignments.BLOSUM45
— Constant.BLOSUM45 substitution matrix
BioAlignments.BLOSUM50
— Constant.BLOSUM50 substitution matrix
BioAlignments.BLOSUM62
— Constant.BLOSUM62 substitution matrix
BioAlignments.BLOSUM80
— Constant.BLOSUM80 substitution matrix
BioAlignments.BLOSUM90
— Constant.BLOSUM90 substitution matrix
Pairwise alignments
BioAlignments.PairwiseAlignment
— Type.Pairwise alignment
Base.count
— Method.count(aln::PairwiseAlignment, target::Operation)
Count the number of positions where the target
operation is applied.
BioAlignments.count_matches
— Function.count_matches(aln)
Count the number of matching positions.
BioAlignments.count_mismatches
— Function.count_mismatches(aln)
Count the number of mismatching positions.
BioAlignments.count_insertions
— Function.count_insertions(aln)
Count the number of inserting positions.
BioAlignments.count_deletions
— Function.count_deletions(aln)
Count the number of deleting positions.
BioAlignments.count_aligned
— Function.count_aligned(aln)
Count the number of aligned positions.
BioAlignments.GlobalAlignment
— Type.Global-global alignment with end gap penalties.
Global-local alignment.
BioAlignments.OverlapAlignment
— Type.Global-global alignment without end gap penalties.
BioAlignments.LocalAlignment
— Type.Local-local alignment.
BioAlignments.EditDistance
— Type.Edit distance.
BioAlignments.HammingDistance
— Type.Hamming distance.
A special case of EditDistance
with the costs of insertion and deletion are infinitely large.
Levenshtein distance.
A special case of EditDistance
with the costs of mismatch, insertion, and deletion are 1.
BioAlignments.AbstractScoreModel
— Type.Supertype of score model.
AffineGapScoreModel(submat, gap_open, gap_extend)
AffineGapScoreModel(submat, gap_open=, gap_extend=)
AffineGapScoreModel(match=, mismatch=, gap_open=, gap_extend=)
Affine gap scoring model.
This creates an affine gap scroing model object for alignment from a substitution matrix (submat
), a gap opening score (gap_open
), and a gap extending score (gap_extend
). A consecutive gap of length k
has a score of gap_open + gap_extend * k
. Note that both of the gap scores should be non-positive. As a shorthand of creating a dichotomous substitution matrix, you can write as, for example, AffineGapScoreModel(match=5, mismatch=-3, gap_open=-2, gap_extend=-1)
.
Example
using BioSequences
using BioAlignments
# create an affine gap scoring model from a predefined substitution
# matrix and gap opening/extending scores.
affinegap = AffineGapScoreModel(BLOSUM62, gap_open=-10, gap_extend=-1)
# run global alignment between two amino acid sequenecs
pairalign(GlobalAlignment(), aa"IDGAAGQQL", aa"IDGATGQL", affinegap)
See also: SubstitutionMatrix
, pairalign
, CostModel
BioAlignments.AbstractCostModel
— Type.Supertype of cost model.
BioAlignments.CostModel
— Type.CostModel(submat, insertion, deletion)
CostModel(submat, insertion=, deletion=)
CostModel(match=, mismatch=, insertion=, deletion=)
Cost model.
This creates a cost model object for alignment from substitution matrix (submat
), an insertion cost (insertion
), and a deletion cost (deletion
). Note that both of the insertion and deletion costs should be non-negative. As a shorthand of creating a dichotomous substitution matrix, you can write as, for example, CostModel(match=0, mismatch=1, insertion=2, deletion=2)
.
Example
using BioAlignments
# create a cost model from a substitution matrix and indel costs
cost = CostModel(ones(128, 128) - eye(128), insertion=.5, deletion=.5)
# run global alignment to minimize edit distance
pairalign(EditDistance(), "intension", "execution", cost)
See also: SubstitutionMatrix
, pairalign
, AffineGapScoreModel
Result of pairwise alignment
BioAlignments.pairalign
— Function.pairalign(type, seq, ref, model, [options...])
Run pairwise alignment between two sequences: seq
and ref
.
Available type
s are:
GlobalAlignment()
LocalAlignment()
SemiGlobalAlignment()
OverlapAlignment()
EditDistance()
LevenshteinDistance()
HammingDistance()
GlobalAlignment
, LocalAlignment
, SemiGlobalAlignment
, and OverlapAlignment
are problem that maximizes alignment score between two sequences. Therefore, model
should be an instance of AbstractScoreModel
(e.g. AffineGapScoreModel
).
EditDistance
, LevenshteinDistance
, and HammingDistance
are problem that minimizes alignment cost between two sequences. As for EditDistance
, model
should be an instance of AbstractCostModel
(e.g. CostModel
). LevenshteinDistance
and HammingDistance
have predefined a cost model, so users cannot specify a cost model for these alignment types.
When you pass the score_only=true
or distance_only=true
option to pairalign
, the result of pairwise alignment holds alignment score/distance only. This may enable some algorithms to run faster than calculating full alignment result. Other available options
are documented for each alignemnt type.
Example
using BioSequences
using BioAlignments
# create affine gap scoring model
affinegap = AffineGapScoreModel(
match=5,
mismatch=-4,
gap_open=-5,
gap_extend=-3
)
# run global alignment between two DNA sequences
pairalign(GlobalAlignment(), dna"AGGTAG", dna"ATTG", affinegap)
# run local alignment between two DNA sequences
pairalign(LocalAlignment(), dna"AGGTAG", dna"ATTG", affinegap)
# you cannot specify a cost model in LevenshteinDistance
pairalign(LevenshteinDistance(), dna"AGGTAG", dna"ATTG")
See also: AffineGapScoreModel
, CostModel
BioAlignments.score
— Function.score(alignment_result)
Return score of alignment.
BioCore.distance
— Function.distance(alignment_result)
Retrun distance of alignment.
BioAlignments.alignment
— Function.alignment(alignment_result)
Return alignment if any.
See also: hasalignment
BioAlignments.hasalignment
— Function.hasalignment(alignment_result)
Check if alignment is stored or not.
BioAlignments.seq2ref
— Method.seq2ref(aln::PairwiseAlignment, i::Integer)::Tuple{Int,Operation}
Map a position i
from the first sequence to the second.
BioAlignments.ref2seq
— Method.ref2seq(aln::PairwiseAlignment, i::Integer)::Tuple{Int,Operation}
Map a position i
from the second sequence to the first.
I/O
SAM
BioAlignments.SAM.Reader
— Type.SAM.Reader(input::IO)
Create a data reader of the SAM file format.
Arguments
input
: data source
BioAlignments.SAM.header
— Function.header(reader::Reader)::Header
Get the header of reader
.
BioAlignments.SAM.Header
— Type.SAM.Header()
Create an empty header.
Base.find
— Method.find(header::Header, key::AbstractString)::Vector{MetaInfo}
Find metainfo objects satisfying SAM.tag(metainfo) == key
.
BioAlignments.SAM.Writer
— Type.Writer(output::IO, header::Header=Header())
Create a data writer of the SAM file format.
Arguments
output
: data sinkheader=Header()
: SAM header object
BioAlignments.SAM.MetaInfo
— Type.MetaInfo(str::AbstractString)
Create a SAM metainfo from str
.
Examples
julia> SAM.MetaInfo("@CO some comment")
BioAlignments.SAM.MetaInfo:
tag: CO
value: some comment
julia> SAM.MetaInfo("@SQ SN:chr1 LN:12345")
BioAlignments.SAM.MetaInfo:
tag: SQ
value: SN=chr1 LN=12345
MetaInfo(tag::AbstractString, value)
Create a SAM metainfo with tag
and value
.
tag
is a two-byte ASCII string. If tag
is "CO"
, value
must be a string; otherwise, value
is an iterable object with key and value pairs.
Examples
julia> SAM.MetaInfo("CO", "some comment")
BioAlignments.SAM.MetaInfo:
tag: CO
value: some comment
julia> string(ans)
"@CO some comment"
julia> SAM.MetaInfo("SQ", ["SN" => "chr1", "LN" => 12345])
BioAlignments.SAM.MetaInfo:
tag: SQ
value: SN=chr1 LN=12345
julia> string(ans)
"@SQ SN:chr1 LN:12345"
BioAlignments.SAM.iscomment
— Function.iscomment(metainfo::MetaInfo)::Bool
Test if metainfo
is a comment (i.e. its tag is "CO").
BioAlignments.SAM.tag
— Function.tag(metainfo::MetaInfo)::String
Get the tag of metainfo
.
BioAlignments.SAM.value
— Function.value(metainfo::MetaInfo)::String
Get the value of metainfo
as a string.
BioAlignments.SAM.keyvalues
— Function.keyvalues(metainfo::MetaInfo)::Vector{Pair{String,String}}
Get the values of metainfo
as string pairs.
BioAlignments.SAM.Record
— Type.SAM.Record()
Create an unfilled SAM record.
SAM.Record(data::Vector{UInt8})
Create a SAM record from data
. This function verifies the format and indexes fields for accessors. Note that the ownership of data
is transferred to a new record object.
SAM.Record(str::AbstractString)
Create a SAM record from str
. This function verifies the format and indexes fields for accessors.
BioAlignments.SAM.flag
— Function.flag(record::Record)::UInt16
Get the bitwise flag of record
.
BioAlignments.SAM.ismapped
— Function.ismapped(record::Record)::Bool
Test if record
is mapped.
BioAlignments.SAM.isprimary
— Function.isprimary(record::Record)::Bool
Test if record
is a primary line of the read.
This is equivalent to flag(record) & 0x900 == 0
.
BioAlignments.SAM.refname
— Function.refname(record::Record)::String
Get the reference sequence name of record
.
BioAlignments.SAM.position
— Function.position(record::Record)::Int
Get the 1-based leftmost mapping position of record
.
BioAlignments.SAM.rightposition
— Function.rightposition(record::Record)::Int
Get the 1-based rightmost mapping position of record
.
BioAlignments.SAM.isnextmapped
— Function.isnextmapped(record::Record)::Bool
Test if the mate/next read of record
is mapped.
BioAlignments.SAM.nextrefname
— Function.nextrefname(record::Record)::String
Get the reference name of the mate/next read of record
.
BioAlignments.SAM.nextposition
— Function.nextposition(record::Record)::Int
Get the position of the mate/next read of record
.
BioAlignments.SAM.mappingquality
— Function.mappingquality(record::Record)::UInt8
Get the mapping quality of record
.
BioAlignments.SAM.cigar
— Function.cigar(record::Record)::String
Get the CIGAR string of record
.
BioAlignments.SAM.alignment
— Function.alignment(record::Record)::BioAlignments.Alignment
Get the alignment of record
.
BioAlignments.SAM.alignlength
— Function.alignlength(record::Record)::Int
Get the alignment length of record
.
BioAlignments.SAM.tempname
— Function.tempname(record::Record)::String
Get the query template name of record
.
BioAlignments.SAM.templength
— Function.templength(record::Record)::Int
Get the template length of record
.
BioAlignments.SAM.sequence
— Function.sequence(record::Record)::BioSequences.DNASequence
Get the segment sequence of record
.
sequence(::Type{String}, record::Record)::String
Get the segment sequence of record
as String
.
BioAlignments.SAM.seqlength
— Function.seqlength(record::Record)::Int
Get the sequence length of record
.
BioAlignments.SAM.quality
— Function.quality(record::Record)::Vector{UInt8}
Get the Phred-scaled base quality of record
.
quality(::Type{String}, record::Record)::String
Get the ASCII-encoded base quality of record
.
BioAlignments.SAM.auxdata
— Function.auxdata(record::Record)::Dict{String,Any}
Get the auxiliary data (optional fields) of record
.
BioAlignments.SAM.FLAG_PAIRED
— Constant.0x0001: the read is paired in sequencing, no matter whether it is mapped in a pair
BioAlignments.SAM.FLAG_PROPER_PAIR
— Constant.0x0002: the read is mapped in a proper pair
BioAlignments.SAM.FLAG_UNMAP
— Constant.0x0004: the read itself is unmapped; conflictive with SAM.FLAG_PROPER_PAIR
BioAlignments.SAM.FLAG_MUNMAP
— Constant.0x0008: the mate is unmapped
BioAlignments.SAM.FLAG_REVERSE
— Constant.0x0010: the read is mapped to the reverse strand
BioAlignments.SAM.FLAG_MREVERSE
— Constant.0x0020: the mate is mapped to the reverse strand
BioAlignments.SAM.FLAG_READ1
— Constant.0x0040: this is read1
BioAlignments.SAM.FLAG_READ2
— Constant.0x0080: this is read2
BioAlignments.SAM.FLAG_SECONDARY
— Constant.0x0100: not primary alignment
BioAlignments.SAM.FLAG_QCFAIL
— Constant.0x0200: QC failure
BioAlignments.SAM.FLAG_DUP
— Constant.0x0400: optical or PCR duplicate
BioAlignments.SAM.FLAG_SUPPLEMENTARY
— Constant.0x0800: supplementary alignment
BAM
BioAlignments.BAM.Reader
— Type.BAM.Reader(input::IO; index=nothing)
Create a data reader of the BAM file format.
Arguments
input
: data sourceindex=nothing
: filepath to a random access index (currently bai is supported)
BioAlignments.BAM.header
— Function.header(reader::Reader; fillSQ::Bool=false)::SAM.Header
Get the header of reader
.
If fillSQ
is true
, this function fills missing "SQ" metainfo in the header.
BioAlignments.BAM.Writer
— Type.BAM.Writer(output::BGZFStream, header::SAM.Header)
Create a data writer of the BAM file format.
Arguments
output
: data sinkheader
: SAM header object
BioAlignments.BAM.Record
— Type.BAM.Record()
Create an unfilled BAM record.
BioAlignments.BAM.flag
— Function.flag(record::Record)::UInt16
Get the bitwise flag of record
.
BioAlignments.BAM.ismapped
— Function.ismapped(record::Record)::Bool
Test if record
is mapped.
BioAlignments.BAM.isprimary
— Function.isprimary(record::Record)::Bool
Test if record
is a primary line of the read.
This is equivalent to flag(record) & 0x900 == 0
.
BioAlignments.BAM.ispositivestrand
— Function.ispositivestrand(record::Record)::Bool
Test if record
is aligned to the positive strand.
This is equivalent to flag(record) & 0x10 == 0
.
BioAlignments.BAM.refid
— Function.refid(record::Record)::Int
Get the reference sequence ID of record
.
The ID is 1-based (i.e. the first sequence is 1) and is 0 for a record without a mapping position.
See also: BAM.rname
BioAlignments.BAM.refname
— Function.refname(record::Record)::String
Get the reference sequence name of record
.
See also: BAM.refid
BioAlignments.BAM.position
— Function.position(record::Record)::Int
Get the 1-based leftmost mapping position of record
.
BioAlignments.BAM.rightposition
— Function.rightposition(record::Record)::Int
Get the 1-based rightmost mapping position of record
.
BioAlignments.BAM.isnextmapped
— Function.isnextmapped(record::Record)::Bool
Test if the mate/next read of record
is mapped.
BioAlignments.BAM.nextrefid
— Function.nextrefid(record::Record)::Int
Get the next/mate reference sequence ID of record
.
BioAlignments.BAM.nextrefname
— Function.nextrefname(record::Record)::String
Get the reference name of the mate/next read of record
.
BioAlignments.BAM.nextposition
— Function.nextposition(record::Record)::Int
Get the 1-based leftmost mapping position of the next/mate read of record
.
BioAlignments.BAM.mappingquality
— Function.mappingquality(record::Record)::UInt8
Get the mapping quality of record
.
BioAlignments.BAM.cigar
— Function.cigar(record::Record)::String
Get the CIGAR string of record
.
Note that in the BAM specification, the field called cigar
typically stores the cigar string of the record. However, this is not always true, sometimes the true cigar is very long, and due to some constraints of the BAM format, the actual cigar string is stored in an extra tag: CG:B,I
, and the cigar
field stores a pseudo-cigar string.
Calling this method with checkCG
set to true
(default) this method will always yield the true cigar string, because this is probably what you want the vast majority of the time.
If you have a record that stores the true cigar in a CG:B,I
tag, but you still want to access the pseudo-cigar that is stored in the cigar
field of the BAM record, then you can set checkCG to false
.
See also BAM.cigar_rle
.
BioAlignments.BAM.cigar_rle
— Function.cigar_rle(record::Record, checkCG::Bool = true)::Tuple{Vector{BioAlignments.Operation},Vector{Int}}
Get a run-length encoded tuple (ops, lens)
of the CIGAR string in record
.
Note that in the BAM specification, the field called cigar
typically stores the cigar string of the record. However, this is not always true, sometimes the true cigar is very long, and due to some constraints of the BAM format, the actual cigar string is stored in an extra tag: CG:B,I
, and the cigar
field stores a pseudo-cigar string.
Calling this method with checkCG
set to true
(default) this method will always yield the true cigar string, because this is probably what you want the vast majority of the time.
If you have a record that stores the true cigar in a CG:B,I
tag, but you still want to access the pseudo-cigar that is stored in the cigar
field of the BAM record, then you can set checkCG to false
.
See also BAM.cigar
.
BioAlignments.BAM.alignment
— Function.alignment(record::Record)::BioAlignments.Alignment
Get the alignment of record
.
BioAlignments.BAM.alignlength
— Function.alignlength(record::Record)::Int
Get the alignment length of record
.
BioAlignments.BAM.tempname
— Function.tempname(record::Record)::String
Get the query template name of record
.
BioAlignments.BAM.templength
— Function.templength(record::Record)::Int
Get the template length of record
.
BioAlignments.BAM.sequence
— Function.sequence(record::Record)::BioSequences.DNASequence
Get the segment sequence of record
.
BioAlignments.BAM.seqlength
— Function.seqlength(record::Record)::Int
Get the sequence length of record
.
BioAlignments.BAM.quality
— Function.quality(record::Record)::Vector{UInt8}
Get the base quality of record
.
BioAlignments.BAM.auxdata
— Function.auxdata(record::Record)::BAM.AuxData
Get the auxiliary data of record
.
BioAlignments.BAM.BAI
— Type.BAI(filename::AbstractString)
Load a BAI index from filename
.
BAI(input::IO)
Load a BAI index from input
.