Pairwise Alignemnt
Overview
Pairwise alignment is a sequence alignment between two sequences. BioAlignments.jl implements several pairwise alignment algorithms that maximize alignment score or minimize alignment cost.
Generally, a pairwise alignment problem has two factors: an alignment type and a score/cost model. The alignment type specifies the alignment range (e.g. global, local, etc.) and the score/cost model specifies parameters of the alignment operations.
pairalign
is a funtion to run alignment, which is exported from the BioAlignments
module. It takes an alignment type as its first argument, then two sequences to align, and finally a score model. Currently, the following four types of alignments are supported:
GlobalAlignment
: global-to-global alignmentSemiGlobalAlignment
: local-to-global alignmentLocalAlignment
: local-to-local alignmentOverlapAlignment
: end-free alignment
For scoring model, AffineGapScoreModel
is currently supported. It imposes an affine gap penalty for insertions and deletions: gap_open + k * gap_extend
for a consecutive insertion/deletion of length k
. The affine gap penalty is flexible enough to create a constant and linear scoring model. Setting gap_extend = 0
or gap_open = 0
, they are equivalent to the constant or linear gap penalty, respectively. The first argument of AffineGapScoreModel
can be a substitution matrix like AffineGapScoreModel(BLOSUM62, gap_open=-10, gap_extend=-1)
. For details on substitution matrices, see the Substitution matrices section.
Alignment type can also be a distance of two sequences:
EditDistance
LevenshteinDistance
HammingDistance
In this alignment, CostModel
is used instead of AffineGapScoreModel
to define cost of substitution, insertion, and deletion:
julia> costmodel = CostModel(match=0, mismatch=1, insertion=1, deletion=1); julia> pairalign(EditDistance(), "abcd", "adcde", costmodel) Bio.Align.PairwiseAlignmentResult{Int64,String,String}: distance: 2 seq: 1 abcd- 4 | || ref: 1 adcde 5
Operations on pairwise alignment
pairalign
returns a PairwiseAlignmentResult
object and some accessors are provided for it.
#
BioAlignments.score
— Function.
score(alignment_result)
Return score of alignment.
#
BioCore.distance
— Function.
distance(alignment_result)
Retrun distance of alignment.
#
BioAlignments.hasalignment
— Function.
hasalignment(alignment_result)
Check if alignment is stored or not.
#
BioAlignments.alignment
— Function.
alignment(alignment_result)
Return alignment if any.
See also: hasalignment
Pairwise alignment also implements some useful operations on it.
#
BioAlignments.count_matches
— Function.
count_matches(aln)
Count the number of matching positions.
#
BioAlignments.count_mismatches
— Function.
count_mismatches(aln)
Count the number of mismatching positions.
#
BioAlignments.count_insertions
— Function.
count_insertions(aln)
Count the number of inserting positions.
#
BioAlignments.count_deletions
— Function.
count_deletions(aln)
Count the number of deleting positions.
#
BioAlignments.count_aligned
— Function.
count_aligned(aln)
Count the number of aligned positions.
The example below shows a use case of these operations:
julia> s1 = dna"CCTAGGAGGG"; julia> s2 = dna"ACCTGGTATGATAGCG"; julia> scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=-1); julia> res = pairalign(GlobalAlignment(), s1, s2, scoremodel) # run pairwise alignment Bio.Align.PairwiseAlignmentResult{Int64,Bio.Seq.BioSequence{Bio.Seq.DNAAlphabet{4}},Bio.Seq.BioSequence{Bio.Seq.DNAAlphabet{4}}}: score: 13 seq: 0 -CCTAGG------AGGG 10 ||| || || | ref: 1 ACCT-GGTATGATAGCG 16 julia> score(res) # get the achieved score of this alignment 13 julia> aln = alignment(res) PairwiseAlignment{Bio.Seq.BioSequence{Bio.Seq.DNAAlphabet{4}},Bio.Seq.BioSequence{Bio.Seq.DNAAlphabet{4}}}: seq: 0 -CCTAGG------AGGG 10 ||| || || | ref: 1 ACCT-GGTATGATAGCG 16 julia> count_matches(aln) 8 julia> count_mismatches(aln) 1 julia> count_insertions(aln) 1 julia> count_deletions(aln) 7 julia> count_aligned(aln) 17 julia> collect(aln) # pairwise alignment is iterable 17-element Array{Tuple{Bio.Seq.DNA,Bio.Seq.DNA},1}: (-,A) (C,C) (C,C) (T,T) (A,-) (G,G) (G,G) (-,T) (-,A) (-,T) (-,G) (-,A) (-,T) (A,A) (G,G) (G,C) (G,G) julia> DNASequence([x for (x, _) in aln]) # create aligned `s1` with gaps 17nt DNA Sequence: -CCTAGG------AGGG julia> DNASequence([y for (_, y) in aln]) # create aligned `s2` with gaps 17nt DNA Sequence: ACCT-GGTATGATAGCG
Substitution matrices
A substitution matrix is a function of substitution score (or cost) from one symbol to other. Substitution value of submat
from x
to y
can be obtained by writing submat[x,y]
. In Bio.Align
, SubstitutionMatrix
and DichotomousSubstitutionMatrix
are two distinct types representing substitution matrices.
SubstitutionMatrix
is a general substitution matrix type that is a thin wrapper of regular matrix.
Some common substitution matrices are provided. For DNA and RNA, EDNAFULL
is defined:
julia> EDNAFULL Bio.Align.SubstitutionMatrix{Bio.Seq.DNA,Int64}: A C G T M R W S Y K V H D B N A 5 -4 -4 -4 1 1 1 -4 -4 -4 -1 -1 -1 -4 -2 C -4 5 -4 -4 1 -4 -4 1 1 -4 -1 -1 -4 -1 -2 G -4 -4 5 -4 -4 1 -4 1 -4 1 -1 -4 -1 -1 -2 T -4 -4 -4 5 -4 -4 1 -4 1 1 -4 -1 -1 -1 -2 M 1 1 -4 -4 -1 -2 -2 -2 -2 -4 -1 -1 -3 -3 -1 R 1 -4 1 -4 -2 -1 -2 -2 -4 -2 -1 -3 -1 -3 -1 W 1 -4 -4 1 -2 -2 -1 -4 -2 -2 -3 -1 -1 -3 -1 S -4 1 1 -4 -2 -2 -4 -1 -2 -2 -1 -3 -3 -1 -1 Y -4 1 -4 1 -2 -4 -2 -2 -1 -2 -3 -1 -3 -1 -1 K -4 -4 1 1 -4 -2 -2 -2 -2 -1 -3 -3 -1 -1 -1 V -1 -1 -1 -4 -1 -1 -3 -1 -3 -3 -1 -2 -2 -2 -1 H -1 -1 -4 -1 -1 -3 -1 -3 -1 -3 -2 -1 -2 -2 -1 D -1 -4 -1 -1 -3 -1 -1 -3 -3 -1 -2 -2 -1 -2 -1 B -4 -1 -1 -1 -3 -3 -3 -1 -1 -1 -2 -2 -2 -1 -1 N -2 -2 -2 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 (underlined values are default ones)
For amino acids, PAM (Point Accepted Mutation) and BLOSUM (BLOcks SUbstitution Matrix) matrices are defined:
julia> BLOSUM62 Bio.Align.SubstitutionMatrix{Bio.Seq.AminoAcid,Int64}: A R N D C Q E G H I L K M F P S T W Y V O U B J Z X * A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 0̲ 0̲ -2 0̲ -1 0 -4 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 0̲ 0̲ -1 0̲ 0 -1 -4 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 0̲ 0̲ 3 0̲ 0 -1 -4 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 0̲ 0̲ 4 0̲ 1 -1 -4 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 0̲ 0̲ -3 0̲ -3 -2 -4 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0̲ 0̲ 0 0̲ 3 -1 -4 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 0̲ 0̲ 1 0̲ 4 -1 -4 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 0̲ 0̲ -1 0̲ -2 -1 -4 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0̲ 0̲ 0 0̲ 0 -1 -4 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 0̲ 0̲ -3 0̲ -3 -1 -4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 0̲ 0̲ -4 0̲ -3 -1 -4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0̲ 0̲ 0 0̲ 1 -1 -4 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 0̲ 0̲ -3 0̲ -1 -1 -4 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 0̲ 0̲ -3 0̲ -3 -1 -4 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 0̲ 0̲ -2 0̲ -1 -2 -4 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0̲ 0̲ 0 0̲ 0 0 -4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 0̲ 0̲ -1 0̲ -1 0 -4 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 0̲ 0̲ -4 0̲ -3 -2 -4 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 0̲ 0̲ -3 0̲ -2 -1 -4 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 0̲ 0̲ -3 0̲ -2 -1 -4 O 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ U 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 0̲ 0̲ 4 0̲ 1 -1 -4 J 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 0̲ 0̲ 1 0̲ 4 -1 -4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 0̲ 0̲ -1 0̲ -1 -1 -4 * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 0̲ 0̲ -4 0̲ -4 -4 1 (underlined values are default ones)
Matrix | Constants |
---|---|
PAM | PAM30 , PAM70 , PAM250 |
BLOSUM | BLOSUM45 , BLOSUM50 , BLOSUM62 , BLOSUM80 , BLOSUM90 |
These matrices are downloaded from: ftp://ftp.ncbi.nih.gov/blast/matrices/.
SubstitutionMatrix
can be modified like a regular matrix:
julia> mysubmat = copy(BLOSUM62); # create a copy julia> mysubmat[AA_A,AA_R] # score of AA_A => AA_R substitution is -1 -1 julia> mysubmat[AA_A,AA_R] = -3 # set the score to -3 -3 julia> mysubmat[AA_A,AA_R] # the score is modified -3
Make sure to create a copy of the original matrix when you create a matrix from a predefined matrix. In the above case, BLOSUM62
is shared in the whole program and modification on it will affect any result that uses BLOSUM62
.
DichotomousSubstitutionMatrix
is a specialized matrix for matching or mismatching substitution. This is a preferable choice when performance is more important than flexibility because looking up score is faster than SubstitutionMatrix
.
julia> submat = DichotomousSubstitutionMatrix(1, -1) Bio.Align.DichotomousSubstitutionMatrix{Int64}: match = 1 mismatch = -1 julia> submat['A','A'] # match 1 julia> submat['A','B'] # mismatch -1