Pairwise alignment
Overview
Pairwise alignment is a sequence alignment between two sequences. BioAlignments.jl implements several pairwise alignment algorithms that maximize alignment score or minimize alignment cost. If you are interested in handling the results of pairwise alignments, it is highly recommended to read the Alignment representation chapter in advance to get used to the alignment representation.
Alignment types and scoring models
A pairwise alignment problem has two factors: an alignment type and a score/cost model. The alignment type specifies the alignment range (e.g. global, local, etc.) and the score/cost model specifies parameters of the alignment operations.
pairalign
is a function to run alignment, which is exported from the BioAlignments
module. It takes an alignment type as its first argument, then two sequences to align, and finally a score model. Currently, the following four types of alignments are supported:
GlobalAlignment
: global-to-global alignmentSemiGlobalAlignment
: local-to-global alignmentLocalAlignment
: local-to-local alignmentOverlapAlignment
: end-free alignment
For scoring model, AffineGapScoreModel
is currently supported. It imposes an affine gap penalty for insertions and deletions: gap_open + k * gap_extend
for a consecutive insertion/deletion of length k
. The affine gap penalty is flexible enough to create a constant and linear scoring model. Setting gap_extend = 0
or gap_open = 0
, they are equivalent to the constant or linear gap penalty, respectively. The first argument of AffineGapScoreModel
can be a substitution matrix like AffineGapScoreModel(BLOSUM62, gap_open=-10, gap_extend=-1)
. For details on substitution matrices, see the Substitution matrix types section.
Alignment type can also be a distance of two sequences:
EditDistance
LevenshteinDistance
HammingDistance
In this alignment, CostModel
is used instead of AffineGapScoreModel
to define cost of substitution, insertion, and deletion:
julia> costmodel = CostModel(match=0, mismatch=1, insertion=1, deletion=1);
julia> pairalign(EditDistance(), "abcd", "adcde", costmodel)
PairwiseAlignmentResult{Int64,String,String}:
distance: 2
seq: 1 abcd- 4
| ||
ref: 1 adcde 5
Operations on pairwise alignment
The example below shows a use case of some operations:
julia> s1 = dna"CCTAGGAGGG";
julia> s2 = dna"ACCTGGTATGATAGCG";
julia> scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=-1);
julia> res = pairalign(GlobalAlignment(), s1, s2, scoremodel) # run pairwise alignment
PairwiseAlignmentResult{Int64,BioSequences.BioSequence{BioSequences.DNAAlphabet{4}},BioSequences.BioSequence{BioSequences.DNAAlphabet{4}}}:
score: 13
seq: 0 -CCTAGG------AGGG 10
||| || || |
ref: 1 ACCT-GGTATGATAGCG 16
julia> score(res) # get the achieved score of this alignment
13
julia> aln = alignment(res)
PairwiseAlignment{BioSequences.BioSequence{BioSequences.DNAAlphabet{4}},BioSequences.BioSequence{BioSequences.DNAAlphabet{4}}}:
seq: 0 -CCTAGG------AGGG 10
||| || || |
ref: 1 ACCT-GGTATGATAGCG 16
julia> count_matches(aln)
8
julia> count_mismatches(aln)
1
julia> count_insertions(aln)
1
julia> count_deletions(aln)
7
julia> count_aligned(aln)
17
julia> collect(aln) # pairwise alignment is iterable
17-element Array{Tuple{BioSymbols.DNA,BioSymbols.DNA},1}:
(DNA_Gap, DNA_A)
(DNA_C, DNA_C)
(DNA_C, DNA_C)
(DNA_T, DNA_T)
(DNA_A, DNA_Gap)
(DNA_G, DNA_G)
(DNA_G, DNA_G)
(DNA_Gap, DNA_T)
(DNA_Gap, DNA_A)
(DNA_Gap, DNA_T)
(DNA_Gap, DNA_G)
(DNA_Gap, DNA_A)
(DNA_Gap, DNA_T)
(DNA_A, DNA_A)
(DNA_G, DNA_G)
(DNA_G, DNA_C)
(DNA_G, DNA_G)
julia> DNASequence([x for (x, _) in aln]) # create aligned `s1` with gaps
17nt DNA Sequence:
-CCTAGG------AGGG
julia> DNASequence([y for (_, y) in aln]) # create aligned `s2` with gaps
17nt DNA Sequence:
ACCT-GGTATGATAGCG
Substitution matrix types
A substitution matrix is a function of substitution score (or cost) from one symbol to other. Substitution value of submat
from x
to y
can be obtained by writing submat[x,y]
. In BioAlignments.jl, SubstitutionMatrix
and DichotomousSubstitutionMatrix
are two distinct types representing substitution matrices.
SubstitutionMatrix
is a general substitution matrix type that is a thin wrapper of regular matrix.
Some common substitution matrices are provided. For DNA and RNA, EDNAFULL
is defined:
julia> EDNAFULL
SubstitutionMatrix{BioSymbols.DNA,Int64}:
A C M G R S V T W Y H K D B N
A 5 -4 1 -4 1 -4 -1 -4 1 -4 -1 -4 -1 -4 -2
C -4 5 1 -4 -4 1 -1 -4 -4 1 -1 -4 -4 -1 -2
M 1 1 -1 -4 -2 -2 -1 -4 -2 -2 -1 -4 -3 -3 -1
G -4 -4 -4 5 1 1 -1 -4 -4 -4 -4 1 -1 -1 -2
R 1 -4 -2 1 -1 -2 -1 -4 -2 -4 -3 -2 -1 -3 -1
S -4 1 -2 1 -2 -1 -1 -4 -4 -2 -3 -2 -3 -1 -1
V -1 -1 -1 -1 -1 -1 -1 -4 -3 -3 -2 -3 -2 -2 -1
T -4 -4 -4 -4 -4 -4 -4 5 1 1 -1 1 -1 -1 -2
W 1 -4 -2 -4 -2 -4 -3 1 -1 -2 -1 -2 -1 -3 -1
Y -4 1 -2 -4 -4 -2 -3 1 -2 -1 -1 -2 -3 -1 -1
H -1 -1 -1 -4 -3 -3 -2 -1 -1 -1 -1 -3 -2 -2 -1
K -4 -4 -4 1 -2 -2 -3 1 -2 -2 -3 -1 -1 -1 -1
D -1 -4 -3 -1 -1 -3 -2 -1 -1 -3 -2 -1 -1 -2 -1
B -4 -1 -3 -1 -3 -1 -2 -1 -3 -1 -2 -1 -2 -1 -1
N -2 -2 -1 -2 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1
(underlined values are default ones)
For amino acids, PAM (Point Accepted Mutation) and BLOSUM (BLOcks SUbstitution Matrix) matrices are defined:
julia> BLOSUM62
SubstitutionMatrix{BioSymbols.AminoAcid,Int64}:
A R N D C Q E G H I L K M F P S T W Y V O U B J Z X *
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 0̲ 0̲ -2 0̲ -1 0 -4
R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 0̲ 0̲ -1 0̲ 0 -1 -4
N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 0̲ 0̲ 3 0̲ 0 -1 -4
D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 0̲ 0̲ 4 0̲ 1 -1 -4
C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 0̲ 0̲ -3 0̲ -3 -2 -4
Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0̲ 0̲ 0 0̲ 3 -1 -4
E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 0̲ 0̲ 1 0̲ 4 -1 -4
G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 0̲ 0̲ -1 0̲ -2 -1 -4
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0̲ 0̲ 0 0̲ 0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 0̲ 0̲ -3 0̲ -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 0̲ 0̲ -4 0̲ -3 -1 -4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0̲ 0̲ 0 0̲ 1 -1 -4
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 0̲ 0̲ -3 0̲ -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 0̲ 0̲ -3 0̲ -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 0̲ 0̲ -2 0̲ -1 -2 -4
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0̲ 0̲ 0 0̲ 0 0 -4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 0̲ 0̲ -1 0̲ -1 0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 0̲ 0̲ -4 0̲ -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 0̲ 0̲ -3 0̲ -2 -1 -4
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 0̲ 0̲ -3 0̲ -2 -1 -4
O 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲
U 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲
B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 0̲ 0̲ 4 0̲ 1 -1 -4
J 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲
Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 0̲ 0̲ 1 0̲ 4 -1 -4
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 0̲ 0̲ -1 0̲ -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 0̲ 0̲ -4 0̲ -4 -4 1
(underlined values are default ones)
Matrix | Constants |
---|---|
PAM | PAM30 , PAM70 , PAM250 |
BLOSUM | BLOSUM45 , BLOSUM50 , BLOSUM62 , BLOSUM80 , BLOSUM90 |
These matrices are downloaded from: ftp://ftp.ncbi.nih.gov/blast/matrices/.
SubstitutionMatrix
can be modified like a regular matrix:
julia> mysubmat = copy(BLOSUM62); # create a copy
julia> mysubmat[AA_A,AA_R] # score of AA_A => AA_R substitution is -1
-1
julia> mysubmat[AA_A,AA_R] = -3 # set the score to -3
-3
julia> mysubmat[AA_A,AA_R] # the score is modified
-3
Make sure to create a copy of the original matrix when you create a matrix from a predefined matrix. In the above case, BLOSUM62
is shared in the whole program and modification on it will affect any result that uses BLOSUM62
.
DichotomousSubstitutionMatrix
is a specialized matrix for matching or mismatching substitution. This is a preferable choice when performance is more important than flexibility because looking up score is faster than SubstitutionMatrix
.
julia> submat = DichotomousSubstitutionMatrix(1, -1)
DichotomousSubstitutionMatrix{Int64}:
match = 1
mismatch = -1
julia> submat['A','A'] # match
1
julia> submat['A','B'] # mismatch
-1