Pairwise alignment
Overview
Pairwise alignment is a sequence alignment between two sequences. BioAlignments.jl implements several pairwise alignment algorithms that maximize alignment score or minimize alignment cost. If you are interested in handling the results of pairwise alignments, it is highly recommended to read the Alignment representation chapter in advance to get used to the alignment representation.
Alignment types and scoring models
A pairwise alignment problem has two factors: an alignment type and a score/cost model. The alignment type specifies the alignment range (e.g. global, local, etc.) and the score/cost model specifies parameters of the alignment operations.
pairalign
is a function to run alignment, which is exported from the BioAlignments
module. It takes an alignment type as its first argument, then two sequences to align, and finally a score model. Currently, the following four types of alignments are supported:
GlobalAlignment
: global-to-global alignmentSemiGlobalAlignment
: local-to-global alignmentLocalAlignment
: local-to-local alignmentOverlapAlignment
: end-free alignment
For scoring model, AffineGapScoreModel
is currently supported. It imposes an affine gap penalty for insertions and deletions: gap_open + k * gap_extend
for a consecutive insertion/deletion of length k
. The affine gap penalty is flexible enough to create a constant and linear scoring model. Setting gap_extend = 0
or gap_open = 0
, they are equivalent to the constant or linear gap penalty, respectively. The first argument of AffineGapScoreModel
can be a substitution matrix like AffineGapScoreModel(BLOSUM62, gap_open=-10, gap_extend=-1)
. For details on substitution matrices, see the Substitution matrix types section.
Alignment type can also be a distance of two sequences:
EditDistance
LevenshteinDistance
HammingDistance
In this alignment, CostModel
is used instead of AffineGapScoreModel
to define cost of substitution, insertion, and deletion:
julia> costmodel = CostModel(match=0, mismatch=1, insertion=1, deletion=1);
julia> pairalign(EditDistance(), "abcd", "adcde", costmodel)
PairwiseAlignmentResult{Int64, String, String}:
distance: 2
seq: 1 abcd- 4
| ||
ref: 1 adcde 5
Operations on pairwise alignment
The example below shows a use case of some operations:
julia> s1 = dna"CCTAGGAGGG";
julia> s2 = dna"ACCTGGTATGATAGCG";
julia> scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=-1);
julia> res = pairalign(GlobalAlignment(), s1, s2, scoremodel) # run pairwise alignment
PairwiseAlignmentResult{Int64, BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}:
score: 13
seq: 0 -CCTAGG------AGGG 10
||| || || |
ref: 1 ACCT-GGTATGATAGCG 16
julia> score(res) # get the achieved score of this alignment
13
julia> aln = alignment(res)
PairwiseAlignment{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}:
seq: 0 -CCTAGG------AGGG 10
||| || || |
ref: 1 ACCT-GGTATGATAGCG 16
julia> sequence(res)
10nt DNA Sequence:
CCTAGGAGGG
julia> count_matches(aln)
8
julia> count_mismatches(aln)
1
julia> count_insertions(aln)
1
julia> count_deletions(aln)
7
julia> count_aligned(aln)
17
julia> collect(aln) # pairwise alignment is iterable
17-element Vector{Tuple{DNA, DNA}}:
(DNA_Gap, DNA_A)
(DNA_C, DNA_C)
(DNA_C, DNA_C)
(DNA_T, DNA_T)
(DNA_A, DNA_Gap)
(DNA_G, DNA_G)
(DNA_G, DNA_G)
(DNA_Gap, DNA_T)
(DNA_Gap, DNA_A)
(DNA_Gap, DNA_T)
(DNA_Gap, DNA_G)
(DNA_Gap, DNA_A)
(DNA_Gap, DNA_T)
(DNA_A, DNA_A)
(DNA_G, DNA_G)
(DNA_G, DNA_C)
(DNA_G, DNA_G)
julia> LongDNA{4}([x for (x, _) in aln]) # create aligned `s1` with gaps
17nt DNA Sequence:
-CCTAGG------AGGG
julia> LongDNA{4}([y for (_, y) in aln]) # create aligned `s2` with gaps
17nt DNA Sequence:
ACCT-GGTATGATAGCG
Substitution matrix types
A substitution matrix is a function of substitution score (or cost) from one symbol to other. Substitution value of submat
from x
to y
can be obtained by writing submat[x, y]
. In BioAlignments.jl, SubstitutionMatrix
and DichotomousSubstitutionMatrix
are two distinct types representing substitution matrices.
SubstitutionMatrix
is a general substitution matrix type that is a thin wrapper of regular matrix.
Some common substitution matrices are provided. For DNA and RNA, EDNAFULL
is defined:
julia> EDNAFULL
SubstitutionMatrix{BioSymbols.DNA, Int64}:
A C M G R S V T W Y H K D B N
A 5 -4 1 -4 1 -4 -1 -4 1 -4 -1 -4 -1 -4 -2
C -4 5 1 -4 -4 1 -1 -4 -4 1 -1 -4 -4 -1 -2
M 1 1 -1 -4 -2 -2 -1 -4 -2 -2 -1 -4 -3 -3 -1
G -4 -4 -4 5 1 1 -1 -4 -4 -4 -4 1 -1 -1 -2
R 1 -4 -2 1 -1 -2 -1 -4 -2 -4 -3 -2 -1 -3 -1
S -4 1 -2 1 -2 -1 -1 -4 -4 -2 -3 -2 -3 -1 -1
V -1 -1 -1 -1 -1 -1 -1 -4 -3 -3 -2 -3 -2 -2 -1
T -4 -4 -4 -4 -4 -4 -4 5 1 1 -1 1 -1 -1 -2
W 1 -4 -2 -4 -2 -4 -3 1 -1 -2 -1 -2 -1 -3 -1
Y -4 1 -2 -4 -4 -2 -3 1 -2 -1 -1 -2 -3 -1 -1
H -1 -1 -1 -4 -3 -3 -2 -1 -1 -1 -1 -3 -2 -2 -1
K -4 -4 -4 1 -2 -2 -3 1 -2 -2 -3 -1 -1 -1 -1
D -1 -4 -3 -1 -1 -3 -2 -1 -1 -3 -2 -1 -1 -2 -1
B -4 -1 -3 -1 -3 -1 -2 -1 -3 -1 -2 -1 -2 -1 -1
N -2 -2 -1 -2 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1
(underlined values are default ones)
For amino acids, PAM (Point Accepted Mutation) and BLOSUM (BLOcks SUbstitution Matrix) matrices are defined:
julia> BLOSUM62
SubstitutionMatrix{BioSymbols.AminoAcid, Int64}:
A R N D C Q E G H I L K M F P S T W Y V O U B J Z X *
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 0̲ 0̲ -2 0̲ -1 0 -4
R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 0̲ 0̲ -1 0̲ 0 -1 -4
N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 0̲ 0̲ 3 0̲ 0 -1 -4
D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 0̲ 0̲ 4 0̲ 1 -1 -4
C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 0̲ 0̲ -3 0̲ -3 -2 -4
Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0̲ 0̲ 0 0̲ 3 -1 -4
E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 0̲ 0̲ 1 0̲ 4 -1 -4
G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 0̲ 0̲ -1 0̲ -2 -1 -4
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0̲ 0̲ 0 0̲ 0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 0̲ 0̲ -3 0̲ -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 0̲ 0̲ -4 0̲ -3 -1 -4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0̲ 0̲ 0 0̲ 1 -1 -4
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 0̲ 0̲ -3 0̲ -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 0̲ 0̲ -3 0̲ -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 0̲ 0̲ -2 0̲ -1 -2 -4
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0̲ 0̲ 0 0̲ 0 0 -4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 0̲ 0̲ -1 0̲ -1 0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 0̲ 0̲ -4 0̲ -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 0̲ 0̲ -3 0̲ -2 -1 -4
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 0̲ 0̲ -3 0̲ -2 -1 -4
O 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲
U 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲
B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 0̲ 0̲ 4 0̲ 1 -1 -4
J 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲ 0̲
Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 0̲ 0̲ 1 0̲ 4 -1 -4
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 0̲ 0̲ -1 0̲ -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 0̲ 0̲ -4 0̲ -4 -4 1
(underlined values are default ones)
PAM and BLOSUM matrices are downloaded from: ftp://ftp.ncbi.nih.gov/blast/matrices/.
CHEMICAL is another kind of substitution matrix for amino acids. As of now, it consists of a single matrix, GRANTHAM1974
, which is taken from R. Grantham's 1974 paper in which the distance between any two amino acids (of the basic 20 amino acids) is based on three properties: composition, polarity, and molecular volume. Based on the original average score in the table, we have taken ourselves the liberty to define default_mismatch = 100
as a substitution score between two amino acids if any of them is not present in that table.
Matrix | Constants |
---|---|
PAM | PAM30 , PAM70 , PAM250 |
BLOSUM | BLOSUM45 , BLOSUM50 , BLOSUM62 , BLOSUM80 , BLOSUM90 |
CHEMICAL | GRANTHAM1974 |
SubstitutionMatrix
can be modified like a regular matrix:
julia> mysubmat = copy(BLOSUM62); # create a copy
julia> mysubmat[AA_A, AA_R] # score of AA_A => AA_R substitution is -1
-1
julia> mysubmat[AA_A, AA_R] = -3 # set the score to -3
-3
julia> mysubmat[AA_A, AA_R] # the score is modified
-3
Make sure to create a copy of the original matrix when you create a matrix from a predefined matrix. In the above case, BLOSUM62
is shared in the whole program and modification on it will affect any result that uses BLOSUM62
.
Other supported operations include similar
, isassigned
, and BioSymbols.alphabet
:
julia> reshape(alphabet(BLOSUM62), 1, :)
1×27 Matrix{AminoAcid}:
AA_A AA_R AA_N AA_D AA_C AA_Q AA_E … AA_B AA_J AA_Z AA_X AA_Term
As shown, there is no entry corresponding to AA_Gap
.
DichotomousSubstitutionMatrix
is a specialized matrix for matching or mismatching substitution. This is a preferable choice when performance is more important than flexibility because looking up score is faster than SubstitutionMatrix
.
julia> submat = DichotomousSubstitutionMatrix(1, -1)
DichotomousSubstitutionMatrix{Int64}:
match = 1
mismatch = -1
julia> submat['A', 'A'] # match
1
julia> submat['A', 'B'] # mismatch
-1