As mentioned in the previous tutorial, in this chapter, we will learn about alignments. We will explore pairwise alignment as a tool to compare two copies of the mecA gene found on NCBI.

BioAlignments Implements Only Pairwise Alignment

On the most basic level, aligners use algorithms to "line up" sequences and look for regions of similarity.

BioAlignments implements pairwise alignment. Pairwise alignment differs from multiple sequence alignment (MSA) because it only aligns two sequences, while MSAs align any number of sequences. There is not currently a MSA package in Julia.

Pairwise alignment also assumes that the two sequences are roughly homologous. For example, you may use it to align two versions of the same gene. It is not used to map reads to a genome -- mapping would be a better solution for that. If mapping is your goal, you can use a mapper like minimap2 and parse the result with PairwiseMappingFormat.jl.

Running the Alignment

There are two main parameters for determining how we want to perform our alignment: alignment type and score/cost model.

The alignment type specifies the alignment range (local vs global alignment) and the score/cost model explains how to score matches/mismatches in the sequences that are being compared.

Alignment Types

Currently, four types of alignments are supported:

The alignment type should be selected based on what is already known about the sequences the user is comparing:

Cost Model

The cost model provides a way to calculate penalties for differences between the two sequences, and then finds the alignment that minimizes the total penalty. AffineGapScoreModel is the scoring model currently supported by BioAlignments.jl. It imposes an affine gap penalty for insertions and deletions, which means that it penalizes the opening of a gap more than a gap extension. Deletions are rare mutations, but if there's a deletion, the length of the deletion is variable. Longer deletions are less likely than short ones only because they change the structure of the encoded protein more.

A user can also define their own CostModel instead of using AffineGapScoreModel. This will allow the user to define their own scoring scheme for penalizing insertions, deletions, and substitutions.

After the cost model is defined, a distance metric is used to quantify and minimize the "cost" (difference) between the two sequences.

These distance metrics are currently supported:

More information can be found in the BioAlignments documentation about the cost model here.

Just like alignment type, the cost model should be selected based on what the user is optimizing for and what is known about the two sequences.

Calling BioAlignments to Run the Alignment

Now that we have a good understanding of how pairalign works, let's run an example!

In this example, we'll compare two similar genes: mecA found in S. aureus (link to gene on NCBI here), and a homologue, mecA1, found in S. sciuri (link to gene on NCBI here). These genes are found in two Staph species that are closely related and in the same family (Staphylococcaceae).

Because we are comparing homologous genes from two closely related species, we wouldn't expect too many differences. Although mecA1 doesn't confer resistance to beta-lactams in S. sciuri like mecA does to S. aureus, the gene should be mostly conserved. In fact, mecA1 is considered a precursor to mecA 1. Research indicates that there is 80% nucleotide identity between the two genes 1. Due to the similarity in the genes we are comparing, it makes the most sense to run a global alignment.

In this first example, we'll align two strings that contain the genes.

Aligning BioSequences Object

using BioAlignments

mecA = 
"ATGAAAAAGATAAAAATTGTTCCACTTATTTTAATAGTTGTAGTTGTCGGGTTTGGTATATATTTTTATGCTTCAAAAGATAAAGAAATTAATAATACTATTGATGCAATTGAAGATAAAAATTTCAAACAAGTTTATAAAGATAGCAGTTATATTTCTAAAAGCGATAATGGTGAAGTAGAAATGACTGAACGTCCGATAAAAATATATAATAGTTTAGGCGTTAAAGATATAAACATTCAGGATCGTAAAATAAAAAAAGTATCTAAAAATAAAAAACGAGTAGATGCTCAATATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAATTTAATTTTGTTAAAGAAGATGGTATGTGGAAGTTAGATTGGGATCATAGCGTCATTATTCCAGGAATGCAGAAAGACCAAAGCATACATATTGAAAATTTAAAATCAGAACGTGGTAAAATTTTAGACCGAAACAATGTGGAATTGGCCAATACAGGAACAGCATATGAGATAGGCATCGTTCCAAAGAATGTATCTAAAAAAGATTATAAAGCAATCGCTAAAGAACTAAGTATTTCTGAAGACTATATCAAACAACAAATGGATCAAAATTGGGTACAAGATGATACCTTCGTTCCACTTAAAACCGTTAAAAAAATGGATGAATATTTAAGTGATTTCGCAAAAAAATTTCATCTTACAACTAATGAAACAAAAAGTCGTAACTATCCTCTAGAAAAAGCGACTTCACATCTATTAGGTTATGTTGGTCCCATTAACTCTGAAGAATTAAAACAAAAAGAATATAAAGGCTATAAAGATGATGCAGTTATTGGTAAAAAGGGACTCGAAAAACTTTACGATAAAAAGCTCCAACATGAAGATGGCTATCGTGTCACAATCGTTGACGATAATAGCAATACAATCGCACATACATTAATAGAGAAAAAGAAAAAAGATGGCAAAGATATTCAACTAACTATTGATGCTAAAGTTCAAAAGAGTATTTATAACAACATGAAAAATGATTATGGCTCAGGTACTGCTATCCACCCTCAAACAGGTGAATTATTAGCACTTGTAAGCACACCTTCATATGACGTCTATCCATTTATGTATGGCATGAGTAACGAAGAATATAATAAATTAACCGAAGATAAAAAAGAACCTCTGCTCAACAAGTTCCAGATTACAACTTCACCAGGTTCAACTCAAAAAATATTAACAGCAATGATTGGGTTAAATAACAAAACATTAGACGATAAAACAAGTTATAAAATCGATGGTAAAGGTTGGCAAAAAGATAAATCTTGGGGTGGTTACAACGTTACAAGATATGAAGTGGTAAATGGTAATATCGACTTAAAACAAGCAATAGAATCATCAGATAACATTTTCTTTGCTAGAGTAGCACTCGAATTAGGCAGTAAGAAATTTGAAAAAGGCATGAAAAAACTAGGTGTTGGTGAAGATATACCAAGTGATTATCCATTTTATAATGCTCAAATTTCAAACAAAAATTTAGATAATGAAATATTATTAGCTGATTCAGGTTACGGACAAGGTGAAATACTGATTAACCCAGTACAGATCCTTTCAATCTATAGCGCATTAGAAAATAATGGCAATATTAACGCACCTCACTTATTAAAAGACACGAAAAACAAAGTTTGGAAGAAAAATATTATTTCCAAAGAAAATATCAATCTATTAACTGATGGTATGCAACAAGTCGTAAATAAAACACATAAAGAAGATATTTATAGATCTTATGCAAACTTAATTGGCAAATCCGGTACTGCAGAACTCAAAATGAAACAAGGAGAAACTGGCAGACAAATTGGGTGGTTTATATCATATGATAAAGATAATCCAAACATGATGATGGCTATTAATGTTAAAGATGTACAAGATAAAGGAATGGCTAGCTACAATGCCAAAATCTCAGGTAAAGTGTATGATGAGCTATATGAGAACGGTAATAAAAAATACGATATAGATGAATAACAAAACAGTGAAGCAATCCGTAACGATGGTTGCTTCACTGTTTTATTATGAATTATTAATAAGTGCTGTTACTTCTCCCTTAAATACAATTTCTTCATTT"
mecA1 = "ATGAAAAAATTAATCATCGCCATCGTGATTGTAATCATCGCTGTTGGTTCAGGCGTATTCTTTTATGCATCTAAAGATAAGAAAATAAACGAAACAATTGATGCCATTGAAGATAAAAACGTTAAGCAAGTCTTTAAAAATAGTACTTACCAATCTAAAAACGATAATGGTGAAGTAGAAATGACAGACCGCCCTATTAAGATTTATGACAGTCTCGGCGTCAAAGATATCAACATTAAAGATCGTGATATCAAAAAGGTTTCGAAAAACAAAAAACAAGTCACAGCAAAGTATGAACTTCAAACGAATTACGGCAAAATTAATCGTGACGTTAAATTAAACTTTATTAAAGAAGATAAAGATTGGAAATTGGATTGGAATCAAAATGCCATTATTCCAGGCATGAAGAAAAATCAATCCATCAATATTGAACCATTGAAATCAGAACGAGGTAAGATTTTAGACAGGAACAATGTAGAGTTAGCCACTACAGGAACAACACATGAAGTTGGTATTGTTCCTAATAATGTTTCCACAAGTGATTACAAAGCAATCGCTGAAAAGTTAGACCTTTCAGAATCGTATATTAAACAGCAAACAGAACAGGATTGGGTTAAAGATGATACATTCGTCCCTCTCAAGACTGTTCAAGATATGAATCAAGATTTAAAGAATTTTGTTGAAAAGTATCATCTCACATCACAGGAAACAGAAAGTCGACAGTATCCGCTTGAAGAAGCAACAACGCACTTACTTGGATATGTTGGCCCTATTAATTCAGAAGAATTGAAGCAAAAAGCATTTAAAGGTTATAAAAAGGATGCCATCGTTGGTAAAAAAGGTATCGAAAAACTATACGATAAAGACCTTCAAAATAAAGACGGATACCGTGTCACAATAATTGATGATAATAATAAAGTTATTGATACATTAATAGAGAAAAAGAAAATAGACGGCAAAGATATTAAATTAACCATTGATGCTAGAGTCCAAAAAAGTATTTATAACAACATGAAAGATGACTACGGTTCGGGGACTGCTATTCATCCACAAACTGGTGAACTCTTAGCACTTGTCAGCACGCCATCTTATGATGTTTATCCATTTATGAATGGAATGAGCGATGAAGATTATAAGAAATTAACTGAAGATGATAAAGAGCCACTCCTTAATAAGTTCCAAATTACGACATCACCAGGTTCGACTCAAAAAATATTAACAGCCATGATTGGCTTAAACAATAAGACATTAGACGGCAAAACAAGTTATAAAATTAATGGAAAAGGTTGGCAAAAAGATAAATCTTGGGGTGACTACAACGTTACAAGATACGAAGTTGTGAATGCCGATATCGACTTAAAACAAGCTATTGAATCATCAGATAATATCTTCTTTGCGAGAGTTGCACTTGAATTAGGCAGCAAAAAATTCGAAGAAGGAATGAAACGCCTTGGTGTTGGTGAAGATATCCCGAGTGATTATCCATTCTACAATGCACAAATTTCAAATAAGAACTTAGATAATGAAATATTGTTAGCTGACTCAGGTTATGGCCAAGGTGAAATATTAATCAATCCTGTTCAAATTCTTTCAATATACAGCGCATTAGAGAACAAAGGTAATGTGAATGCACCACATGTACTCAAAGATACGAAAAATAAAGTCTGGAAGAAGAACATCATTTCCCAGGAAAATATTAAATTGTTAACAGACGGTATGCAACAAGTCGTGAACAAAACACATAGAGAAGATATTTATAGATCATATGCCAACTTAGTTGGTAAATCAGGTACAGCTGAACTCAAGATGAAACAAGGTGAGACAGGACAACAAATAGGTTGGTTCATTTCATATGATAAAGATAATCCAAATATAATGATGGCTATTAATGTGAAAGATGTACAAGATAAAGGCATGGCAAGTTACAATGCCAAAATATCTGGAAAAGTGTATGACGATTTATATGATAACGGTAAGAAAACGTATCGTATTGATAAATAA"
scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=-1);

# run pairwise alignment
res = pairalign(GlobalAlignment(), mecA, mecA1, scoremodel)  

Aligning FASTX files

In this next example, we'll repeat the same alignment, but read in the files directly from the FASTA files containing the gene. Running the alignment on strings is straightforward with short sequences, but when comparing entire genes, simply reading in the file is easier.

using BioSequences
using FASTX

# Function to get a sequence from a FASTA file with one record
function fasta_sequence(fasta_path)
    record = open(FASTA.Reader, fasta_path) do reader
        first(reader)
    end
    # extract sequence and convert to BioSequences DNA object
    seq = LongDNA{4}(String(FASTX.sequence(record)))
    return (seq)
end

mecA_fasta = fasta_sequence("assets/mecA.fasta")
mecA1_fasta = fasta_sequence("assets/mecA1.fasta")

 # run pairwise alignment
res_fasta = pairalign(GlobalAlignment(), mecA_fasta, mecA1_fasta, scoremodel) 

Understanding How Alignments Are Represented

The output of an alignment is a series of AlignmentAnchor objects. This data structure gives information on the position of the start of the alignment, sections where nucleotides match, as well as where there may be deletions or insertions.

Below is an example alignment:

julia> Alignment([
           AlignmentAnchor(0,  4, 0, OP_START),
           AlignmentAnchor(4,  8, 4, OP_MATCH),
           AlignmentAnchor(4, 12, 8, OP_DELETE)
       ])

In this example, the alignment starts at position 0 for the query sequence and position 4 for the reference sequence. Although the Julia programming language typically uses 1-based indexing, this package uses position 0 to refer to the first position. The next nucleotides match in the query and reference sequences. The last 8 nucleotides in the alignment are deleted in the query sequence.

To learn more about the output of the alignment created using BioAlignments.jl, see here.

Interpreting the Example Output

Here is the output from our example aligning the mecA and mecA1 genes:

res

PairwiseAlignmentResult{Int64, String, String}:
  score: 6375
  seq:    1 ATGAAAAAGATAAAA-ATTGTTCCA-CTT-ATTTTAAT-A-----GTTGTAGTTGTCGGG   51
            |||||||| || ||  || |  ||| | | ||| |||| |     ||||  |||  | ||
  ref:    1 ATGAAAAA-ATTAATCATCG--CCATCGTGATTGTAATCATCGCTGTTG--GTT--CAGG   53

  seq:   52 TTTGGTATATAT-TTTTATGCTTCAAAAGATAAAGAAATTAAT--AATACTATTGATGCA  108
                ||||   | |||||||| || |||||||| |||| |||   || || |||||||| 
  ref:   54 C---GTAT---TCTTTTATGCATCTAAAGATAA-GAAAATAAACGAA-ACAATTGATGCC  105

  seq:  109 ATTGAAGATAAAAA--TTTCAAACAAGT-TTATAAAGATAGCAGTTAT--ATTTCTAAAA  163
            ||||||||||||||  ||  || ||||| || |||| |||| | |||   ||  ||||||
  ref:  106 ATTGAAGATAAAAACGTT--AAGCAAGTCTT-TAAAAATAGTACTTACCAAT--CTAAAA  160

  seq:  164 GCGATAATGGTGAAGTAGAAATGACTGAACGTCCGATAAAAATATATAATAGTTTAGGCG  223
             |||||||||||||||||||||||| || || || || || || ||| | ||| | ||||
  ref:  161 ACGATAATGGTGAAGTAGAAATGACAGACCGCCCTATTAAGATTTATGACAGTCTCGGCG  220

  seq:  224 TTAAAGATATAAACATTCAGGATCGTAAAATAAAAAAAGTATCTAAAAATAAAAAACGAG  283
            | |||||||| |||||| | |||||| | || ||||| || || ||||| ||||||| ||
  ref:  221 TCAAAGATATCAACATTAAAGATCGTGATATCAAAAAGGTTTCGAAAAACAAAAAACAAG  280

  seq:  284 T-AGATGCTCAA--TATAAAATTAAAACAAACTACGGTAACATTGATCGCAACGTTCAAT  340
            | | | ||  ||  ||| || || |||| || ||||| || ||| ||||  ||||| |||
  ref:  281 TCACA-GC--AAAGTATGAACTTCAAACGAATTACGGCAAAATTAATCGTGACGTTAAAT  337

  seq:  341 TTAATTTTGTTAAAGAAGAT---GGTATGTGGAAGTTAGATTGGGATCATA--GCGTCAT  395
            | || ||| |||||||||||   |  || ||||| || |||||| |||| |  ||  |||
  ref:  338 TAAACTTTATTAAAGAAGATAAAG--AT-TGGAAATTGGATTGGAATCAAAATGC--CAT  392

  seq:  396 TATTCCAGGAATGCAGAAAGACCAAAGCATACA-TATTGAA--AATTTAAAATCAGAACG  452
            ||||||||| ||| ||||| | |||  ||| || |||||||  |  || |||||||||||
  ref:  393 TATTCCAGGCATGAAGAAAAATCAATCCAT-CAATATTGAACCA--TTGAAATCAGAACG  449

  seq:  453 TGGTAAAATTTTAGACCGAAACAATGTGGAATTGGCCAATACAGGAACAGCATATGA-GA  511
             ||||| ||||||||| | |||||||| || || |||| |||||||||| || |||| | 
  ref:  450 AGGTAAGATTTTAGACAGGAACAATGTAGAGTTAGCCACTACAGGAACAACACATGAAGT  509

  seq:  512 TAGGCATCGTTCCAAAGAATGTATCTAAAAAAGATTATAAAGCAATCGCTAAAGAACTAA  571
            | || || ||||| || ||||| || | ||  ||||| ||||||||||||   |||  ||
  ref:  510 T-GGTATTGTTCCTAATAATGTTTCCACAAGTGATTACAAAGCAATCGCT---GAA--AA  563

  seq:  572 GT-A----TTTCTGAA--GACTATATCAAACAACAAATGGATCAAAATTGGGT-ACAAGA  623
            || |    |||| |||  |  ||||| ||||| ||||  || ||  ||||||| | ||||
  ref:  564 GTTAGACCTTTCAGAATCG--TATATTAAACAGCAAACAGAACAGGATTGGGTTA-AAGA  620

  seq:  624 TGATACCTTCGTTCCACTTAAAACCGTTAAAAAAATGGATGAATATTTAAGTGA-TTTCG  682
            |||||| ||||| || || || || ||| || | ||| || || ||||||  || ||| |
  ref:  621 TGATACATTCGTCCCTCTCAAGACTGTTCAAGATATGAATCAAGATTTAAA-GAATTTTG  679

  seq:  683 CAAAAAAATTTCATCTTACA--ACTAATGAAACAAAAAGTCGTAAC--TATCCTCTAGAA  738
               |||| | |||||| |||  ||  | |||||| |||||||  ||  ||||| || |||
  ref:  680 TTGAAAAGTATCATCTCACATCAC--AGGAAACAGAAAGTCG--ACAGTATCCGCTTGAA  735

  seq:  739 AAAGCGACTTCACATCT-A-TTAGGTTATGTTGGTCCC-ATTAACTCTGAAGAATTAAAA  795
             |||| ||  | || || | || || |||||||| ||| ||||| || |||||||| || 
  ref:  736 GAAGCAACAACGCA-CTTACTT-GGATATGTTGG-CCCTATTAATTCAGAAGAATTGAAG  792

  seq:  796 CAAAAAGAATATAAAGGCTATAAAGATGATGC-AGTTATTGGTAAAAAGGG-ACTCGAAA  853
            ||||||| || |||||| |||||| | ||||| | |  |||||||||| || | ||||||
  ref:  793 CAAAAAGCATTTAAAGGTTATAAAAAGGATGCCA-TCGTTGGTAAAAAAGGTA-TCGAAA  850

  seq:  854 AACTTTACGATAAAAAGCTCCAACATGAAGATGGCTATCGTGTCACAATCGTTGACGATA  913
            |||| ||||||||| | || ||| || |||| || || |||||||||||  |||| ||||
  ref:  851 AACTATACGATAAAGACCTTCAAAATAAAGACGGATACCGTGTCACAATAATTGATGATA  910

  seq:  914 ATAGCAATACA---ATCGCACATACATTAATAGAGAAAAAGAAAAAAGATGGCAAAGATA  970
            |||   ||| |   || |   |||||||||||||||||||||||| ||| ||||||||||
  ref:  911 ATA---ATAAAGTTATTG---ATACATTAATAGAGAAAAAGAAAATAGACGGCAAAGATA  964

  seq:  971 TTCAACTAACTATTGATGCTAAAGTTCAAAAGAGTATTTATAACAACATGAAAAATGATT 1030
            || || |||| |||||||||| ||| ||||| ||||||||||||||||||||| |||| |
  ref:  965 TTAAATTAACCATTGATGCTAGAGTCCAAAAAAGTATTTATAACAACATGAAAGATGACT 1024

  seq: 1031 ATGGCTCAGGTACTGCTATCCACCCTCAAACAGGTGAATTATTAGCACTTGTAAGCACAC 1090
            | || || || |||||||| || || ||||| |||||| | ||||||||||| ||||| |
  ref: 1025 ACGGTTCGGGGACTGCTATTCATCCACAAACTGGTGAACTCTTAGCACTTGTCAGCACGC 1084

  seq: 1091 CTTCATATGACGTCTATCCATTTATGTATGGCATGAGTAACGAAGAATATAATAAATTAA 1150
            | || ||||| || |||||||||||| |||| |||||  | ||||| ||||| |||||||
  ref: 1085 CATCTTATGATGTTTATCCATTTATGAATGGAATGAGCGATGAAGATTATAAGAAATTAA 1144

  seq: 1151 CCGAAGATAAAAAAGAACCTCTGCTCAACAAGTTCCAGATTACAACTTCACCAGGTTCAA 1210
            | |||||| | ||||| || || || || |||||||| ||||| || ||||||||||| |
  ref: 1145 CTGAAGATGATAAAGAGCCACTCCTTAATAAGTTCCAAATTACGACATCACCAGGTTCGA 1204

  seq: 1211 CTCAAAAAATATTAACAGCAATGATTGGGTTAAATAACAAAACATTAGACGATAAAACAA 1270
            ||||||||||||||||||| |||||||| ||||| || || ||||||||||  |||||||
  ref: 1205 CTCAAAAAATATTAACAGCCATGATTGGCTTAAACAATAAGACATTAGACGGCAAAACAA 1264

  seq: 1271 GTTATAAAATCGATGGTAAAGGTTGGCAAAAAGATAAATCTTGGGGTGGTTACAACGTTA 1330
            ||||||||||  |||| |||||||||||||||||||||||||||||||  ||||||||||
  ref: 1265 GTTATAAAATTAATGGAAAAGGTTGGCAAAAAGATAAATCTTGGGGTGACTACAACGTTA 1324

  seq: 1331 CAAGATATGAAGTGGTAAATG--GTAATATCGACTTAAAACAAGCAATAGAATCATCAGA 1388
            ||||||| ||||| || ||||  |  ||||||||||||||||||| || |||||||||||
  ref: 1325 CAAGATACGAAGTTGTGAATGCCG--ATATCGACTTAAAACAAGCTATTGAATCATCAGA 1382

  seq: 1389 TAACATTTTCTTTGCTAGAGTAGCACTCGAATTAGGCAGTAAGAAATTTGAAAAAGGCAT 1448
            ||| || |||||||| ||||| ||||| ||||||||||| || ||||| ||| |||| ||
  ref: 1383 TAATATCTTCTTTGCGAGAGTTGCACTTGAATTAGGCAGCAAAAAATTCGAAGAAGGAAT 1442

  seq: 1449 GAAAAAACTAGGTGTTGGTGAAGATATACCAAGTGATTATCCATTTTATAATGCTCAAAT 1508
            ||||   || ||||||||||||||||| || |||||||||||||| || ||||| |||||
  ref: 1443 GAAACGCCTTGGTGTTGGTGAAGATATCCCGAGTGATTATCCATTCTACAATGCACAAAT 1502

  seq: 1509 TTCAAACAAAAATTTAGATAATGAAATATTATTAGCTGATTCAGGTTACGGACAAGGTGA 1568
            |||||| || || ||||||||||||||||| |||||||| |||||||| || ||||||||
  ref: 1503 TTCAAATAAGAACTTAGATAATGAAATATTGTTAGCTGACTCAGGTTATGGCCAAGGTGA 1562

  seq: 1569 AATACTGATTAACCCAGTACAGATCCTTTCAATCTATAGCGCATTAGAAAATAATGGCAA 1628
            |||| | || || || || || || |||||||| || ||||||||||| || || || ||
  ref: 1563 AATATTAATCAATCCTGTTCAAATTCTTTCAATATACAGCGCATTAGAGAACAAAGGTAA 1622

  seq: 1629 TATTAACGCACCTCACT-TATTAAAAGACACGAAAAACAAAGTTTGGAAGAAAAATATTA 1687
            | | || ||||| || | || | ||||| |||||||| ||||| |||||||| || || |
  ref: 1623 TGTGAATGCACCACA-TGTACTCAAAGATACGAAAAATAAAGTCTGGAAGAAGAACATCA 1681

  seq: 1688 TTTCCAAAGAAAATATCAA-TCTATTAACTGATGGTATGCAACAAGTCGTAAATAAAACA 1746
            ||||| | |||||||| || | | ||||| || ||||||||||||||||| || ||||||
  ref: 1682 TTTCCCAGGAAAATATTAAAT-TGTTAACAGACGGTATGCAACAAGTCGTGAACAAAACA 1740

  seq: 1747 CATAAAGAAGATATTTATAGATCTTATGCAAACTTAATTGGCAAATCCGGTACTGCAGAA 1806
            |||| |||||||||||||||||| ||||| |||||| |||| ||||| ||||| || |||
  ref: 1741 CATAGAGAAGATATTTATAGATCATATGCCAACTTAGTTGGTAAATCAGGTACAGCTGAA 1800

  seq: 1807 CTCAAAATGAAACAAGGAGAAACTGG-CAGACAAATTGGGTGGTTTATATCATATGATAA 1865
            ||||| ||||||||||| || || || || |||||| || ||||| || |||||||||||
  ref: 1801 CTCAAGATGAAACAAGGTGAGACAGGACA-ACAAATAGGTTGGTTCATTTCATATGATAA 1859

  seq: 1866 AGATAATCCAAACATGATGATGGCTATTAATGTTAAAGATGTACAAGATAAAGGAATGGC 1925
            |||||||||||| || ||||||||||||||||| |||||||||||||||||||| |||||
  ref: 1860 AGATAATCCAAATATAATGATGGCTATTAATGTGAAAGATGTACAAGATAAAGGCATGGC 1919

  seq: 1926 TAGCTACAATGCCAAAATCTCAGGTAAAGTGTATGATGAGCTATATGAGAACGGTAATAA 1985
             || |||||||||||||| || || ||||||||||| ||  ||||||| |||||||| ||
  ref: 1920 AAGTTACAATGCCAAAATATCTGGAAAAGTGTATGACGATTTATATGATAACGGTAAGAA 1979

  seq: 1986 AAAATACGATATAGATGAATAACAAAACAGTGAAGCAATCCGTAACGATGGTTGCTTCAC 2045
            ||                                      |||| ||             
  ref: 1980 AA--------------------------------------CGTATCG------------- 1988

  seq: 2046 TGTTTTATTATGAATTATTAATAAGTGCTGTTACTTCTCCCTTAAATACAATTTCTTCAT 2105
                 ||||  ||  ||  |||||                                    
  ref: 1988 -----TATT--GA--TA--AATAA------------------------------------ 2001

  seq: 2106 TT 2107
              
  ref: 2001 -- 2001

The alignment score is entirely dependent on the scoring scheme (how we penalized gaps, gap extensions and rewarded matches). It is not an absolute number that we can compare from alignment to alignment. In our example, our score was influenced by a -5 penalty for the start of a gap, and a -1 penalty for a gap extension. If these values were to change in our cost model, this would affect the final score, even if the sequences were the same. However, longer alignments generally produce larger scores (as there are more bases being compared).

Overall, the two sequences are homologous over most of their length. There are many matches, but also frequent small indels and substitutions. The biggest mismatch is in a section toward the end (from base 1980 in the reference onwards), where there are large stretches that are missing in the reference sequence (mecA1).

Citations