Introduction to BLAST

A BLAST search allows you to query a sequence (either nucleotide or protein) against an entire database of sequences. It can quickly compare unknown sequences to databases of established reference sequences for purposes such as species identity or assignment gene function.

More information about how to use BLAST can be found in its manual.

BLAST searches can be run from the command line interface (CLI) or through the BLAST web page here. A user can simply copy in a protein or nucleotide sequence and search against NCBI to find the best match! While searching from the website is fast and straightforward, running searches from the website only performs searches against the NCBI databases. The CLI allows users to query both NCBI databases and custom databases.

NCBIBlast.jl is a thin wrapper around the BLAST command line tool, allowing users to run the tool within Julia. The following BLAST tools are supported by NCBIBlast:

A benefit of using the package is that it uses Julia's powerful BinaryBuilder.jl architechture to bundle the BLAST+ executables so that you don't have to deal with that. In other words, if you do add NCBIBLAST.jl, you don't need to have the command line tools available.

Note: BioTools BLAST is a deprecated Julia package for running BLAST searches and is different from NCBIBLAST.

How NCBIBlast.jl works

The keywords used in the tool are sent to the shell for running BLAST.

As stated on the GitHub docs, the Julia call

blastn(; query = "a_file.txt", db="mydb", out="results.txt")

is sent to the shell as

$ blastn -query a_file.txt -db mydb -out results.txt

BLAST databases

Just like running a BLAST search from the CLI, NCBIBlast.jl requires a BLAST database to search against. The user can build a local, custom database, or search against a specific NCBI database. A custom BLAST database is constructed from FASTA files that serve as reference sequences. A database can be built using the following command in NCBIBlast.jl:

makeblastdb(; in="test/example_files/dna2.fasta", dbtype="nucl")

More directions on building a BLAST database locally can be found here.

For our first example, we will replicate the example on the NCBIBlast.jl GitHub repository.

First, we will build a local database using a FASTA file found in the NCBIBlast github repository (link here). This file has been downloaded into assets as dna2.fasta.

makeblastdb(; in="assets/dna2.fasta", dbtype="nucl")

Building a new DB, current time: 03/16/2026 21:04:36
New DB name:   /LOCAL/PATH/BioTutorials/cookbook/assets/dna2.fasta
New DB title:  assets/dna2.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 2 sequences in 0.0012269 seconds.


Process(`/Users/USER/.julia/artifacts/0406b91031ce302fa9117606d007d04635279fef/ncbi-blast-2.16.0+/bin/makeblastdb -in assets/dna2.fasta -dbtype nucl`, ProcessExited(0))

A new database was built in assets.

Now, we can define our query sequence. We can save the query string in memory (using IOBuffer) rather than reading in a FASTA file.

buf = IOBuffer("TTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAG")

Now, we can run the BLAST search. The BLAST output format "6" means that the output will be tab-delimited text with no column names.

The BLAST output will be written into I/O.

io = IOBuffer();
blastn(buf; stdout=io, db="assets/dna2.fasta", outfmt="6");
seek(io, 0);

The command seek(io,0) moves the cursor to the start of the captured object (index 0) so it can be read into a dataframe via the DataFrames.jl package.

using CSV, DataFrames
CSV.read(io, DataFrame; header=false)

1×12 DataFrame
 Row │ Column1  Column2  Column3  Column4  Column5  Column6  Column7  Column8  Column9  Column10  Column11  Column12 
     │ String7  String7  Float64  Int64    Int64    Int64    Int64    Int64    Int64    Int64     Float64   Float64  
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Query_1  Test1      100.0       38        0        0        1       38       82       119  5.64e-18      71.3

Interpreting BLAST Output

This output tells us that the query sequence (Query_1 is the default name for the sequence because we didn't specify a name) matches Test1 in the reference database. There is 100% identity between the query and a region on Test1 that is 38 nucleotides long. There are 0 mismatches or gap openings. The match starts at index 1 on the query sequence, and ends at index 82. This region matches a region in the Test1 sequence spanning from index 82 to 119. The E-value is 5.64e-18, meaning that it is extremely unlikely that this match occurred simply due to chance.

Here is a description of the E-value from the NCBI website:

The Expect value (E) is a parameter that describes the number of hits one can “expect” to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. The lower the E-value the more “significant” the match is. However keep in mind that virtually identical short alignments have relatively high E values. This is because the calculation of the E value takes into account the length of the query sequence. These high E values make sense because shorter sequences have a higher probability of occurring in the database purely by chance.

The bitscore is 71.3.

Here is a definition of bitscore from the NCBI BLAST glossary:

The bit score, S', is derived from the raw alignment score, S, taking the statistical properties of the scoring system into account. Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.

Example: BLASTing the mecA1 gene against all of NCBI

Now that we've tried BLAST'ing against a local, custom database, let's try BLAST'ing a piece of the mecA gene against NCBI. To create the query file mecA_BLAST.fasta, I randomly selected 140 nucleotides from mecA.fasta.

We should see that the query FASTA is a direct hit to the mecA gene (one of the NCBI hits should definitely be the NCBI sample NG_047945.1, which is the sample the gene fragment was extracted from).

For this BLAST search, I will search against the core_nt database, which is a faster, smaller, and more focused subset of the traditional nt (nucleotide) database. This newer database is the default as of August 2024. It seeks to reduce redundancy and storage requirements when downloading. More information about it can be found here.

General information about the different kinds of BLAST databases is also available here.

io = IOBuffer();
blastn("assets/mecA_BLAST.fasta"; db="core_nt", stdout=io, remote=true, outfmt="6 query subject expect")
seek(io, 0);
CSV.read(io, DataFrame; header=false)

Here is the output of the BLAST.

julia> CSV.read(io, DataFrame; header=false)
500×12 DataFrame
 Row │ Column1     Column2     Column3  Column4  Column5  Column6  Column7  Column8  Column9  Column10  Column11  Column12 
     │ String15    String15    Float64  Int64    Int64    Int64    Int64    Int64    Int64    Int64     Float64   Int64    
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ mecA_BLAST  CP026646.1    100.0      140        0        0        1      140    46719     46580  7.12e-65       259
   2 │ mecA_BLAST  CP154497.1    100.0      140        0        0        1      140    45702     45563  7.12e-65       259
   3 │ mecA_BLAST  CP049499.1    100.0      140        0        0        1      140    61014     60875  7.12e-65       259
   4 │ mecA_BLAST  CP030403.1    100.0      140        0        0        1      140    48633     48494  7.12e-65       259
   5 │ mecA_BLAST  CP132494.1    100.0      140        0        0        1      140  1867742   1867603  7.12e-65       259
   6 │ mecA_BLAST  MH798864.1    100.0      140        0        0        1      140      281       420  7.12e-65       259
   7 │ mecA_BLAST  CP162442.1    100.0      140        0        0        1      140   503005    503144  7.12e-65       259
   8 │ mecA_BLAST  OR082611.1    100.0      140        0        0        1      140     6415      6276  7.12e-65       259
   9 │ mecA_BLAST  CP053183.1    100.0      140        0        0        1      140    41607     41468  7.12e-65       259
  10 │ mecA_BLAST  CP085310.1    100.0      140        0        0        1      140  1610215   1610076  7.12e-65       259
  11 │ mecA_BLAST  CP162465.1    100.0      140        0        0        1      140  1140196   1140057  7.12e-65       259
  12 │ mecA_BLAST  CP133660.1    100.0      140        0        0        1      140  2821019   2821158  7.12e-65       259
  13 │ mecA_BLAST  CP049476.1    100.0      140        0        0        1      140    40314     40175  7.12e-65       259
  ⋮  │     ⋮           ⋮          ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮         ⋮         ⋮
 489 │ mecA_BLAST  CP093527.1    100.0      140        0        0        1      140    42814     42675  7.12e-65       259
 490 │ mecA_BLAST  CP035541.1    100.0      140        0        0        1      140    72431     72292  7.12e-65       259
 491 │ mecA_BLAST  CP145216.2    100.0      140        0        0        1      140    45830     45691  7.12e-65       259
 492 │ mecA_BLAST  CP193734.1    100.0      140        0        0        1      140    64595     64456  7.12e-65       259
 493 │ mecA_BLAST  CP083210.2    100.0      140        0        0        1      140    43331     43192  7.12e-65       259
 494 │ mecA_BLAST  MW052033.1    100.0      140        0        0        1      140      245       384  7.12e-65       259
 495 │ mecA_BLAST  CP168087.1    100.0      140        0        0        1      140    40806     40667  7.12e-65       259
 496 │ mecA_BLAST  CP150769.1    100.0      140        0        0        1      140  2583517   2583378  7.12e-65       259
 497 │ mecA_BLAST  CP030596.1    100.0      140        0        0        1      140    40848     40709  7.12e-65       259
 498 │ mecA_BLAST  MZ398128.1    100.0      140        0        0        1      140    16977     17116  7.12e-65       259
 499 │ mecA_BLAST  CP083199.2    100.0      140        0        0        1      140    40078     39939  7.12e-65       259
 500 │ mecA_BLAST  CP162663.1    100.0      140        0        0        1      140   938360    938499  7.12e-65       259
                                                                                                           475 rows omitted

There are 500 hits with the same E-value and Bitscore. This likely means that this sequence is an exact match to these 500 sequences in NCBI. Because of this, the first row in the results is not necessarily a better match than the 500th, even though it appears first.

To verify the first hit, we can look up the GenBankID of the first row: CP026646.1. The NCBI page for this sample confirms that this sample was phenotyped as S. aureus. Our query matches indices 46719 to 46580 on this reference genome. When we use the Graphics feature to visualize gene annotations on the reference genome, we see that there is a clear match to mecA in the region that matches the query.

BLAST Graphics

Overall, this confirms that our BLAST worked as corrected!