edit

High-throughput Sequencing

Overview

High-throughput sequencing (HTS) technologies generate a large amount of data in the form of a large number of nucleotide sequencing reads. One of the most common tasks in bioinformatics is to align these reads against known reference genomes, chromosomes, or contigs. The BioAlignments module provides several data formats commonly used for this kind of task.

SAM and BAM file formats

SAM and BAM are the most popular file formats and have the same reading and writing interface as all other formats in Bio.jl. A typical code iterating over all records in a file looks like below:

# import the SAM and BAM module
using BioAlignments

# open a BAM file
reader = open(BAM.Reader, "data.bam")

# iterate over BAM records
for record in reader
    # `record` is a BAM.Record object
    if BAM.ismapped(record)
        # print mapped position
        println(BAM.refname(record), ':', BAM.position(record))
    end
end

# close the BAM file
close(reader)

Accessor functions are defined in SAM and BAM modules. Lists of these functions to SAM.Record and BAM.Record are described in SAM and BAM sections, respectively.

SAM.Reader and BAM.Reader implement the header function, which returns a SAM.Header object. This is conceptually a sequence of SAM.MetaInfo objects corresponding to header lines that start with '@' markers. To select SAM.MetaInfo records with a specific tag, you can use the find function:

julia> reader = open(SAM.Reader, "data.sam");

julia> find(header(reader), "SQ")
7-element Array{Bio.Align.SAM.MetaInfo,1}:
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=Chr1 LN=30427671
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=Chr2 LN=19698289
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=Chr3 LN=23459830
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=Chr4 LN=18585056
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=Chr5 LN=26975502
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=chloroplast LN=154478
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=mitochondria LN=366924

A SAM.MetaInfo object can be created as follows:

julia> SAM.MetaInfo("SQ", ["SN" => "chr1", "LN" => 1234])
Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=chr1 LN=1234

julia> SAM.MetaInfo("CO", "comment")
Bio.Align.SAM.MetaInfo:
    tag: CO
  value: comment

Getting records in a range

BioAlignments.jl supports the BAI index to fetch records in a specific range from a BAM file. Samtools provides index subcommand to create an index file (.bai) from a sorted BAM file.

$ samtools index -b SRR1238088.sort.bam
$ ls SRR1238088.sort.bam*
SRR1238088.sort.bam     SRR1238088.sort.bam.bai

eachoverlap(reader, chrom, range) returns an iterator of BAM records overlapping the query interval:

reader = open(BAM.Reader, "SRR1238088.sort.bam", index="SRR1238088.sort.bam.bai")
for record in eachoverlap(reader, "Chr2", 10000:11000)
    # `record` is a BAM.Record object
    # ...
end
close(reader)

eachoverlap also supports Interval type definded in BioAlignments.jl.

# Load GFF3 module.
using GenomicFeatures

# Load genomic features from a GFF3 file.
features = open(collect, GFF3.Reader, "TAIR10_GFF3_genes.gff")

# Keep mRNA features.
filter!(x -> GFF3.featuretype(x) == "mRNA", features)

# Load BAM module.
using BioAlignments

# Open a BAM file and iterate over records overlapping mRNA transcripts.
reader = open(BAM.Reader, "SRR1238088.sort.bam", index="SRR1238088.sort.bam.bai")
for feature in features
    for record in eachoverlap(reader, feature)
        # `record` overlaps `feature`.
        # ...
    end
end
close(reader)

Performance tips

The size of a BAM file is often extremely large. The iterator interface mentioned above allocates an object for each record and that may be a bottleneck of reading data from a BAM file. In-place reading reuses a preallocated object for every record and less memory allocation happens in reading:

reader = open(BAM.Reader, "data.bam")
record = BAM.Record()
while !eof(reader)
    read!(reader, record)
    # do something
end

Accessing optional fields will results in type instability in Julia, which has a significant negative impact on performance. If the user knows the type of a value in advance, specifying it as a type annotation will alleviate the problem:

for record in open(BAM.Reader, "data.bam")
    nm = record["NM"]::UInt8
    # do something
end