High-throughput Sequencing
Overview
High-throughput sequencing (HTS) technologies generate a large amount of data in the form of a large number of nucleotide sequencing reads. One of the most common tasks in bioinformatics is to align these reads against known reference genomes, chromosomes, or contigs. The BioAlignments
module provides several data formats commonly used for this kind of task.
SAM and BAM file formats
SAM and BAM are the most popular file formats and have the same reading and writing interface as all other formats in Bio.jl. A typical code iterating over all records in a file looks like below:
# import the SAM and BAM module using BioAlignments # open a BAM file reader = open(BAM.Reader, "data.bam") # iterate over BAM records for record in reader # `record` is a BAM.Record object if BAM.ismapped(record) # print mapped position println(BAM.refname(record), ':', BAM.position(record)) end end # close the BAM file close(reader)
Accessor functions are defined in SAM
and BAM
modules. Lists of these functions to SAM.Record
and BAM.Record
are described in SAM and BAM sections, respectively.
SAM.Reader
and BAM.Reader
implement the header
function, which returns a SAM.Header
object. This is conceptually a sequence of SAM.MetaInfo
objects corresponding to header lines that start with '@' markers. To select SAM.MetaInfo
records with a specific tag, you can use the find
function:
julia> reader = open(SAM.Reader, "data.sam"); julia> find(header(reader), "SQ") 7-element Array{Bio.Align.SAM.MetaInfo,1}: Bio.Align.SAM.MetaInfo: tag: SQ value: SN=Chr1 LN=30427671 Bio.Align.SAM.MetaInfo: tag: SQ value: SN=Chr2 LN=19698289 Bio.Align.SAM.MetaInfo: tag: SQ value: SN=Chr3 LN=23459830 Bio.Align.SAM.MetaInfo: tag: SQ value: SN=Chr4 LN=18585056 Bio.Align.SAM.MetaInfo: tag: SQ value: SN=Chr5 LN=26975502 Bio.Align.SAM.MetaInfo: tag: SQ value: SN=chloroplast LN=154478 Bio.Align.SAM.MetaInfo: tag: SQ value: SN=mitochondria LN=366924
A SAM.MetaInfo
object can be created as follows:
julia> SAM.MetaInfo("SQ", ["SN" => "chr1", "LN" => 1234]) Bio.Align.SAM.MetaInfo: tag: SQ value: SN=chr1 LN=1234 julia> SAM.MetaInfo("CO", "comment") Bio.Align.SAM.MetaInfo: tag: CO value: comment
Getting records in a range
BioAlignments.jl supports the BAI index to fetch records in a specific range from a BAM file. Samtools provides index
subcommand to create an index file (.bai) from a sorted BAM file.
$ samtools index -b SRR1238088.sort.bam $ ls SRR1238088.sort.bam* SRR1238088.sort.bam SRR1238088.sort.bam.bai
eachoverlap(reader, chrom, range)
returns an iterator of BAM records overlapping the query interval:
reader = open(BAM.Reader, "SRR1238088.sort.bam", index="SRR1238088.sort.bam.bai") for record in eachoverlap(reader, "Chr2", 10000:11000) # `record` is a BAM.Record object # ... end close(reader)
eachoverlap
also supports Interval
type definded in BioAlignments.jl.
# Load GFF3 module. using GenomicFeatures # Load genomic features from a GFF3 file. features = open(collect, GFF3.Reader, "TAIR10_GFF3_genes.gff") # Keep mRNA features. filter!(x -> GFF3.featuretype(x) == "mRNA", features) # Load BAM module. using BioAlignments # Open a BAM file and iterate over records overlapping mRNA transcripts. reader = open(BAM.Reader, "SRR1238088.sort.bam", index="SRR1238088.sort.bam.bai") for feature in features for record in eachoverlap(reader, feature) # `record` overlaps `feature`. # ... end end close(reader)
Performance tips
The size of a BAM file is often extremely large. The iterator interface mentioned above allocates an object for each record and that may be a bottleneck of reading data from a BAM file. In-place reading reuses a preallocated object for every record and less memory allocation happens in reading:
reader = open(BAM.Reader, "data.bam") record = BAM.Record() while !eof(reader) read!(reader, record) # do something end
Accessing optional fields will results in type instability in Julia, which has a significant negative impact on performance. If the user knows the type of a value in advance, specifying it as a type annotation will alleviate the problem:
for record in open(BAM.Reader, "data.bam") nm = record["NM"]::UInt8 # do something end