Reading and writing data

Bio.jl has a unified interface for reading and writing files in a variety of formats. Reader and writer type names are prefixed with the name of the file format: files in a format X are read with XReader and written with XWriter. To initialize a reader or writer for X, use one of the following syntaxes:

# reader
open(::Type{XReader}, filepath::AbstractString, args...)
XReader(stream::IO, args...)

# writer
open(::Type{XWriter}, filepath::AbstractString, args...)
XWriter(stream::IO, args...)

For example, when reading a FASTA file, a reader for the FASTA file format can be initialized as:

using Bio.Seq  # import FASTAReader
reader = open(FASTAReader, "hg38.fa")
# do something
close(reader)
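
The two forms above relate in a simple way: opening by file path amounts to opening the file and wrapping the resulting stream. The following is a minimal sketch of that pattern, assuming a hypothetical `LineReader` type that is not part of Bio.jl; the real readers may be implemented differently.

```julia
# Hypothetical reader type for illustration only; not part of Bio.jl.
struct LineReader{T<:IO}
    stream::T
end

# Path-based form: open the file, then wrap the stream in the reader.
Base.open(::Type{LineReader}, filepath::AbstractString) =
    LineReader(open(filepath))

# Closing the reader closes the underlying stream.
Base.close(reader::LineReader) = close(reader.stream)

# Write a small file so that the path-based form has something to open.
path, io = mktemp()
write(io, ">seq1\nACGT\n")
close(io)

# Form 1: initialize from a file path.
from_path = open(LineReader, path)
first_line = readline(from_path.stream)
close(from_path)

# Form 2: initialize from any IO stream, here an in-memory buffer.
from_stream = LineReader(IOBuffer(">seq1\nACGT\n"))
same_line = readline(from_stream.stream)
close(from_stream)
```

Because the stream-based constructor accepts any `IO`, the same reader type can consume files, in-memory buffers, or pipes.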

Reading by iteration

Readers in Bio.jl all read and return entries one at a time. The most convenient way to do this is by iteration:

using Bio.Intervals  # import BEDReader
reader = open(BEDReader, "input.bed")
for record in reader
    # perform some operation on `record`
end
close(reader)
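
For readers curious how a `for` loop over a reader can work, the sketch below implements Julia's iteration protocol (`Base.iterate`) for a hypothetical `ToyBEDReader` that parses three tab-separated columns. The type, its field names, and the named-tuple record are assumptions made for illustration; Bio.jl's BEDReader is more complete.

```julia
# Hypothetical three-column interval reader; not Bio.jl's BEDReader.
struct ToyBEDReader{T<:IO}
    stream::T
end

# Iteration protocol: return (item, state), or nothing when exhausted.
function Base.iterate(reader::ToyBEDReader, state=nothing)
    eof(reader.stream) && return nothing
    fields = split(readline(reader.stream), '\t')
    record = (seqname=String(fields[1]),
              chromstart=parse(Int, fields[2]),
              chromend=parse(Int, fields[3]))
    return record, nothing
end

# The number of records is unknown until the stream has been consumed.
Base.IteratorSize(::Type{<:ToyBEDReader}) = Base.SizeUnknown()

# Sum the interval lengths, parsing one record per loop iteration.
function total_length(reader)
    total = 0
    for record in reader
        total += record.chromend - record.chromstart
    end
    return total
end

reader = ToyBEDReader(IOBuffer("chr1\t100\t200\nchr2\t50\t80\n"))
covered = total_length(reader)
```

Each pass through the loop allocates a fresh record, which is the cost that in-place reading is designed to avoid.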

In-place reading

Iterating through entries in a file is convenient, but for each entry the reader must allocate a new object, and the garbage collector must eventually spend time deallocating it. For performance-critical applications, a separate lower-level parsing interface avoids this allocation by repeatedly overwriting a single entry. For files with a large number of small entries, this can greatly speed up reading.

Instead of looping over a reader stream, read! is called with a preallocated entry:

reader = open(BEDReader, "input.bed")
record = BEDInterval()
while !eof(reader)
    read!(reader, record)
    # perform some operation on `record`
end
close(reader)

Some care is necessary when using this interface. Because record is completely overwritten on each iteration, any field of record that should be preserved must be copied manually. For example, to save the seqname field of record when parsing BED, call copy(record.seqname).

Empty entry types that correspond to the file format can be found using eltype, making it easy to allocate an empty entry for any reader:

record = eltype(reader)()
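
The copying caveat is easy to overlook, so here is a small self-contained sketch of what goes wrong without it. `ToyRecord` and its `read!` method are hypothetical stand-ins, assumed only for illustration, that reuse a byte buffer between reads in the way described above.

```julia
# Hypothetical mutable record; not a Bio.jl type.
mutable struct ToyRecord
    seqname::Vector{UInt8}   # buffer that is reused between reads
    chromstart::Int
end
ToyRecord() = ToyRecord(UInt8[], 0)

# Overwrite `record` in place with the next tab-separated line of `io`.
function Base.read!(io::IO, record::ToyRecord)
    name, start = split(readline(io), '\t')
    empty!(record.seqname)                    # keep the same buffer object
    append!(record.seqname, codeunits(name))
    record.chromstart = parse(Int, start)
    return record
end

io = IOBuffer("chr1\t100\nchr2\t50\n")
record = ToyRecord()
aliased = Vector{UInt8}[]
copied = Vector{UInt8}[]
while !eof(io)
    read!(io, record)
    push!(aliased, record.seqname)        # stores a reference to the reused buffer
    push!(copied, copy(record.seqname))   # stores an independent snapshot
end

# Both `aliased` elements are the same buffer and now hold the last name read.
aliased_names = [String(copy(v)) for v in aliased]
copied_names = [String(copy(v)) for v in copied]
```

After the loop, `aliased_names` contains the last sequence name twice, while `copied_names` preserves each name as it was read.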

Writing data

A FASTA file can be written as follows:

using Bio.Seq  # import FASTAWriter, FASTASeqRecord, and the dna"..." literal
writer = open(FASTAWriter, "out.fa")
write(writer, FASTASeqRecord("seq1", dna"ACGTN"))
write(writer, FASTASeqRecord("seq2", dna"TTATA", "AT rich"))
close(writer)

Another way is to use Julia's do-block syntax, which closes the data file automatically once writing has finished:

open(FASTAWriter, "out.fa") do writer
    write(writer, FASTASeqRecord("seq1", dna"ACGTN"))
    write(writer, FASTASeqRecord("seq2", dna"TTATA", "AT rich"))
end
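
The do-block form relies on Julia's generic `open(f::Function, args...)` method in Base, which calls `open(args...)`, passes the result to the block, and calls `close` in a `finally` clause. The sketch below shows this with a hypothetical `ToyWriter` type (an assumption for illustration, not Bio.jl's FASTAWriter): defining `open` and `close` for the type is enough for the do-block form to work.

```julia
# Hypothetical writer type for illustration only; not part of Bio.jl.
struct ToyWriter{T<:IO}
    stream::T
end

# Path-based form: open the file for writing and wrap the stream.
Base.open(::Type{ToyWriter}, filepath::AbstractString) =
    ToyWriter(open(filepath, "w"))

# Closing the writer flushes and closes the underlying stream.
Base.close(writer::ToyWriter) = close(writer.stream)

path = tempname()
handle = Ref{Any}(nothing)   # keeps a reference so the stream can be inspected later

# Base's generic open(f, args...) opens the writer, runs the block, then closes it.
open(ToyWriter, path) do writer
    handle[] = writer
    write(writer.stream, ">seq1\nACGTN\n")
end

contents = read(path, String)
closed_afterwards = !isopen(handle[].stream)
```

Because the close happens in a `finally` clause, the file is also closed when the block throws an exception.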

Supported file formats

The following table summarizes supported file formats.

File format	Prefix	Module	Specification
FASTA	`FASTA`	`Bio.Seq`	https://en.wikipedia.org/wiki/FASTA_format
FASTQ	`FASTQ`	`Bio.Seq`	https://en.wikipedia.org/wiki/FASTQ_format
.2bit	`TwoBit`	`Bio.Seq`	http://genome.ucsc.edu/FAQ/FAQformat.html#format7
BED	`BED`	`Bio.Intervals`	https://genome.ucsc.edu/FAQ/FAQformat.html#format1
GFF3	`GFF3`	`Bio.Intervals`	https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
bigBed	`BigBed`	`Bio.Intervals`	https://doi.org/10.1093/bioinformatics/btq351
PDB	`PDB`	`Bio.Structure`	http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html
SAM	`SAM`	`Bio.Align`	https://samtools.github.io/hts-specs/SAMv1.pdf
BAM	`BAM`	`Bio.Align`	https://samtools.github.io/hts-specs/SAMv1.pdf

FASTA

Reader type: FASTAReader{S<:Sequence}
Writer type: FASTAWriter{T<:IO}
Element type: SeqRecord{S,FASTAMetadata} (alias: FASTASeqRecord{S})

FASTA is a text-based file format for representing biological sequences. A FASTA file stores a list of sequence records with name, description, and sequence. The template of a sequence record is:

>{name} {description}?
{sequence}

Here is an example of a chromosomal sequence:

>chrI chromosome 1
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG
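
To make the header template concrete, the following sketch splits a header line into a name and an optional description. `parse_header` is a hypothetical helper written for this example; Bio.jl's own parser may treat whitespace (for instance tabs) differently.

```julia
# Split a FASTA header line of the form ">{name} {description}?".
# Hypothetical helper for illustration; not a Bio.jl function.
function parse_header(line::AbstractString)
    startswith(line, ">") || throw(ArgumentError("header must start with '>'"))
    # Everything up to the first space is the name; the rest is the description.
    parts = split(line[2:end], ' '; limit=2)
    name = String(parts[1])
    description = length(parts) == 2 ? String(parts[2]) : nothing
    return name, description
end

with_description = parse_header(">chrI chromosome 1")
without_description = parse_header(">seq1")
```

The description is returned as `nothing` when the header contains only a name.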

Usually, sequence records are read sequentially from a file by iteration. However, if the FASTA file has an auxiliary index file in the fai format, the reader also supports random access to FASTA records, which is useful for accessing specific parts of a huge genome sequence:

reader = open(FASTAReader, "sacCer.fa", index="sacCer.fa.fai")
chrIV = reader["chrIV"]  # directly read chromosome 4
close(reader)
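
Random access works because each line of a fai index records where a sequence starts and how its lines are laid out. The sketch below follows the five-column layout written by `samtools faidx` (name, length, offset, bases per line, bytes per line). The `FaiEntry` type and the helper functions are hypothetical and are not Bio.jl's internal implementation.

```julia
# One line of a .fai index, following the samtools faidx column layout.
struct FaiEntry
    name::String
    length::Int      # number of bases in the sequence
    offset::Int      # byte offset of the first base in the FASTA file
    linebases::Int   # bases per line
    linewidth::Int   # bytes per line, including the newline
end

function parse_fai_line(line::AbstractString)
    f = split(line, '\t')
    return FaiEntry(String(f[1]), parse(Int, f[2]), parse(Int, f[3]),
                    parse(Int, f[4]), parse(Int, f[5]))
end

# Byte offset in the FASTA file of the 1-based sequence position `pos`.
function byte_offset(entry::FaiEntry, pos::Integer)
    full_lines, within_line = divrem(pos - 1, entry.linebases)
    return entry.offset + full_lines * entry.linewidth + within_line
end

# Read positions start:stop (1-based, inclusive), skipping newline bytes.
function read_range(io::IO, entry::FaiEntry, start::Integer, stop::Integer)
    seek(io, byte_offset(entry, start))
    buf = UInt8[]
    while length(buf) < stop - start + 1 && !eof(io)
        b = read(io, UInt8)
        b == UInt8('\n') || push!(buf, b)
    end
    return String(buf)
end

# A tiny FASTA file and its matching index, both held in memory.
fasta = IOBuffer(">chrI\nACGTACGTAC\nGGTTAACC\n>chrII\nTTTT\n")
index = Dict(e.name => e for e in
             parse_fai_line.(["chrI\t18\t6\t10\t11", "chrII\t4\t33\t4\t5"]))

spanning = read_range(fasta, index["chrI"], 9, 12)   # crosses a line break
whole = read_range(fasta, index["chrII"], 1, 4)
```

Only the requested bytes are read, so the cost of a lookup does not depend on the size of the file.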

# Bio.Seq.FASTAReader — Type.

FASTAReader(input::IO; index=nothing)
FASTAReader{S}(input::IO; index=nothing)

Create a data reader of the FASTA file format.

When the type parameter S is specified, the reader reads sequences as that type; otherwise, the reader tries to infer the sequence type from the frequencies of characters in the input.

Arguments

input: data source
index=nothing: filepath to a random access index (currently fai is supported)

FASTQ

Quality encoding	Symbol	ASCII offset	Quality range
Sanger	`:sanger`	+33	0 to 93
Solexa	`:solexa`	+64	-5 to 62
Illumina 1.3+	`:illumina13`	+64	0 to 62
Illumina 1.5+	`:illumina15`	+64	2 to 62
Illumina 1.8+	`:illumina18`	+33	0 to 93
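
The ASCII offset column determines how a quality string maps to integer scores: each score is the character's ASCII code minus the offset. The following sketch shows only that arithmetic, using the symbols and offsets from the table above. `QUALITY_OFFSETS` and `decode_quality` are hypothetical names, not Bio.jl's decoder, and the sketch does not check the quality range.

```julia
# ASCII offsets from the table above, keyed by encoding symbol.
const QUALITY_OFFSETS = Dict(
    :sanger     => 33,
    :solexa     => 64,
    :illumina13 => 64,
    :illumina15 => 64,
    :illumina18 => 33,
)

# Decode a quality string into integer scores by subtracting the offset.
# Hypothetical helper for illustration; no range validation is performed.
decode_quality(qual::AbstractString, encoding::Symbol) =
    [Int(c) - QUALITY_OFFSETS[encoding] for c in qual]

sanger_scores = decode_quality("!5I", :sanger)
illumina13_scores = decode_quality("@Th", :illumina13)
```

The same quality string decodes to different scores under different offsets, which is why the encoding has to be known or specified when reading. Solexa scores are additionally on a different scale from Phred scores, so they are not directly comparable even after the offset is removed.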
