Reading and writing data
Bio.jl has a unified interface for reading and writing files in a variety of formats. Reader and writer type names have a prefix of the file format. For example, files of a format X
can be read using XReader
and can be written using XWriter
. To initialize a reader/writer of X
, you can use one of the following syntaxes:
# reader open(::Type{XReader}, filepath::AbstractString, args...) XReader(stream::IO, args...) # writer open(::Type{XWriter}, filepath::AbstractString, args...) XWriter(stream::IO, args...)
For example, when reading a FASTA file, a reader for the FASTA file format can be initialized as:
using Bio.Seq # import FASTAReader reader = open(FASTAReader, "hg38.fa") # do something close(reader)
Reading by iteration
Readers in Bio.jl all read and return entries one at a time. The most convenient way to do this by iteration:
reader = open(BEDReader, "input.bed") for record in reader # perform some operation on entry end close(reader)
In-place reading
Iterating through entries in a file is convenient, but for each entry in the file, the reader must allocate, and ultimately the garbage collector must spend time to deallocate it. For performance critical applications, a separate lower level parsing interface can be used that avoid unnecessary allocation by overwriting one entry. For files with a large number of small entries, this can greatly speed up reading.
Instead of looping over a reader stream read!
is called with a preallocated entry.
reader = open(BEDReader, "input.bed") record = BEDInterval() while !eof(reader) read!(reader, record) # perform some operation on `entry` end close(reader)
Some care is necessary when using this interface. Because entry
is completely overwritten on each iteration, one must manually copy any field from entry
that should be preserved. For example, if we wish to save the seqname
field from entry
when parsing BED, we must call copy(entry.seqname)
.
Empty entry types that correspond to the file format be found using eltype
, making it easy to allocate an empty entry for any reader stream.
entry = eltype(stream)()
Writing data
A FASTA file will be created as follows:
writer = open(FASTAWriter, "out.fa") write(writer, FASTASeqRecord("seq1", dna"ACGTN")) write(writer, FASTASeqRecord("seq2", dna"TTATA", "AT rich")) close(writer)
Another way is using Julia's do-block syntax, which closes the data file after finished writing:
open(FASTAWriter, "out.fa") do writer write(writer, FASTASeqRecord("seq1", dna"ACGTN")) write(writer, FASTASeqRecord("seq2", dna"TTATA", "AT rich")) end
Supported file formats
The following table summarizes supported file formats.
File format | Prefix | Module | Specification |
---|---|---|---|
FASTA | FASTA |
Bio.Seq |
https://en.wikipedia.org/wiki/FASTA_format |
FASTQ | FASTQ |
Bio.Seq |
https://en.wikipedia.org/wiki/FASTQ_format |
.2bit | TwoBit |
Bio.Seq |
http://genome.ucsc.edu/FAQ/FAQformat.html#format7 |
BED | BED |
Bio.Intervals |
https://genome.ucsc.edu/FAQ/FAQformat.html#format1 |
GFF3 | GFF3 |
Bio.Intervals |
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md |
bigBed | BigBed |
Bio.Intervals |
https://doi.org/10.1093/bioinformatics/btq351 |
PDB | PDB |
Bio.Structure |
http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html |
SAM | SAM |
Bio.Align |
https://samtools.github.io/hts-specs/SAMv1.pdf |
BAM | BAM |
Bio.Align |
https://samtools.github.io/hts-specs/SAMv1.pdf |
FASTA
- Reader type:
FASTAReader{S<:Sequence}
- Writer type:
FASTAWriter{T<:IO}
- Element type:
SeqRecord{S,FASTAMetadata}
(alias:FASTASeqRecord{S}
)
FASTA is a text-based file format for representing biological sequences. A FASTA file stores a list of sequence records with name, description, and sequence. The template of a sequence record is:
>{name} {description}? {sequence}
Here is an example of a chromosomal sequence:
>chrI chromosome 1 CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG
Usually sequence records will be read sequentially from a file by iteration. But if the FASTA file has an auxiliary index file formatted in fai, the reader supports random access to FASTA records, which would be useful when accessing specific parts of a huge genome sequence:
reader = open(FASTAReader, "sacCer.fa", index="sacCer.fa.fai") chrIV = reader["chrIV"] # directly read chromosome 4
#
Bio.Seq.FASTAReader
— Type.
FASTAReader(input::IO; index=nothing) FASTAReader{S}(input::IO; index=nothing)
Create a data reader of the FASTA file format.
When type parameter S
is specified, the reader reads sequences in that type; otherwise the reader tries to infer the sequence type based on the frequencies of characters from the input.
Arguments
input
: data sourceindex=nothing
: filepath to a random access index (currently fai is supported)
#
Bio.Seq.FASTAWriter
— Type.
FASTAWriter(output::IO; width=70)
Create a data writer of the FASTA file format.
Arguments
output
: data sinkwidth=70
: wrapping width of sequence characters
FASTQ
- Reader type:
FASTQReader{S<:Sequence}
- Writer type:
FASTQWriter{T<:IO}
- Element type:
SeqRecord{S,FASTQMetadata}
(alias:FASTQSeqRecord{S}
)
FASTQ is a text-based file format for representing DNA sequences along with qualities for each base. A FASTQ file stores a list of sequence records in the following format:
@{name} {description}? {sequence} + {qualities}
Here is an example of one record from a FASTQ file:
@FSRRS4401BE7HA tcagTTAAGATGGGAT + ###EEEEEEEEE##E#
To read a file containing such records, one could use:
# The default base quality encoding is Sanger. reader = open(FASTQReader, "reads.fastq") for record in reader # do something end close(reader) # If performance is important, in-place reading will be much faster. reader = open(FASTQReader, "reads.fastq") record = FASTQSeqRecord{DNASequence}() while !eof(reader) read!(reader, record) # do something end close(reader)
#
Bio.Seq.FASTQReader
— Type.
FASTQReader(input::IO, quality_encoding=:sanger, fill_ambiguous=nothing)
Create a data reader of the FASTQ file format.
Arguments
input
: data sourcequality_encoding=:sanger
: encoding of base qualities; see the following table for available valuesfill_ambiguous=nothing
: fill ambiguous nucleotides with the given nucleotide
Quality encoding | Symbol | ASCII offset | Quality range |
---|---|---|---|
Sanger | :sanger |
+33 | 0-93 |
Solexa | :solexa |
+64 | -5-62 |
Illumina 1.3+ | :illumina13 |
+64 | 0-62 |
Illumina 1.5+ | :illumina15 |
+64 | 2-62 |
Illumina 1.8+ | :illumina18 |
+33 | 0-93 |
#
Bio.Seq.FASTQWriter
— Type.
FASTQWriter(output::IO; quality_header=false, quality_encoding=:sanger)
Create a data writer of the FASTQ file format.
Arguments
output
: data sinkquality_header=false
: output the title line at the third line just after '+'quality_encoding=:sanger
: encoding of base qualities; seeFASTQReader
for available values
.2bit
- Reader type:
TwoBitReader{T<:IO}
- Writer type:
TwoBitWriter{T<:IO}
- Element type:
SeqRecord{ReferenceSequence,Vector{UnitRange{Int}}}
.2bit is a binary file format designed for storing a genome consists of multiple chromosomal sequences. The reading speed is often an order of magnitude faster than that of FASTA and the file size is smaller. However, since the .2bit file format is specialized for genomic sequences, it cannot store either RNA or amino acid sequences.
Like FASTA, the .2bit reader supports random access using an index included in the header section of a .2bit file:
reader = open(TwoBitReader, "sacCer.2bit") # load a random access index in the header chrIV = reader["chrIV"] # directly read chromosome 4
#
Bio.Seq.TwoBitReader
— Type.
TwoBitReader(input::IO)
Create a data reader of the 2bit file format.
Arguments
input
: data source
#
Bio.Seq.TwoBitWriter
— Type.
TwoBitWriter(output::IO, names::AbstractVector)
Create a data writer of the 2bit file format.
Arguments
output
: data sinknames
: a vector of sequence names written tooutput
BED
- Reader type:
BEDReader
- Writer type:
BEDWriter{T<:IO}
- Element type:
Interval{BEDMetadata}
(alias:BEDInterval
)
BED is a text-based file format for representing genomic annotations like genes, transcripts, and so on. A BED file has tab-delimited and variable-length fields; the first three fields denoting a genomic interval are mandatory.
This is an example of RNA transcripts:
chr9 68331023 68424451 NM_015110 0 + chr9 68456943 68486659 NM_001206 0 -
#
Bio.Intervals.BEDReader
— Type.
BEDReader(input::IO)
Create a data reader of the BED file format.
Arguments
input
: data source
#
Bio.Intervals.BEDWriter
— Type.
BEDWriter(output::IO)
Create a data writer of the BED file format.
Arguments
output
: data sink
GFF3
- Reader type:
GFF3Reader
- Element type:
Interval{GFF3Metadata}
(alias:GFF3Interval
)
GFF3 is a text-based file format for representing genomic annotations. The major difference from BED is that is GFF3 is more structured and can include sequences in the FASTA file format.
#
Bio.Intervals.GFF3Reader
— Type.
GFF3Reader(input::IO, save_directives::Bool=false)
Create a reader for data in GFF3 format.
Arguments:
input
: data sourcesave_directives=false
: if true, store directive lines, which can be accessed with thedirectives
function
bigBed
- Reader type:
BigBedReader
- Writer type:
BigBedWriter{T<:IO}
- Element type:
Interval{BEDMetadata}
(alias:BEDInterval
)
BigBed is a binary file format for representing genomic annotations and often created from BED files. The bigBed files are indexed to quickly fetch specific regions.
#
Bio.Intervals.BigBedReader
— Type.
BigBedReader(input::IO)
Create a data reader of the BigBed file format.
Arguments
input
: data source
#
Bio.Intervals.BigBedWriter
— Type.
BigBedWriter(output::IO)
Create a data writer of the BigBed file format.
Arguments
output
: data sink
PDB
PDB is a text-based file format for representing 3D macromolecular structures. This has different reader interfaces from other file formats. Please consult the Bio.Structure chapter for details.
SAM
- Reader type:
SAMReader
- Writer type:
SAMWriter{T<:IO}
- Element type:
SAMRecord
SAM is a text-based file format for representing sequence alignments.
#
Bio.Align.SAMReader
— Type.
SAMReader(input::IO)
Create a data reader of the SAM file format.
Arguments
input
: data source
#
Bio.Align.SAMWriter
— Type.
SAMWriter(output::IO, header::SAMHeader=SAMHeader())
Create a data writer of the SAM file format.
Arguments
output
: data sinkheader=SAMHeader()
: SAM header object
BAM
- Reader type:
BAMReader
- Writer type:
BAMWriter
- Element type:
BAMRecord
BAM is a binary counterpart of the SAM file format.
When writing data in the BAM file format, the underlying output stream needs to be wrapped with a BGZFStream
object provided from BGZFStreams.jl.
#
Bio.Align.BAMReader
— Type.
BAMReader(input::IO; index=nothing)
Create a data reader of the BAM file format.
Arguments
input
: data sourceindex=nothing
: filepath to a random access index (currently bai is Supported)
#
Bio.Align.BAMWriter
— Type.
BAMWriter(output::BGZFStream, header::SAMHeader)
Create a data writer of the BAM file format.
Arguments
output
: data sinkheader
: SAM header object