BGZF readers
BGZFReader and SyncBGZFReader
BGZFLib exposes two readers:
BGZFReader uses a number of worker tasks to decompress concurrently with reading. This gives the highest performance and should be the default choice. However, in my experience Julia's task scheduler is not fantastic and can cause performance problems.
SyncBGZFReader avoids task scheduling by decompressing in the reading task. It therefore decompresses neither in parallel nor concurrently with reading, which makes it slower. However, it is simpler, less likely to have bugs, and does not strain the scheduler.
Most users should use BGZFReader.
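To make the difference between the two strategies concrete, here is a minimal sketch using only Base Julia. It is not how BGZFLib is implemented: `fake_decompress`, `process_serial` and `process_concurrent` are hypothetical names, and reversing a block merely stands in for real decompression.

```julia
# Stand-in for decompressing one block (hypothetical; real decompression is
# done by the library). Any pure function of a block works for the sketch.
fake_decompress(block::Vector{UInt8}) = reverse(block)

# Serial strategy: every block is processed in the calling task,
# in the spirit of SyncBGZFReader.
function process_serial(blocks::Vector{Vector{UInt8}})
    out = UInt8[]
    for block in blocks
        append!(out, fake_decompress(block))
    end
    return out
end

# Concurrent strategy: each block is handed to a spawned task, and the
# results are collected in the original order, in the spirit of BGZFReader.
function process_concurrent(blocks::Vector{Vector{UInt8}})
    tasks = map(blocks) do block
        Threads.@spawn fake_decompress(block)
    end
    out = UInt8[]
    for task in tasks
        append!(out, fetch(task))
    end
    return out
end
```

Both functions return the same bytes. The concurrent version can only run blocks in parallel when Julia is started with more than one thread, and it relies on the task scheduler, which is the trade-off described above.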
Constructing BGZF readers
Both readers make use of the AbstractBufReader interface. They both contain un-expandable buffers, i.e. fill_buffer generally cannot expand the buffer size and will return nothing if the buffer is not empty.
Readers wrap another AbstractBufReader containing the compressed BGZF stream:
using BufferIO
using BGZFLib
reader = CursorReader(bgzf_data);
@assert reader isa AbstractBufReader
bgzf_reader = BGZFReader(reader)
println(String(read(bgzf_reader)))
close(bgzf_reader)
# output
Hello, world!more dataxthen some moremore content herethis is another block
The AbstractBufReader must be able to buffer a full BGZF block, or the remainder of the underlying file, whichever is smaller.
If an AbstractBufReader with less space is provided, a BGZFError(nothing, BGZFErrors.insufficient_reader_space) is thrown.
If a reader is constructed from an IO, it is automatically wrapped in a BufReader with an appropriate buffer size:
bgzf_reader = SyncBGZFReader(IOBuffer(bgzf_data))
println(String(read(bgzf_reader)))
close(bgzf_reader)
# output
Hello, world!more dataxthen some moremore content herethis is another block
Mutating the wrapped io object of a BGZF reader or writer is not permitted and can cause erratic behaviour.
Reading BGZF files
Readers have a check_truncated keyword that defaults to true. When it is true, the reader throws an error unless the stream ends with an empty block, which marks EOF.
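For background, the BGZF format defines a canonical 28-byte empty block that writers place at the end of a file. The sketch below, using only Base Julia, checks whether a byte vector ends with that marker. `EOF_MARKER` and `ends_with_eof_marker` are hypothetical names and not part of BGZFLib, and the library's own check may be more lenient (for example, accepting any empty final block).

```julia
# The canonical empty BGZF block (28 bytes), as listed in the SAM/BAM
# format specification. The name EOF_MARKER is hypothetical.
const EOF_MARKER = UInt8[
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff,  # gzip header
    0x06, 0x00,                                                  # XLEN = 6
    0x42, 0x43, 0x02, 0x00, 0x1b, 0x00,                          # 'B','C', SLEN = 2, BSIZE = 27
    0x03, 0x00,                                                  # empty DEFLATE stream
    0x00, 0x00, 0x00, 0x00,                                      # CRC32 of empty data
    0x00, 0x00, 0x00, 0x00,                                      # ISIZE = 0
]

# Hypothetical helper: true if `data` ends with the canonical EOF marker.
function ends_with_eof_marker(data::AbstractVector{UInt8})
    n = length(EOF_MARKER)
    length(data) >= n || return false
    return data[(end - n + 1):end] == EOF_MARKER
end
```

Dropping the final 28 bytes of a well-formed file removes exactly this marker, assuming the file was written with it, so the remaining stream no longer ends with an empty block.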
Example with check_truncated = true (default)
reader = CursorReader(bgzf_data[1:end-28])
String(read(BGZFReader(reader)))
# output
ERROR: BGZFError: Error in block at offset 0: BGZF file ends without EOF marker block, or block is malformed by being too short
[...]
And with check_truncated = false
reader = CursorReader(bgzf_data[1:end-28])
String(read(BGZFReader(reader; check_truncated = false)))
# output
"Hello, world!more dataxthen some moremore content herethis is another block"
Like Base.open, the BGZF readers also have a method that takes a function as its first argument, and makes sure to close the reader even if the function errors:
SyncBGZFReader(io -> String(read(io)), CursorReader(bgzf_data))
# output
"Hello, world!more dataxthen some moremore content herethis is another block"
Errors and error recovery
Types in this package generally throw BGZFErrors:
BGZFLib.BGZFError — Type
BGZFError <: Exception
Exception type thrown by BGZF readers and writers when encountering errors specific to the BGZF (or gzip, or DEFLATE) formats. Note that exceptions thrown by BGZF readers and writers are not guaranteed to be of this type, as they may also throw BufferIO.IOErrors, or exceptions propagated by their underlying IO.
This error contains two public properties:
block_offset::Union{Nothing, Int} gives the zero-based offset in the compressed stream of the block where the error occurred. Some errors may not occur at a specific block, in which case this is nothing.
type::Union{BGZFErrorType, LibDeflateError}. If the blocks are malformed gzip blocks, this is a LibDeflateError. Else, if the error is specific to the BGZF format, it's a BGZFErrorType.
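To illustrate what a zero-based block offset in the compressed stream refers to, the following Base-only sketch walks a BGZF byte vector and lists where each block starts. It assumes the common layout in which the BC subfield is the first extra field of every block, so that the two-byte BSIZE value (block size minus one, little-endian) sits at header bytes 16 and 17. `block_offsets` and `EMPTY_BLOCK` are hypothetical names and not part of BGZFLib.

```julia
# Canonical empty BGZF block (28 bytes), used here only as test input.
const EMPTY_BLOCK = UInt8[
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0x06, 0x00,
    0x42, 0x43, 0x02, 0x00, 0x1b, 0x00, 0x03, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
]

# Hypothetical helper: return the zero-based start offset of every block.
# Assumes the BC subfield is the first extra field of each block.
function block_offsets(data::Vector{UInt8})
    offsets = Int[]
    pos = 1  # one-based index of the current block's first byte
    while pos + 17 <= length(data)
        # Zero-based header bytes 12 and 13 must be 'B' and 'C'.
        if data[pos + 12] != UInt8('B') || data[pos + 13] != UInt8('C')
            error("no BC field in block at offset $(pos - 1)")
        end
        # Zero-based header bytes 16-17 hold BSIZE = block size - 1.
        bsize = Int(data[pos + 16]) | (Int(data[pos + 17]) << 8)
        push!(offsets, pos - 1)
        pos += bsize + 1
    end
    return offsets
end
```

For two concatenated empty blocks, `block_offsets(vcat(EMPTY_BLOCK, EMPTY_BLOCK))` yields the offsets 0 and 28. Because BSIZE is a 16-bit value, a block can be at most 65536 bytes long.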
BGZFLib.BGZFErrors — Module
module BGZFErrors
This module is used as a namespace for the enum BGZFErrorType. The enum is non-exhaustive (more variants may be added in the future). The current values are:
truncated_file: The reader data stops abruptly, either in the middle of a block or without an empty block at EOF.
missing_bc_field: A block has no BC field, or it's malformed.
block_offset_out_of_bounds: Seek with a VirtualOffset where the block offset is larger than the block size.
insufficient_reader_space: The BGZF reader wraps an AbstractBufReader that is not EOF, and its buffer can't grow to encompass a whole BGZF block.
insufficient_writer_space: A BGZF writer wraps an AbstractBufWriter whose buffer cannot grow to encompass a full BGZF block.
unsorted_index: Attempted to load a malformed GZI file with unsorted coordinates, or with a file index > 2^48, or with a block size > 2^16.
operation_on_error: Attempted an operation on a BGZF reader or writer in an error state.
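The 2^48 and 2^16 limits mentioned above match the conventional BGZF virtual offset, which packs a compressed file offset into the upper 48 bits of a 64-bit integer and an offset within the decompressed block into the lower 16 bits. The sketch below shows that packing with Base Julia only. It assumes the conventional layout, and `pack_virtual_offset` and `unpack_virtual_offset` are hypothetical helpers: BGZFLib's own VirtualOffset type may expose a different interface.

```julia
# Hypothetical helper: pack a compressed file offset (48 bits) and an offset
# inside the decompressed block (16 bits) into one 64-bit value.
function pack_virtual_offset(file_offset::Integer, block_offset::Integer)
    0 <= file_offset < 2^48 ||
        throw(ArgumentError("file offset must fit in 48 bits"))
    0 <= block_offset < 2^16 ||
        throw(ArgumentError("block offset must fit in 16 bits"))
    return (UInt64(file_offset) << 16) | UInt64(block_offset)
end

# Hypothetical helper: split a packed value back into its two parts.
unpack_virtual_offset(v::UInt64) = (Int(v >> 16), Int(v & 0xffff))
```

Under this layout, a virtual offset whose block-offset part points past the end of the decompressed block cannot refer to a real position, which is the situation block_offset_out_of_bounds describes.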
However, some operations on BGZF readers and writers propagate to their underlying IO, which may throw different errors. For example, when calling seek on a BGZF reader wrapping a file (e.g. SyncBGZFReader{BufReader{IOStream}}), seek is also called on the underlying IOStream. This may throw another error.
When attempting to read a malformed BGZF file, the reader will throw a BGZFError and be in an error state. In this state, some operations like BufferIO.fill_buffer and Base.seek will throw a BGZFError(nothing, BGZFErrors.operation_on_error).
To recover the BGZF reader, seek to a valid position:
bad_data = append!(copy(bgzf_data), "some bad data")
reader = BGZFReader(CursorReader(bad_data))
# Trigger an error from reading bad gzip data
try
read(reader)
catch error
@assert error isa BGZFError
@assert error.type isa LibDeflateError
end
# Trying to read from a reader in an error state will throw
# a BGZF error.
# Note that e.g. calling `read` would throw the same error
try
fill_buffer(reader)
catch error
@assert error isa BGZFError
@assert error.type === BGZFErrors.operation_on_error
end
# Reset the reader by calling `seek`. If seek succeeds, the error
# state will disappear.
seekstart(reader)
println(String(read(reader, 13)))
close(reader)
# output
Hello, world!
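The catch-then-seek pattern above can be wrapped in a small helper. The following is a minimal sketch using only Base Julia. `with_rewind` is a hypothetical name and not part of BGZFLib, and the exception type is passed in as an argument so that the helper works with any seekable reader.

```julia
# Hypothetical helper: run `f(io)`. If it throws an exception of type `E`,
# rewind `io` to the start and return `nothing`. Other exceptions propagate.
function with_rewind(f, io, ::Type{E}) where {E <: Exception}
    try
        return f(io)
    catch err
        err isa E || rethrow()
        # For a BGZF reader, a successful seek also clears the error state,
        # as described above.
        seekstart(io)
        return nothing
    end
end
```

With a BGZF reader, this would be called as `with_rewind(r -> read(r, 13), reader, BGZFError)`. Note that the helper only rewinds: if the underlying data is malformed, reading past the bad block will fail again.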
Reference
BGZFLib.BGZFReader — Type
BGZFReader(io::T <: IO; n_workers::Int, check_truncated::Bool=true)::BGZFReader{BufReader{T}}
BGZFReader(io::T <: AbstractBufReader; n_workers::Int, check_truncated::Bool=true)::BGZFReader{T}
Create a BGZFReader <: AbstractBufReader that decompresses a BGZF stream.
When constructing from an io::AbstractBufReader, io must have a buffer size of at least 65536, or be able to grow its buffer to this size.
If check_truncated, the last BGZF block in the file must be empty, otherwise the reader throws an error. This can be used to detect that the file was truncated.
The decompression happens asynchronously in a set of worker tasks. To avoid spawning workers, use the SyncBGZFReader instead.
If the reader encounters an error, it goes into an error state and throws an exception. The reader can be reset by using seek or seekstart. A closed reader cannot be reset.
BGZFLib.SyncBGZFReader — Type
SyncBGZFReader(io::T <: IO; check_truncated::Bool=true)::SyncBGZFReader{BufReader{T}}
SyncBGZFReader(io::T <: AbstractBufReader; check_truncated::Bool=true)::SyncBGZFReader{T}
Create a SyncBGZFReader <: AbstractBufReader that decompresses BGZF files.
When constructing from an io::AbstractBufReader, io must have a buffer size of at least 65536, or be able to grow its buffer to this size.
If check_truncated, the last BGZF block in the file must be empty, otherwise the reader throws an error. This can be used to detect that the file was truncated.
Unlike BGZFReader, the decompression happens serially in the main task. This is slower and does not enable parallelism, but may be preferable in situations where task scheduling or contention is an issue.
If the reader encounters an error, it goes into an error state and throws an exception. The reader can be reset by using seek or seekstart. A closed reader cannot be reset.