Seeking BGZF files

Because BGZF files are divided into blocks, it is possible to seek in them, and to index them to support seeking to any position in the equivalent decompressed stream.

Currently, only BGZF readers support seeking. Seek support for writers may be added in the future.

Virtual seeking

A BGZF reader seeks using a virtual offset, which consists of two offsets: the file offset, which is the offset in the compressed stream of the start of a compressed BGZF block, and the block offset, which is the offset within the decompressed content of that block.

This is modeled by the VirtualOffset type:

BGZFLib.VirtualOffset - Type
VirtualOffset(file_offset::Integer, block_offset::Integer)

Create a BGZF virtual file offset from file_offset and block_offset. Get the two offsets with the public properties vo.file_offset and vo.block_offset.

A VirtualOffset contains two zero-indexed offsets: the "file offset", which is the offset in the compressed BGZF file of the beginning of the block that contains the given position, and the "block offset", which is the offset within the decompressed content of that block.

The valid ranges of these two are 0:2^48-1 and 0:2^16-1, respectively.

Examples

julia> reader = SyncBGZFReader(CursorReader(bgzf_data));

julia> vo = VirtualOffset(178, 5)
VirtualOffset(178, 5)

julia> virtual_seek(reader, vo);

julia> String(read(reader, 9))
"some more"

julia> virtual_seek(reader, VirtualOffset(0, 7));

julia> String(read(reader, 6))
"world!"
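The two ranges above add up to 48 + 16 = 64 bits. In the BGZF format, a virtual offset is conventionally stored as a single 64-bit integer, with the file offset in the upper 48 bits and the block offset in the lower 16 bits. The sketch below illustrates that packing in plain Julia. The names pack_virtual_offset and unpack_virtual_offset are hypothetical and are not part of BGZFLib, and nothing here is meant to describe how VirtualOffset is stored internally.

```julia
# Hypothetical helpers illustrating the conventional 64-bit packing of a
# BGZF virtual offset. These are NOT BGZFLib functions.

function pack_virtual_offset(file_offset::Integer, block_offset::Integer)
    # Enforce the valid ranges 0:2^48-1 and 0:2^16-1.
    0 <= file_offset < 2^48 || throw(ArgumentError("file offset out of range"))
    0 <= block_offset < 2^16 || throw(ArgumentError("block offset out of range"))
    # File offset in the upper 48 bits, block offset in the lower 16 bits.
    return (UInt64(file_offset) << 16) | UInt64(block_offset)
end

function unpack_virtual_offset(packed::UInt64)
    return (file_offset = Int(packed >> 16), block_offset = Int(packed & 0xffff))
end

packed = pack_virtual_offset(178, 5)
unpack_virtual_offset(packed)  # returns (file_offset = 178, block_offset = 5)
```

Because the file offset occupies the high bits, comparing two packed values as integers orders them by position in the file.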

The current VirtualOffset position is obtained with virtual_position, and seeking is done with virtual_seek:

BGZFLib.virtual_position - Function
virtual_position(io::Union{SyncBGZFReader, BGZFReader})::VirtualOffset

Get the current VirtualOffset of the BGZF reader io. The virtual offset identifies a position in the decompressed stream. Seek to the position using virtual_seek.

See also: VirtualOffset, virtual_seek

Examples

julia> reader = SyncBGZFReader(CursorReader(bgzf_data));

julia> virtual_position(reader)
VirtualOffset(0, 0)

julia> read(reader, 18);

julia> virtual_position(reader)
VirtualOffset(44, 5)

julia> close(reader)

BGZFLib.virtual_seek - Function
virtual_seek(io::Union{SyncBGZFReader, BGZFReader}, vo::VirtualOffset) -> io

Seek to the virtual position vo. The virtual position is usually obtained by a call to virtual_position.

See also: VirtualOffset, virtual_position

julia> reader = SyncBGZFReader(CursorReader(bgzf_data));

julia> virtual_seek(reader, VirtualOffset(178, 14));

julia> String(read(reader))
"more content herethis is another block"

julia> virtual_seek(reader, VirtualOffset(0, 0));

julia> String(read(reader, 13))
"Hello, world!"

julia> close(reader)

BGZF readers also support Base.seek. Calling seek(io, x) is equivalent to virtual_seek(io, VirtualOffset(x, 0)):

Base.seek - Method
seek(io::Union{SyncBGZFReader, BGZFReader}, offset::Int)

Seek to the zero-indexed position offset in the compressed stream. This position must be the beginning of a BGZF block, else the reader will error when trying to read after the seek. seek(io, offset) is equivalent to virtual_seek(io, VirtualOffset(offset, 0)). seek(io, 0) works, and is equivalent to seekstart(io).

Examples

julia> reader = BGZFReader(CursorReader(bgzf_data));

julia> seek(reader, 44);

julia> read(reader, String)
"more dataxthen some moremore content herethis is another block"

julia> seek(reader, 45); # NB: Not start of BGZF block

julia> read(reader, UInt8)
ERROR: BGZFError: Error in block at offset 0: Error in parsing gzip content: gzip_bad_magic_bytes

julia> close(reader)

Seeking with GZIndex

In order to seek to a certain decompressed offset, e.g. to seek to the 10,000th byte in a decompressed stream, you need to know the offset of the BGZF block that contains this byte in the compressed stream. This can be efficiently obtained with a GZIndex.

An index::GZIndex value contains the (public) property index.blocks, which is a Vector{@NamedTuple{compressed_offset::UInt64, decompressed_offset::UInt64}}, with one element for each block in the corresponding file. All the values of compressed_offset and decompressed_offset are guaranteed to be sorted in ascending order in a GZIndex:

BGZFLib.GZIndex - Type
GZIndex(blocks::Vector{@NamedTuple{compressed_offset::UInt64, decompressed_offset::UInt64}})

Construct a GZI index of a BGZF file. The vector blocks contains one pair of integers for each block in the BGZF file, in order, containing the zero-based offset of the compressed data and the corresponding decompressed data, respectively.

Throw a BGZFError(nothing, BGZFErrors.unsorted_index) if either of the offsets is not sorted in ascending order.

Usually constructed with index_bgzf or load_gzi, and serialized with write_gzi(io, ::GZIndex).

This struct contains the public property .blocks which corresponds to the vector as described above, no matter how GZIndex is constructed.

See also: index_bgzf, load_gzi, write_gzi
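The sortedness requirement can be checked on a candidate vector before passing it to GZIndex. The sketch below is plain Julia and does not call BGZFLib. The helpers block and offsets_sorted are hypothetical names, and the check assumes that "ascending" permits equal consecutive values, which this page does not specify.

```julia
# Hypothetical helper that builds one element of the vector accepted by GZIndex.
block(c, d) = (compressed_offset = UInt64(c), decompressed_offset = UInt64(d))

# Illustrative check, NOT BGZFLib's implementation: both offset columns
# must be in ascending order. Equal neighbours are accepted here, which is
# an assumption of this sketch.
function offsets_sorted(blocks)
    return issorted(blocks; by = b -> b.compressed_offset) &&
           issorted(blocks; by = b -> b.decompressed_offset)
end

good = [block(0, 0), block(44, 13), block(100, 22)]
bad = [block(0, 0), block(100, 13), block(44, 22)]

offsets_sorted(good)  # returns true
offsets_sorted(bad)   # returns false
```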

With a GZIndex, you can use get_virtual_offset to find the VirtualOffset that corresponds to a given position in the decompressed stream.

BGZFLib.get_virtual_offset - Function
get_virtual_offset(gzi::GZIndex, offset::Int)::Union{Nothing, VirtualOffset}

Get the VirtualOffset that corresponds to the zero-based offset offset in the decompressed BGZF stream indexed by gzi.

Return nothing if offset is smaller than zero, or points more than 2^16 bytes beyond the start of the final block.

Note that, because gzi files (and thus GZIndex) do not store the length of the final block, the resulting VirtualOffset may be invalid. Specifically, if the resulting VirtualOffset points bo ≤ 2^16 bytes into the final block, but the final block is shorter than bo bytes, this function will return a VirtualOffset, but using that offset to seek in the corresponding BGZF stream will error.

Examples

julia> gzi = load_gzi(CursorReader(gzi_data));

julia> get_virtual_offset(gzi, 100_000) === nothing
true

julia> vo = get_virtual_offset(gzi, 45)
VirtualOffset(223, 8)

julia> reader = virtual_seek(SyncBGZFReader(CursorReader(bgzf_data)), vo);

julia> read(reader) |> String
"tent herethis is another block"

julia> bad_vo = get_virtual_offset(gzi, 500)
VirtualOffset(323, 425)

julia> virtual_seek(reader, bad_vo);
ERROR: BGZFError: Error in block at offset 323: Seek to block offset larger than block size
[...]

julia> close(reader)
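The lookup described above amounts to a binary search over the sorted decompressed offsets. The following is a minimal sketch in plain Julia, not BGZFLib's implementation. The names block and lookup_offsets are hypothetical, the sample offsets are made up for illustration, and the sketch rejects any block offset of 2^16 or more, which may differ from the library's exact boundary.

```julia
# Hypothetical helper that builds one index entry.
block(c, d) = (compressed_offset = UInt64(c), decompressed_offset = UInt64(d))

# Illustrative lookup: map a zero-based decompressed offset to a
# (file_offset, block_offset) pair, or return nothing.
function lookup_offsets(blocks, offset::Integer)
    offset < 0 && return nothing
    starts = [b.decompressed_offset for b in blocks]
    # Index of the last block that starts at or before `offset`.
    i = searchsortedlast(starts, UInt64(offset))
    i == 0 && return nothing
    block_offset = Int(UInt64(offset) - starts[i])
    # The index does not store the final block's length, so the only
    # possible check is that the block offset fits in 16 bits.
    block_offset >= 2^16 && return nothing
    return (file_offset = Int(blocks[i].compressed_offset), block_offset = block_offset)
end

blocks = [block(0, 0), block(44, 13), block(100, 22)]

lookup_offsets(blocks, 18)       # returns (file_offset = 44, block_offset = 5)
lookup_offsets(blocks, -1)       # returns nothing
lookup_offsets(blocks, 100_000)  # returns nothing
```

As with get_virtual_offset, a result that points into the final block is not guaranteed to be valid, because the sketch cannot know how long that block is.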

Building a GZIndex

A GZIndex can be constructed manually from a correct (and sorted) vector v of the above-mentioned type using GZIndex(v). More commonly, it is either computed from a BGZF file, or loaded directly from a GZI file:

BGZFLib.index_bgzf - Function
index_bgzf(io::Union{IO, AbstractBufReader})::GZIndex

Compute a GZIndex from a BGZF file.

Throw a BGZFError if the BGZF file is invalid, or a BGZFError with BGZFErrors.insufficient_reader_space if an entire block cannot be buffered by io (this can only happen if io is an AbstractBufReader).

Indexing the file does not attempt to decompress it, and therefore does not validate that the compressed data is valid (i.e. is a valid DEFLATE payload, or that the crc32 checksum matches).

See also: load_gzi, GZIndex, write_gzi

Examples

julia> idx1 = open(index_bgzf, path_to_bgzf);

julia> idx2 = open(load_gzi, path_to_gzi);

julia> idx1.blocks == idx2.blocks
true
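Indexing without decompression is possible because each BGZF block records its own sizes. In the standard BGZF layout, the header's BSIZE field holds the total block size minus one, and the trailing ISIZE field holds the decompressed payload size. The sketch below walks a buffer block by block using only those two fields. It is plain Julia, not BGZFLib's implementation. The names le16, le32 and scan_blocks are hypothetical, and the code assumes that the BC subfield is the first extra subfield (so BSIZE sits at zero-based bytes 16 and 17) and that the input is complete and well-formed.

```julia
# The 28-byte BGZF end-of-file marker: a valid block with an empty payload.
const EOF_BLOCK = UInt8[
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0x06, 0x00,
    0x42, 0x43, 0x02, 0x00, 0x1b, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
]

# Read little-endian integers starting at the one-based index `i`.
le16(b, i) = UInt16(b[i]) | (UInt16(b[i + 1]) << 8)
le32(b, i) = UInt32(le16(b, i)) | (UInt32(le16(b, i + 2)) << 16)

# Illustrative scan: list (compressed_offset, decompressed_offset) per block.
# No validation is done; truncated input raises a BoundsError.
function scan_blocks(data::Vector{UInt8})
    entries = Tuple{Int, Int}[]
    pos = 0           # zero-based offset of the current block
    decompressed = 0  # zero-based offset in the decompressed stream
    while pos < length(data)
        # BSIZE is at zero-based bytes 16-17 of the block (assumed layout).
        block_size = Int(le16(data, pos + 17)) + 1
        push!(entries, (pos, decompressed))
        # ISIZE occupies the last four bytes of the block.
        decompressed += Int(le32(data, pos + block_size - 3))
        pos += block_size
    end
    return entries
end

scan_blocks(vcat(EOF_BLOCK, EOF_BLOCK))  # returns [(0, 0), (28, 0)]
```

Since the DEFLATE payload and the checksum are never read, a scan like this cannot detect corrupted compressed data, which mirrors the caveat stated for index_bgzf.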

BGZFLib.load_gzi - Function
load_gzi(io::Union{IO, AbstractBufReader})::GZIndex

Load a GZIndex from a GZI file.

Throw an IOError(IOErrorKinds.EOF) if io does not contain enough bytes for a valid GZI file. Throw a BGZFError(nothing, BGZFErrors.unsorted_index) if the offsets are not sorted in ascending order. Currently does not throw an error if the file contains extra appended bytes, but this may change in the future.

See also: index_bgzf, GZIndex, write_gzi

Examples

julia> gzi = open(load_gzi, path_to_gzi);

julia> gzi isa GZIndex
true

julia> (; compressed_offset) = gzi.blocks[5]
(compressed_offset = 0x0000000000000093, decompressed_offset = 0x0000000000000017)

julia> reader = SyncBGZFReader(CursorReader(bgzf_data));

julia> seek(reader, Int(compressed_offset));

julia> read(reader, 15) |> String
"then some morem"

julia> close(reader)

Writing GZI files

A GZIndex is written to a GZI file with write_gzi:

BGZFLib.write_gzi - Function
write_gzi(io::Union{AbstractBufWriter, IO}, index::GZIndex)::Int

Write a GZIndex to io in GZI format, and return the number of written bytes. Currently, this function only works on little-endian CPUs, and will throw an ErrorException on big-endian platforms.

Loading the resulting file with load_gzi yields an index equivalent to index.

See also: GZIndex, index_bgzf

Examples

julia> gzi = load_gzi(CursorReader(gzi_data))::GZIndex;

julia> io = VecWriter();

julia> write_gzi(io, gzi)
152

julia> gzi_2 = load_gzi(CursorReader(io.vec));

julia> gzi.blocks == gzi_2.blocks
true
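For readers who want to inspect a GZI file by hand, the sketch below round-trips the layout described in the bgzip documentation: a little-endian UInt64 entry count, followed by that many pairs of little-endian UInt64 compressed and decompressed offsets. It is plain Julia and does not call BGZFLib. The names encode_gzi and decode_gzi are hypothetical, the sample pairs are made up, and the sketch makes no claim about which blocks BGZFLib stores in the file.

```julia
# Illustrative encoder for the GZI layout assumed above. NOT BGZFLib's code.
function encode_gzi(pairs)
    io = IOBuffer()
    # Entry count, then each pair, all as little-endian UInt64 values.
    write(io, htol(UInt64(length(pairs))))
    for (compressed, decompressed) in pairs
        write(io, htol(UInt64(compressed)))
        write(io, htol(UInt64(decompressed)))
    end
    return take!(io)
end

# Illustrative decoder: the inverse of encode_gzi.
function decode_gzi(bytes::Vector{UInt8})
    io = IOBuffer(bytes)
    n = ltoh(read(io, UInt64))
    return [(ltoh(read(io, UInt64)), ltoh(read(io, UInt64))) for _ in 1:n]
end

pairs = [(44, 13), (100, 22)]
bytes = encode_gzi(pairs)

length(bytes)               # returns 40, i.e. 8 + 2 * 16
decode_gzi(bytes) == pairs  # returns true
```

Under this layout a file with n pairs occupies 8 + 16n bytes, so nine stored pairs would account for the 152 bytes reported in the example above. Using htol and ltoh keeps the sketch independent of the host byte order.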