PairwiseMappingFormat.jl

This package is for reading files of the PAF (Pairwise mApping Format) format, which is a simple tab-delimited format used by e.g. minimap2 and strobealign to store records representing pairwise alignment between biological sequences.

Reader

The PAFReader type wraps an IO, and is an iterator of PAFRecord objects:

reader = PAFReader(open(path_to_paf))
records = collect(reader)
close(reader)

@assert isempty(reader)
@assert typeof(records) == Vector{PAFRecord}

Similar to the common open(path) do io-syntax in Julia, PAFReader takes an optional f::Function as a first argument, in which case it will apply f to the returned PAFReader object, and close the underlying io when f returns (or errors):

julia> PAFReader(open(path_to_paf)) do reader
           for record in reader
               println(record.qlen)
           end
       end
301156
299273
288659

The PAFReader constructor takes an optional keyword copy, which defaults to true. If it is false, the record will iterate the same PAFRecord object, overwriting it in-place. This reduces allocations and gives a slight speed increase, but may result in bugs if records, or references of records of previous iterations are stored and unexpectedly overwritten:

julia> records = PAFReader(collect, open(path_to_paf); copy=false);

julia> println(map(i -> i.qlen, records)) # NB: All the same record!
[288659, 288659, 288659]

At the moments, readers do not support seeking.

Record

The mutable PAFRecord object represents a single line in a PAF file. The individual columns of the PAF line is obtained by accessing properties of the records:

julia> record = PAFReader(first, open(path_to_paf));

julia> println(record.qname)
query1

julia> println(record.qlen)
301156

julia> println(record.mapq)
0

Note that Base.getproperty is overloaded for PAFRecord, so the properties are public and stable across major versions, but may not reflect the actual underlying fields as they are stored in the PAFRecord.

Fields of PAFRecord

  • qname::StringView{A}. A is an implementation detail. Names of the query sequences When the sequences are from a FASTA file, this is typically the identifier of the FASTA records, i.e. the part of the header before the first space.
  • tname::Union{Nothing, StringView{A}}. Same as qname, but for the target (i.e. subject) sequence. May be nothing if the record is unmapped.
  • qlen and tlen. Of type Int. Length of the full query and target sequence.
  • qstart and tstart. Of type Int. Leftmost (i.e. lowest) position of the alignment in the query and target, respectively. This does not take strandedness into account.
  • qend and tend. Of type Int. Rightmost (i.e. highest) position of the alignment in the query and target, respectively. This does not take strandedness into account. As such, we always have record.qend - record.qtart) >= 0, and likewise for the target.
  • alnlen::Int. Length of alignment. That includes gaps in query and target, and may therefore not match either qend - qstart + 1, or that of the target.
  • matches::Int. Number of residue matches (not mismatches) in the alignment. matches / alnlen is the alignment identity and is between 0 and 1.
  • mapq::Union{Int, Nothing}. Mapping quality, from 0:254, where 254 is the better quality. When calibrated, the probability the mapping is correct should be 10^(-mapq / 10). A value of nothing indicates the mapping quality is unavailable.
  • is_rc::Union{Bool, Nothing} indicates whether the alignment matches the target on the forward (false) or reverse-complement true strand. A missing alignment is nothing.

The PAF format does not contain fields for alignment identity, query coverage or target coverage. However, these can be approximated with the functions aln_identity, query_coverage and query_coverage.

Besides being iterated from PAFReader, records can also be created by parsing from a bytes-like object such as a String:

data = "contig_11\t288659\t27\t288653\t+\tCP004047.1\t6701780\t5027225\t5315851\t288618\t288626\t0\ttp:A:P\tcm:i:28871\ts1:i:288618\ts2:i:288618\tdv:f:0.0000\trl:i:57"

record = parse(PAFRecord, data)
@assert record isa PAFRecord

Rarely, PAF records are unmapped. An unmapped record contain the query id, the query length, and any auxiliary fields. The target name and the strand of an unmapped return is nothing. You can check if a record is mapped or unmapped with the function is_mapped:

julia> unmapped = parse(PAFRecord, "myquery\t5032251\t9\t11\t*\t*\t5\t2\t1\t7\t4\t0");

julia> @assert !is_mapped(unmapped)

julia> println(unmapped.qname)
myquery

julia> println(unmapped.qlen)
5032251

julia> @assert isnothing(unmapped.tname)

julia> # The strand is given by is_rc, and is nothing for unmapped records
       @assert isnothing(unmapped.is_rc)

All other properties of unmapped records may contain arbitrary data, and should not be relied on. For this reason, the functions target_coverage, query_coverage and aln_identity may give nonsense answers for unmapped records:

julia> # Note! These results are arbitrary and not guaranteed to be stable
       # across minor releases of this package
       println(target_coverage(unmapped))
Inf

julia> println(query_coverage(unmapped))
1.9871822768776836e-7

julia> println(aln_identity(unmapped))
NaN

Low-level interface

Iterating PAFReaders, and the parse function will throw a PairwiseMappingFormat.ParserException if the data is invalid:

julia> parse(PAFRecord, "not a PAF line")
ERROR: Error when parsing PAF record on line 1, near byte number 14 in line: Not enough tab-separated fields. Each line must have at least 12 fields
[...]

In some circuomstances, throwing exceptions may not be acceptable, and so PairwiseMappingFormat.jl provide functionality for returning errors as values. These values can then be checked to programmatically handle error states.

The public, unexported try_parse function can be used instead of Base.parse. It either returns a valid PAFRecord, or else returns the ParserException, but does not throw the exception:

julia> const PAF = PairwiseMappingFormat;

julia> println(PAF.try_parse("not a PAF line"))
PairwiseMappingFormat.ParserException(PairwiseMappingFormat.Errors.TooFewFields, 14, 1)

Similarly, the next record of a PAFReader may be obtained with the unexported try_next! function. This function will either:

  • Return nothing if the reader has no more records
  • Return a ParserException if the next record is invalid
  • Return a PAFRecord if there is another valid record. Only in this case it will advance the position of the reader.

In other words, if a call to try_next! returns nothing or a ParserException, any future calls will return the same value forever:

reader = PAFReader(open(path_to_paf))

# Emits `PAFRecord` for valid records
records = [try_next!(reader) for i in 1:3]
@assert records isa Vector{PAFRecord}

# Indefinitely emits `nothing` once the reader is empty
nothings = [try_next!(reader) for i in 1:10]
@assert nothings isa Vector{Nothing}

close(reader)

reader = PAFReader(IOBuffer("not a PAF record"))
err = try_next!(reader)

@assert err isa PAF.ParserException
err2 = try_next!(reader)
@assert err == err2

The ParserException type contains the error type as an Enum, and the line where the exception occurred. These can be obtained with the .kind and .line properties.

julia> err.line
1

julia> err.kind
TooFewFields::Err = 0

The precise value of this .kind object for any given error condition is subject to change across minor versions and new values may be introduced. This is because the same error may be detected in multiple different ways.

Auxiliary fields

PAFRecords may contain extra auxiliary fields at the end of the records, similar to SAM records. Any auxiliary data is stored in the PAFRecord, and lazily parsed and validated as they are accessed. This means that a PAFRecord may contain invalid auxiliary data.

The aux_data function constructs a lazy SAM.Auxiliary object from the package XAMAuxData.jl. The precise semantics of this Auxiliary type is a little complicated, and can be found in the documentation of XAMAuxData.jl. Here is a small example:

julia> line = first(eachline(path_to_paf));

julia> aux_string = join(split(line, '\t')[13:end], '\t')
"tp:A:P\tcm:i:29990\tdv:f:0.0000\tkm:Z:some text here"

julia> record = parse(PAFRecord, line);

julia> aux = aux_data(record);

julia> isvalid(aux)
true

julia> aux["tp"]
'P': ASCII/Unicode U+0050 (category Lu: Letter, uppercase)

julia> aux["km"]
"some text here"

julia> haskey(aux, "cm")
true

julia> aux["AB"] = 5.5;

julia> haskey(aux, "AB")
true

Misc info

  • The PAFReader cannot handle trailing whitespace, except trailing newlines at the end of the file. This is because trailing whitespace may be significant in some records, e.g. if it ends with an auxiliary field of the element type Z. A PAF record with trailing whitespace is considered invalid.

Reference

PairwiseMappingFormat.PAFReaderType
PAFReader(io::IO; copy::Bool=true)

Construct a PAFReader, an iterator of PAFRecord read from io. Iterating this object returns a PAFRecord for each valid line of input, and throws a ParserException when an invalid record is found. For efficiency, the auxiliary fields are not validated until they are accessed.

If copy is false, the same record will be returned on each iteration, with its content being overwritten. This removes a few allocations per iteration, but may cause problems if records of old iterations are stored.

Examples

julia> reader = PAFReader(open(path_to_paf)); typeof(reader)
PAFReader{IOStream}

julia> typeof(first(reader))
PAFRecord

julia> PAFReader(open(path_to_paf)) do reader
           for record in reader
                println(record.qlen)
           end
       end
301156
299273
288659

julia> PAFReader(open(path_to_paf); copy=false) do reader
           fst = first(reader)
           all(reader) do record
               record === fst
           end
       end
true

See also: PAFRecord, try_next!

PairwiseMappingFormat.PAFRecordType
PAFRecord(buffer_size::Int)

Mutable type representing a single PAF line. The content of the record is accessed through its properties.

Examples

julia> record = PAFReader(first, open(path_to_paf));

julia> record.qname
"query1"

julia> record.qlen
301156

See also: PAFReader

Extended help

The following properties may be used:

  • qname::StringView. The query name, May be empty, and contain any bytes except \t and \n.
  • tname::Union{StringView, Nothing}. Target name. Like qname, but is nothing if and only if the record is unmapped.
  • qlen and tlen::Int, and gives the length of the query and target sequences, respectively. This must be > 0.
  • qstart, qend, tstart and tend::Int, and give the starting and ending positions of the alignments on the query and target strand. These uses one-based, closed interval indexing as usual in Julia, but unlike the underlying PAF format. The ending positions are always >= the starting ones, and the ending positions are always <= the query/target lengths.
  • matches::Int gives the number of nucleotides / residues which match (i.e. are equal) in the alignment.
  • alnlen::Int gives the length of the alignment
  • mapq::Union{Int, Nothing} gives the mapping quality, or nothing if this information is unavailable.
  • is_rc::Union{Bool,Nothing} tells if the query and target strands are reverse-complement relative to each other. Is nothing if the record is unmapped.
PairwiseMappingFormat.aux_dataFunction
aux_data(rec::Record) -> XAMAuxData.SAM.Auxiliary

Return a lazily evaluated and validated SAM.Auxiliary, which is an AbstractDict of the auxiliary fields at the end of the record. For more details about this object, read the documentation of the package XAMAuxData.jl.

Examples

julia> aux = aux_data(record);

julia> length(aux)
4

julia> aux["cm"]
29990

julia> aux["k1"] = "add a new aux string like this";

julia> haskey(aux, "k1")
true
PairwiseMappingFormat.try_next!Function
try_next!(reader::PAFReader) -> Union{Nothing, PAFRecord, ParserException}

Try to read a record from the PAFReader and advance the reader. This may yield one of three options:

  • If the reader is empty, return nothing and do not advance the reader
  • If the next record is invalid, return (do not throw) a ParserException and do not advance the reader
  • If the next record is valid, return it as a PAFRecord and advance the reader

Even if the reader itself is not advanced, it may still fill its internal buffer, advancing the IO object it wraps.

Examples

julia> reader = PAFReader(open(path_to_paf));

julia> try_next!(reader) isa PAFRecord
true

julia> [try_next!(reader) for i in 1:2] isa Vector{PAFRecord}
true

julia> typeof([try_next!(reader) for i in 1:100])
Vector{Nothing} (alias for Array{Nothing, 1})

julia> close(reader); reader = PAFReader(IOBuffer("bad data"));

julia> try_next!(reader) isa PAF.ParserException
true

julia> all(i -> try_next!(reader) isa PAF.ParserException, 1:100)
true
PairwiseMappingFormat.is_mappedFunction
is_mapped(x::PAFRecord) -> Bool

Compute whether the PAFRecord is mapped. An unmapped record will have the properties is_rc and tname unavailable. The properties qname and qlen, and the auxiliary data of an unmapped record can be relied on, but the remaining properties contain arbitrary data. Note that some PAF parsers do not handle unmapped records correctly, so be wary when writing unmapped records.

Examples

julia> is_mapped(record)
true

See also: PAFRecord

PairwiseMappingFormat.try_parseFunction
try_parse(x) -> Union{PAFRecord, ParserException}

Try parsing x into a PAFRecord. x may be any type that implements MemoryView, such as String or Vector{UInt8}.

Examples

julia> valid_line = open(readline, path_to_paf);

julia> typeof(PAF.try_parse(valid_line))
PAFRecord

julia> typeof(PAF.try_parse("invalid string"))
PairwiseMappingFormat.ParserException

See also: PAFRecord, try_next!

PairwiseMappingFormat.ParserExceptionType
ParserException

The exception type thrown by iterating PAFReader, or a failed parse of a PAFRecord. The functions try_parse and try_next! may return (not throw) values of this type. The line the error occurred may be accessed with the .line field. The error kind may be accessed with the .kind field.

Examples

julia> err = PAF.try_parse("abc");

julia> err.line
1

julia> print(err.kind)
TooFewFields