PairwiseMappingFormat.jl
This package is for reading files of the PAF (Pairwise mApping Format) format, which is a simple tab-delimited format used by e.g. minimap2 and strobealign to store records representing pairwise alignment between biological sequences.
Reader
The PAFReader type wraps an IO, and is an iterator of PAFRecord objects:
reader = PAFReader(open(path_to_paf))
records = collect(reader)
close(reader)
@assert isempty(reader)
@assert typeof(records) == Vector{PAFRecord}Similar to the common open(path) do io-syntax in Julia, PAFReader takes an optional f::Function as a first argument, in which case it will apply f to the returned PAFReader object, and close the underlying io when f returns (or errors):
julia> PAFReader(open(path_to_paf)) do reader
for record in reader
println(record.qlen)
end
end
301156
299273
288659The PAFReader constructor takes an optional keyword copy, which defaults to true. If it is false, the record will iterate the same PAFRecord object, overwriting it in-place. This reduces allocations and gives a slight speed increase, but may result in bugs if records, or references of records of previous iterations are stored and unexpectedly overwritten:
julia> records = PAFReader(collect, open(path_to_paf); copy=false);
julia> println(map(i -> i.qlen, records)) # NB: All the same record!
[288659, 288659, 288659]At the moments, readers do not support seeking.
Record
The mutable PAFRecord object represents a single line in a PAF file. The individual columns of the PAF line is obtained by accessing properties of the records:
julia> record = PAFReader(first, open(path_to_paf));
julia> println(record.qname)
query1
julia> println(record.qlen)
301156
julia> println(record.mapq)
0Note that Base.getproperty is overloaded for PAFRecord, so the properties are public and stable across major versions, but may not reflect the actual underlying fields as they are stored in the PAFRecord.
Fields of PAFRecord
qname::StringView{A}.Ais an implementation detail. Names of the query sequences When the sequences are from a FASTA file, this is typically the identifier of the FASTA records, i.e. the part of the header before the first space.tname::Union{Nothing, StringView{A}}. Same asqname, but for the target (i.e. subject) sequence. May benothingif the record is unmapped.qlenandtlen. Of typeInt. Length of the full query and target sequence.qstartandtstart. Of typeInt. Leftmost (i.e. lowest) position of the alignment in the query and target, respectively. This does not take strandedness into account.qendandtend. Of typeInt. Rightmost (i.e. highest) position of the alignment in the query and target, respectively. This does not take strandedness into account. As such, we always haverecord.qend - record.qtart) >= 0, and likewise for the target.alnlen::Int. Length of alignment. That includes gaps in query and target, and may therefore not match eitherqend - qstart + 1, or that of the target.matches::Int. Number of residue matches (not mismatches) in the alignment.matches / alnlenis the alignment identity and is between 0 and 1.mapq::Union{Int, Nothing}. Mapping quality, from 0:254, where 254 is the better quality. When calibrated, the probability the mapping is correct should be 10^(-mapq / 10). A value ofnothingindicates the mapping quality is unavailable.is_rc::Union{Bool, Nothing}indicates whether the alignment matches the target on the forward (false) or reverse-complementtruestrand. A missing alignment isnothing.
The PAF format does not contain fields for alignment identity, query coverage or target coverage. However, these can be approximated with the functions aln_identity, query_coverage and query_coverage.
Besides being iterated from PAFReader, records can also be created by parsing from a bytes-like object such as a String:
data = "contig_11\t288659\t27\t288653\t+\tCP004047.1\t6701780\t5027225\t5315851\t288618\t288626\t0\ttp:A:P\tcm:i:28871\ts1:i:288618\ts2:i:288618\tdv:f:0.0000\trl:i:57"
record = parse(PAFRecord, data)
@assert record isa PAFRecordRarely, PAF records are unmapped. An unmapped record contain the query id, the query length, and any auxiliary fields. The target name and the strand of an unmapped return is nothing. You can check if a record is mapped or unmapped with the function is_mapped:
julia> unmapped = parse(PAFRecord, "myquery\t5032251\t9\t11\t*\t*\t5\t2\t1\t7\t4\t0");
julia> @assert !is_mapped(unmapped)
julia> println(unmapped.qname)
myquery
julia> println(unmapped.qlen)
5032251
julia> @assert isnothing(unmapped.tname)
julia> # The strand is given by is_rc, and is nothing for unmapped records
@assert isnothing(unmapped.is_rc)All other properties of unmapped records may contain arbitrary data, and should not be relied on. For this reason, the functions target_coverage, query_coverage and aln_identity may give nonsense answers for unmapped records:
julia> # Note! These results are arbitrary and not guaranteed to be stable
# across minor releases of this package
println(target_coverage(unmapped))
Inf
julia> println(query_coverage(unmapped))
1.9871822768776836e-7
julia> println(aln_identity(unmapped))
NaNLow-level interface
Iterating PAFReaders, and the parse function will throw a PairwiseMappingFormat.ParserException if the data is invalid:
julia> parse(PAFRecord, "not a PAF line")
ERROR: Error when parsing PAF record on line 1, near byte number 14 in line: Not enough tab-separated fields. Each line must have at least 12 fields
[...]In some circuomstances, throwing exceptions may not be acceptable, and so PairwiseMappingFormat.jl provide functionality for returning errors as values. These values can then be checked to programmatically handle error states.
The public, unexported try_parse function can be used instead of Base.parse. It either returns a valid PAFRecord, or else returns the ParserException, but does not throw the exception:
julia> const PAF = PairwiseMappingFormat;
julia> println(PAF.try_parse("not a PAF line"))
PairwiseMappingFormat.ParserException(PairwiseMappingFormat.Errors.TooFewFields, 14, 1)Similarly, the next record of a PAFReader may be obtained with the unexported try_next! function. This function will either:
- Return
nothingif the reader has no more records - Return a
ParserExceptionif the next record is invalid - Return a
PAFRecordif there is another valid record. Only in this case it will advance the position of the reader.
In other words, if a call to try_next! returns nothing or a ParserException, any future calls will return the same value forever:
reader = PAFReader(open(path_to_paf))
# Emits `PAFRecord` for valid records
records = [try_next!(reader) for i in 1:3]
@assert records isa Vector{PAFRecord}
# Indefinitely emits `nothing` once the reader is empty
nothings = [try_next!(reader) for i in 1:10]
@assert nothings isa Vector{Nothing}
close(reader)
reader = PAFReader(IOBuffer("not a PAF record"))
err = try_next!(reader)
@assert err isa PAF.ParserException
err2 = try_next!(reader)
@assert err == err2The ParserException type contains the error type as an Enum, and the line where the exception occurred. These can be obtained with the .kind and .line properties.
julia> err.line
1
julia> err.kind
TooFewFields::Err = 0The precise value of this .kind object for any given error condition is subject to change across minor versions and new values may be introduced. This is because the same error may be detected in multiple different ways.
Auxiliary fields
PAFRecords may contain extra auxiliary fields at the end of the records, similar to SAM records. Any auxiliary data is stored in the PAFRecord, and lazily parsed and validated as they are accessed. This means that a PAFRecord may contain invalid auxiliary data.
The aux_data function constructs a lazy SAM.Auxiliary object from the package XAMAuxData.jl. The precise semantics of this Auxiliary type is a little complicated, and can be found in the documentation of XAMAuxData.jl. Here is a small example:
julia> line = first(eachline(path_to_paf));
julia> aux_string = join(split(line, '\t')[13:end], '\t')
"tp:A:P\tcm:i:29990\tdv:f:0.0000\tkm:Z:some text here"
julia> record = parse(PAFRecord, line);
julia> aux = aux_data(record);
julia> isvalid(aux)
true
julia> aux["tp"]
'P': ASCII/Unicode U+0050 (category Lu: Letter, uppercase)
julia> aux["km"]
"some text here"
julia> haskey(aux, "cm")
true
julia> aux["AB"] = 5.5;
julia> haskey(aux, "AB")
trueMisc info
- The
PAFReadercannot handle trailing whitespace, except trailing newlines at the end of the file. This is because trailing whitespace may be significant in some records, e.g. if it ends with an auxiliary field of the element typeZ. A PAF record with trailing whitespace is considered invalid.
Reference
PairwiseMappingFormat.PairwiseMappingFormat — Module
module PairwiseMappingFormatParse files of the PairwisemAppingFormat (PAF) format.
PairwiseMappingFormat.PAFReader — Type
PAFReader(io::IO; copy::Bool=true)Construct a PAFReader, an iterator of PAFRecord read from io. Iterating this object returns a PAFRecord for each valid line of input, and throws a ParserException when an invalid record is found. For efficiency, the auxiliary fields are not validated until they are accessed.
If copy is false, the same record will be returned on each iteration, with its content being overwritten. This removes a few allocations per iteration, but may cause problems if records of old iterations are stored.
Examples
julia> reader = PAFReader(open(path_to_paf)); typeof(reader)
PAFReader{IOStream}
julia> typeof(first(reader))
PAFRecord
julia> PAFReader(open(path_to_paf)) do reader
for record in reader
println(record.qlen)
end
end
301156
299273
288659
julia> PAFReader(open(path_to_paf); copy=false) do reader
fst = first(reader)
all(reader) do record
record === fst
end
end
truePairwiseMappingFormat.PAFRecord — Type
PAFRecord(buffer_size::Int)Mutable type representing a single PAF line. The content of the record is accessed through its properties.
Examples
julia> record = PAFReader(first, open(path_to_paf));
julia> record.qname
"query1"
julia> record.qlen
301156See also: PAFReader
Extended help
The following properties may be used:
qname::StringView. The query name, May be empty, and contain any bytes except\tand\n.tname::Union{StringView, Nothing}. Target name. Likeqname, but isnothingif and only if the record is unmapped.qlenandtlen::Int, and gives the length of the query and target sequences, respectively. This must be > 0.qstart,qend,tstartandtend::Int, and give the starting and ending positions of the alignments on the query and target strand. These uses one-based, closed interval indexing as usual in Julia, but unlike the underlying PAF format. The ending positions are always >= the starting ones, and the ending positions are always <= the query/target lengths.matches::Intgives the number of nucleotides / residues which match (i.e. are equal) in the alignment.alnlen::Intgives the length of the alignmentmapq::Union{Int, Nothing}gives the mapping quality, ornothingif this information is unavailable.is_rc::Union{Bool,Nothing}tells if the query and target strands are reverse-complement relative to each other. Isnothingif the record is unmapped.
PairwiseMappingFormat.aux_data — Function
aux_data(rec::Record) -> XAMAuxData.SAM.AuxiliaryReturn a lazily evaluated and validated SAM.Auxiliary, which is an AbstractDict of the auxiliary fields at the end of the record. For more details about this object, read the documentation of the package XAMAuxData.jl.
Examples
julia> aux = aux_data(record);
julia> length(aux)
4
julia> aux["cm"]
29990
julia> aux["k1"] = "add a new aux string like this";
julia> haskey(aux, "k1")
truePairwiseMappingFormat.aln_identity — Function
aln_identity(rec::Record) -> Float64Return the nucleotide / residue identity, defined as the number of matches divided by the alignment length. This is a number between 0 and 1.
Examples
julia> aln_identity(record)
1.0See also: query_coverage, target_coverage
PairwiseMappingFormat.query_coverage — Function
query_coverage(rec::Record) -> Float64Compute the fraction of the query covered by the alignment.
Examples
julia> query_coverage(record)
0.9999535124653004See also: target_coverage, aln_identity
PairwiseMappingFormat.target_coverage — Function
target_coverage(rec::Record) -> Float64Compute the fraction of the target covered by the alignment.
Examples
julia> target_coverage(record)
0.044934629307437725See also: query_coverage, aln_identity
PairwiseMappingFormat.try_next! — Function
try_next!(reader::PAFReader) -> Union{Nothing, PAFRecord, ParserException}Try to read a record from the PAFReader and advance the reader. This may yield one of three options:
- If the reader is empty, return
nothingand do not advance the reader - If the next record is invalid, return (do not throw) a
ParserExceptionand do not advance the reader - If the next record is valid, return it as a
PAFRecordand advance the reader
Even if the reader itself is not advanced, it may still fill its internal buffer, advancing the IO object it wraps.
Examples
julia> reader = PAFReader(open(path_to_paf));
julia> try_next!(reader) isa PAFRecord
true
julia> [try_next!(reader) for i in 1:2] isa Vector{PAFRecord}
true
julia> typeof([try_next!(reader) for i in 1:100])
Vector{Nothing} (alias for Array{Nothing, 1})
julia> close(reader); reader = PAFReader(IOBuffer("bad data"));
julia> try_next!(reader) isa PAF.ParserException
true
julia> all(i -> try_next!(reader) isa PAF.ParserException, 1:100)
truePairwiseMappingFormat.is_mapped — Function
is_mapped(x::PAFRecord) -> BoolCompute whether the PAFRecord is mapped. An unmapped record will have the properties is_rc and tname unavailable. The properties qname and qlen, and the auxiliary data of an unmapped record can be relied on, but the remaining properties contain arbitrary data. Note that some PAF parsers do not handle unmapped records correctly, so be wary when writing unmapped records.
Examples
julia> is_mapped(record)
trueSee also: PAFRecord
PairwiseMappingFormat.Errors — Module
module ErrorsWrapper module used to contain the instances of error types, and not pollute the namespace of the package.
Examples
julia> print(PAF.Errors.InvalidZero)
InvalidZeroSee also: PairwiseMappingFormat.ParserException
PairwiseMappingFormat.try_parse — Function
try_parse(x) -> Union{PAFRecord, ParserException}Try parsing x into a PAFRecord. x may be any type that implements MemoryView, such as String or Vector{UInt8}.
Examples
julia> valid_line = open(readline, path_to_paf);
julia> typeof(PAF.try_parse(valid_line))
PAFRecord
julia> typeof(PAF.try_parse("invalid string"))
PairwiseMappingFormat.ParserExceptionPairwiseMappingFormat.ParserException — Type
ParserExceptionThe exception type thrown by iterating PAFReader, or a failed parse of a PAFRecord. The functions try_parse and try_next! may return (not throw) values of this type. The line the error occurred may be accessed with the .line field. The error kind may be accessed with the .kind field.
Examples
julia> err = PAF.try_parse("abc");
julia> err.line
1
julia> print(err.kind)
TooFewFields