Creating a Reader type
The use of generate_reader, as we learned in the previous section "Parsing from an io", has an issue we need to address: while we were able to read multiple records from the reader by calling read_record multiple times, no state was preserved between these calls, so nothing could be carried over from one record to the next. This is also what made it necessary to clumsily reset p after emitting each record.
Imagine you have a format with two kinds of records, A and B types. A records must come before B records in the file. Hence, while a B record can appear at any time, once you've seen a B record, there can't be any more A records. When reading records from the file, you must be able to store whether you've seen a B record.
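The bookkeeping this requires can be sketched in plain Julia, independent of Automa. The names OrderState and check_order! are hypothetical and exist only for this illustration; the point is merely that the flag must live somewhere that survives between calls.

```julia
# Hypothetical state holder for illustration; not part of Automa.
mutable struct OrderState
    seen_b::Bool
end
OrderState() = OrderState(false)

# Record that a record of the given kind (:A or :B) was read,
# and raise an error if an A record follows a B record.
function check_order!(state::OrderState, kind::Symbol)
    if kind === :B
        state.seen_b = true
    elseif kind === :A
        state.seen_b && error("A record found after a B record")
    else
        throw(ArgumentError("unknown record kind: $kind"))
    end
    return state
end
```

A reader would call check_order! once per record, passing the same OrderState object every time.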
We address this by creating a Reader type which wraps the IO being parsed, and which stores any state we want to preserve between records. Let's stick to our simplified FASTA format, parsing sequences into Seq objects:
struct Seq
    name::String
    seq::String
end
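using Automa

machine = let
    # A header is one or more lowercase letters; a sequence line is one or more of A, C, G, T
    header = onexit!(onenter!(re"[a-z]+", :mark_pos), :header)
    seqline = onexit!(onenter!(re"[ACGT]+", :mark_pos), :seqline)
    # A record is '>' followed by a header line and at least one sequence line
    record = onexit!(re">" * header * '\n' * rep1(seqline * '\n'), :record)
    compile(rep(record))
end
@assert machine isa Automa.Machine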
This time, we use the following Reader type:
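using TranscodingStreams: TranscodingStream, NoopStream

mutable struct Reader{S <: TranscodingStream}
    io::S
    automa_state::Int
end
# The Automa machine always starts in state 1
Reader(io::TranscodingStream) = Reader{typeof(io)}(io, 1)
# Wrap any other IO in a NoopStream so it can be read as a TranscodingStream
Reader(io::IO) = Reader(NoopStream(io))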
The Reader contains an instance of TranscodingStream to read from, and stores the Automa state between records. The beginning state of Automa is always 1. We can now create our reader function like below. There are only three differences from the definitions in the previous section:
- I no longer have the code to decrement p in the :record action. Because we can store the Automa state between records, the machine can handle beginning in the middle of a record if necessary, so there is no need to reset the value of p in order to restore the IO to the state right before each record.
- I return (cs, seq) instead of just seq, because I want to update the Automa state of the Reader, so that when it reads the next record, it begins in the same state where the machine left off after the previous record.
- In the arguments, I add start_state, and in the initcode I set cs to the start state, so the machine begins from the correct state.
actions = Dict{Symbol, Expr}(
    :mark_pos => :(@mark),
    :header => :(header = String(data[@markpos():p-1])),
    :seqline => :(append!(seqbuffer, data[@markpos():p-1])),
    :record => quote
        seq = Seq(header, String(seqbuffer))
        found_sequence = true
        @escape
    end
)

generate_reader(
    :read_record,
    machine;
    actions=actions,
    arguments=(:(start_state::Int),),
    initcode=quote
        seqbuffer = UInt8[]
        found_sequence = false
        header = ""
        cs = start_state
    end,
    loopcode=quote
        if (is_eof && p > p_end) || found_sequence
            @goto __return__
        end
    end,
    returncode=:(found_sequence ? (cs, seq) : throw(EOFError()))
) |> eval
We then create a function that reads from the Reader, making sure to update the automa_state of the reader:
function read_record(reader::Reader)
    (cs, seq) = read_record(reader.io, reader.automa_state)
    reader.automa_state = cs
    return seq
end
Let's test it out:
julia> reader = Reader(IOBuffer(">a\nT\n>tag\nGAG\nATATA\n"));
julia> read_record(reader)
Seq("a", "T")
julia> read_record(reader)
Seq("tag", "GAGATATA")
julia> read_record(reader)
ERROR: EOFError: read end of file
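Since the end of the input is signalled by an EOFError, one way to collect every record is to keep calling the read function until that error appears. The helper below is only a sketch under that assumption: read_all is a hypothetical name, not part of Automa, and it takes any zero-argument function, so it can be tried here with a stand-in instead of a real Reader.

```julia
# Hypothetical helper, not part of Automa: call `read_one` repeatedly
# and collect the results until it throws an EOFError.
function read_all(read_one)
    records = Any[]
    while true
        try
            push!(records, read_one())
        catch err
            # Only EOFError means "no more records"; anything else is a real error.
            err isa EOFError || rethrow()
            break
        end
    end
    return records
end

# Stand-in for `() -> read_record(reader)`, so the sketch runs without Automa.
fake_records = ["first", "second"]
fake_read() = isempty(fake_records) ? throw(EOFError()) : popfirst!(fake_records)

@assert read_all(fake_read) == ["first", "second"]
```

With the Reader defined above, the call would presumably look like read_all(() -> read_record(reader)). Errors other than EOFError, such as a parsing failure, are rethrown rather than swallowed.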