Find this notebook at https://github.com/jakobnissen//assets/posts/automa1/automa_tutorial
In bioinformatics, we have a saying:
The first step of any bioinformatics project is to define a new file format, incompatible with all previous ones.
The situation might not be quite as bad as the saying implies, but it is true that we have a lot of different file formats, representing the various kinds of data we work with.
For that reason, creating file parsers is a central task in bioinformatics, and has almost become a craft in itself. An awful, awful craft. For anything but the simplest formats, designing robust parsers is hard, and it is difficult to know if a parser you've built is solid. Insignificant details like trailing whitespace can break a seemingly well-built parser.
At its core, a parser is a program that checks whether some input data comforms to some format. For bioinformatics, we typically also want to load the data in the file into some data structure to work on it. The format itself is typically specified in some kind of higher-level, but unambiguous language, like this description of the Newick format. A parser then, can be viewed as a program that:
(
, )
, ,
or ;
match the pattern [A-Za-z_]
The central job of a parser is then to act as a translator between different abstraction levels: The format is specified in a high-level language, but the actual data processing must happen at a lower level language.
When described like that, parser generation seems quite analogous to code compilation, and begins too look like a machine's job, not a human's. Machines are fast and accurate, and do not make oversights or mistakes - exactly the characteristics you want when making a parser. Let's leave the machine tasks to the machines; we humans ought to put our effort into what we do best: Specifying the formats themselves.
In this tutorial, I will introduce Automa, a Julia package for creating parsers. Actually, when reading this you might be able to tell that Automa solves a problem slightly more general than just creating parsers, but creating parsers is what we will use it for in this tutorial. Automa is not an easy package to use; it is complex and a little opaque, but it's worth all the effort and more. Parsers generated by Automa have several advantages over human-made parsers:
In this first part of the tutorial I will show how to create parsers from data loaded into memory in order to highlight Automa itself. In the second part, which I will release at a later date, I will show how to create a proper BioJulia reader and writer for a file format.
For our example, let us begin with a simlified version of the FASTA format - we can call it “SimpleFasta”. Our format will be defined as something that looks like this:
>first
TGATAGTAGTCGTGTGATA
>second
AGGACCCAATTATCGGGGTAA
I.e.
>
followed by one or more characters from the alphabet a-zWe can visualize a hypothetical SimpleFasta parser using a simple flowchart - note: You need to have graphviz
installed to be able to run the code below.
s = """digraph {
graph [ rankdir = LR ];
begin [ shape = circle ];
begin -> header;
header [ shape = circle ];
header -> sequence;
sequence [ shape = circle ];
sequence -> end;
end [ shape = circle ];
sequence -> header;
begin -> end;
}
"""
function display_machine()
write("/tmp/fasta_machine.dot", s)
run(`dot -Tsvg -o /tmp/fasta_machine.svg /tmp/fasta_machine.dot`)
open("/tmp/fasta_machine.svg") do file
display("image/svg+xml", String(read(file)))
end
end;
display_machine()
After the beginning of the file, we expect to see a header. Next, we move to the sequence. From there, we can either reach the end of the file, or loop back to another header. Alternatively, in the case of an empty file, we may move from the beginning of the file to the end immediately without any headers or sequences. We consider an empty file to also be specification-compliant.
Note that this simple flowchart describes any valid SimpleFasta format. But if, for example, a file had two consecutive header lines, there is no path we can take through the flowchart from “begin” to “end” which describes the file, and we can tell that the input did not conform to the format.
The circles marked “begin”, “header”, “sequence” or “end” are what we call the four states. Because the parser shown in the diagram has a fixed list of possible states that it changes between, we call the parser a finite state machine (FSM), or finite state automaton (FSA).
The flowchart above doesn't do a very good job of describing the FSM that our parser is, because it doesn't tell you how the machine transitions between states. The state transitions are what actually defines the format. For example, we know we are transitioning to a header when we see a >
symbol - that is, in a way, what defines the header state.
In the flowchart below, the arrows signifying transitions between states are marked with the input character(s) that causes the FSM (our parser) to transition, written in a regex-like notation. For example, the transition from a header to a sequence is defined by a \n
followed by one of the characters in [ACGT]
. “EOF” signifies end of file. Note also that some states have transitions leading to itself. For example, when in the “header” state, the FSM may continue to read characters in a-z
and stay in the “header” state.
s = """digraph {
graph [ rankdir = LR ];
begin [ shape = circle ];
header [ shape = circle ];
sequence [ shape = circle ];
end [ shape = circle ];
begin -> header [ label = ">[a-z]" ];
header -> header [ label = "[a-z]"];
header -> sequence [ label = "\\\\n[ACGT]" ];
sequence -> sequence [ label = "[ACGT]" ];
sequence -> header [ label = "\\\\n>" ];
sequence -> end [ label = "\\\\nEOF" ];
begin -> end;
}
"""
display_machine()
How does the machine know when it's supposed to transition between states? Suppose a machine is at the beginning of a file and it begins by observing the input >
. Is this part of a transition to the header state, or the beginning of a malformed input?
It is certainly possible for the machine to have a memory of the pattern, such that it only transitions to the “header” state once it sees the two specific inputs >
and [a-z]
in that order. However, we can make things much simpler by requiring the machine to make a state transition for every single input byte, then we don't need any memory other than what state the machine is in, which is a single integer. To still represent the same format at the flowchart above, we need to insert some extra states:
s = """digraph {
graph [ rankdir = LR ];
begin [ shape = circle ];
2 [ shape = circle ];
1 [ shape = circle ];
3 [ shape = circle ];
begin -> 1 [ label = ">" ];
header [ shape = circle ];
1 -> header [label = "[a-z]" ];
header -> header [ label = "[a-z]" ];
2 -> sequence [ label = "[ACGT]" ];
header -> 2 [ label = "\\\\n" ];
sequence [ shape = circle ];
sequence -> sequence [ label = "[ACGT]" ];
sequence -> 3 [ label = "\\\\n" ];
3 -> 1 [ label = ">" ];
end [ shape = circle ];
3 -> end [ label = "EOF" ];
"begin" -> end [ label = "EOF" ];
}
"""
display_machine()
The diagram above is exactly equivalent to the previous one, but this diagram makes a state transition for every single input byte. All bytes cause a transition, and no transition consumes multiple bytes. For simplicity, Automa's parsers work like that.
Note that it is always possible, given a FSM with multi-byte transitions to convert it to a FSM with single-byte transitions by just putting in more nodes. E.g. the transition from header to sequence requires two bytes, a newline and an A, C, G or T. Therefore, an intermediate note (on the illustration marked “2”) is needed. Unfortunately, the requirement of single-byte transitions limits what you can do with Automa - I'll come back to the limitations later.
Now let's create this same FSM using Automa.
# First imports
import Automa
import Automa.RegExp: @re_str
const re = Automa.RegExp;
We define the elements that make up the SimpleFasta (the header and the sequence) using Automa.RegExp
, a subset of regular expressions. Automa's regular expressions can be manipulated with the following operations:
Function
Alias
Meaning
cat(re...)
*
concatenation
alt(re1, re2...)
|
alternation
rep(re)
zero or more repetition
rep1(re)
one or more repetition
opt(re)
zero or one repetition
isec(re1, re2)
&
intersection
diff(re1, re2)
&bsol
| difference (subtraction) |
neg(re)
!
negation
Furthermore inside the re-strings themselves, you can use
[ACGT]
representing the set {A, C, G, T}
A-Z
representing 'A':'Z'
(or more accurately, the bytes UInt8('A'):UInt8('Z')
*
, representing zero or more repetitions of the last element, such as [a-z]*
+
, representing one or more repetitions of the last element, such as [a-z]+
?
, representing zero or one repetition of the last element, such as [a-z]?
We can specify our FSM like below. We use a let
statement only so we don't pollute global namespace with header
, sequence
etc.
machine = let
# Define SimpleFasta syntax
header = re">[a-z]+"
sequence = re"[ACGT]+"
record = header * re"\n" * sequence * re"\n"
fasta = re.rep(record)
# Compile the regex to a FSM
Automa.compile(fasta)
end;
If you have dot
installed on your computer, you can visualize the compiled FSM. You might want to modify the paths in the function below to make it work on your computer.
function display_machine(machine)
write("/tmp/fasta_machine.dot", Automa.machine2dot(machine))
run(`dot -Tsvg -o /tmp/fasta_machine.svg /tmp/fasta_machine.dot`)
open("/tmp/fasta_machine.svg") do file
display("image/svg+xml", String(read(file)))
end
end;
display_machine(machine)
Yes, that looks correct! The start node here is the small point on the left. You can see one of the states are represented by a double circle. This represents an accept state, where the machine may stop at a valid end of input - here after each SimpleFasta record. If the machine stops at any other state, it did not complete correctly. In other words, state “1” here doubles as the “end” state.
With this parser we can read a SimpleFasta file and determine if it conforms to the specification. In Automa, we do this by having our machine create Julia code in the form of a Julia expression (of type Expr
), which we can then use to create a function using metaprogramming.
First, we need an Automa.CodeGenContext
, an object which contains the options for code generation. For now, we don't worry about the settings and just make a default context object:
context = Automa.CodeGenContext();
Then, we use two functions to generate code:
Automa.generate_init_code(::CodeGenContext, ::Machine)
creates the Julia code needed to initialize the parsing of the given FSM.Automa.generate_exec_code(::CodeGenContext, ::Machine, ::Dict{Symbol, Expr})
creates the Julia code needed for the main loop of the running parser. Here, the Dict
is an action dict, which may contain arbitrary Julia code that the parser executes while parsing. We need that action dict later to make our parser do something useful - but for now, we simply want to create a parser, not necessarily a useful one, so we just use an empty dict. We will come back to the action dict later.Let's have a look at these functions:
Automa.generate_init_code(context, machine)
quote
#= /home/jakob/.julia/packages/Automa/sfHZ6/src/codegen.jl:94 =#
p::Int = 1
#= /home/jakob/.julia/packages/Automa/sfHZ6/src/codegen.jl:95 =#
p_end::Int = 0
#= /home/jakob/.julia/packages/Automa/sfHZ6/src/codegen.jl:96 =#
p_eof::Int = -1
#= /home/jakob/.julia/packages/Automa/sfHZ6/src/codegen.jl:97 =#
cs::Int = 1
end
Besides the comments, the init code seems to define four integers. What to they mean?
Automa works by reading the data byte by byte. When doing this, the data is represented in the function as an AbstractVector{UInt8}
- let us call that data
. p
represents the index of data
that the parser is currently reading from. p_end
represents the last position of the data in data
. Because data
may just be a small, buffered part of a much larger file being parsed, p_eof
represents the index of actual end of input. Last, cs
is short for “current state”, and is an integer representing the state of the FSM. You can see that sensibly, it is initialized at state 1, and so, by looking at the FSM diagram, we can see it next expects a ‘>’. The zero state is special, and represents the “accept state”, that is, the state of a correctly exited FSM.
The code generated by Automa.generate_exec_code
is a little more complicated, and also depends on the CodeGenContext
. But it will increment p
by one, read a byte from data
, and execute a state transition (i.e. change cs
) depending on the byte. If cs
goes negative, it knows the input is malformed. If cs
is zero, the parser has finished. If cs
is positive, it will continue until end of file.
Alright, let's create a parser using these functions!
@eval function parse_fasta(data::Union{String,Vector{UInt8}})
$(Automa.generate_init_code(context, machine))
# p_end and p_eof were set to 0 and -1 in the init code,
# we need to set them to the end of input, i.e. the length of `data`.
p_end = p_eof = lastindex(data)
# We just use an empty dict here because we don't want our parser
# to do anything just yet - only parse the input
$(Automa.generate_exec_code(context, machine, Dict{Symbol, Expr}()))
# We need to make sure that we reached the accept state, else the
# input did not parse correctly
iszero(cs) || error("failed to parse on byte ", p)
return nothing
end;
Does it work?
for (inputno, data) in enumerate([
"",
">hi\nACGT\n",
">hi\nACGT",
">hi\nAA\n>x\nGTGATCGTAGTA\n",
">hi\nthere!"
])
try
parse_fasta(data)
println("Input $inputno: parsed successfully!")
catch e
println("Input $inputno: ", e)
end
end
Input 1: parsed successfully!
Input 2: parsed successfully!
Input 3: ErrorException("failed to parse on byte 9")
Input 4: parsed successfully!
Input 5: ErrorException("failed to parse on byte 5")
Excellent. It correctly parsed the empty data, input 2 and 4. Input 3 lacked a trailing newline, and input 5 didn't have a proper sequence.
But just parsing a file is not too useful. Normally, we parse a file in order to load it into some kind of datastructure to manipulate it. That's where the action dict comes into the picture. We can make Automa execute arbitrary Julia code during parsing. Here is how it works:
Symbol
s to certain regexes. For example, we may define that exiting the regex sequence
should execute the action :exit_sequence
. Like other Julia identifiers, the name can be chosen arbitrarily.Dict{Symbol, Expr}
, we can map action names to Julia code. This tells Automa what each action name means.First, we redefine our machine. This time, we add actions:
machine = let
# Define SimpleFasta syntax
newline = re"\n"
header = re">[a-z]+"
sequence = re"[ACGT]+"
record = header * newline * sequence * newline
fasta = re.rep(record)
# Define SimpleFasta actions using arbitrary names
header.actions[:enter] = [:enter]
header.actions[:exit] = [:exit_header]
sequence.actions[:enter] = [:enter]
sequence.actions[:exit] = [:exit_sequence]
record.actions[:exit] = [:exit_record]
Automa.compile(fasta)
end;
If we visualize the machine now, the actions will be printed on the relevant edges in the FSM diagram. For example, the newline after a header is labeled '\n'/exit_header
, meaning the transition will happen at the input byte '\n'
, and it will execute the action exit_header
.
Also note that the diagram structure itself changed, because the entering state and exit state are now distinct. The exit state 6 will execute :exit_record
at the end of file (EOF), whereas if the FSM ends at state 1, :exit_record
is not executed.
display_machine(machine)
Let's also define a simple data structure to parse the SimpleFasta records into. Like our SimpleFasta format, let's not overcomplicate things:
using BioSequences
struct FastaRecord{A}
header::String
sequence::LongSequence{A}
end
FastaRecord(h, sequence::LongSequence{A}) where A = FastaRecord{A}(h, sequence);
And now our action dict. I can use the variables data
, p
, p_end
etc. in the code:
actions = Dict(
# Record which byte position marks the beginning of header or sequence
:enter => quote
mark = p
end,
:exit_header => quote
header = String(data[mark+1:p-1])
end,
:exit_sequence => quote
sequence = LongSequence{A}(data[mark:p-1])
end,
:exit_record => quote
record = FastaRecord(header, sequence)
push!(records, record)
end,
);
@eval function parse_fasta(::Type{A}, data::Union{String,Vector{UInt8}}) where {A <: Alphabet}
# We can't control the order that the code from out action dict is executed, so
# we define variables beforehand so we can be sure they are not used before
# they are defined.
mark = 0
records = FastaRecord{A}[]
header = ""
sequence = LongSequence{A}()
# The rest is just like in our previous example, except now we use the
# action dict we just created.
$(Automa.generate_init_code(context, machine))
p_end = p_eof = lastindex(data)
$(Automa.generate_exec_code(context, machine, actions))
iszero(cs) || error("failed to parse on byte ", p)
return records
end;
We can verify that it works:
parse_fasta(DNAAlphabet{4}, ">hello\nTAG\n>there\nTAA\n")
2-element Array{FastaRecord{DNAAlphabet{4}},1}:
FastaRecord{DNAAlphabet{4}}("hello", TAG)
FastaRecord{DNAAlphabet{4}}("there", TAA)
With the basics now covered, let's step up the game a tad and make it parse SimpleFasta records with multiple lines per sequence. We first define a seqline
pattern that looks like the old sequence
pattern, and build a new sequence
pattern from multiple seqline
patterns.
machine = let
# Define SimpleFasta syntax
header = re">[a-z]+"
seqline = re"[ACGT]+"
sequence = seqline * re.rep(re"\n" * seqline)
record = header * re"\n" * sequence * re"\n"
fasta = re.rep(record)
# Define SimpleFasta actions
header.actions[:enter] = [:enter]
header.actions[:exit] = [:exit_header]
seqline.actions[:enter] = [:enter]
seqline.actions[:exit] = [:exit_seqline]
record.actions[:exit] = [:exit_record]
Automa.compile(fasta)
end;
The machine now has a subtle change with a small loop between node 5 and 6 representing the parser seeing multiple sequence lines.
display_machine(machine)
We also need to update the actions
. Here, I use :(this syntax)
, which is equivalent to
quote
this syntax
end
but more readable for one-liners. I keep a buffer containing the sequence, and append every sequence line to the buffer at the end of each seqline
. When the record is done, I create the sequence and empty the buffer for the next record.
actions = Dict(
:enter => :(mark= p),
:exit_header => :(header = String(data[mark+1:p-1])),
:exit_seqline => quote
doff = length(seqbuffer) + 1
resize!(seqbuffer, length(seqbuffer) + p - mark)
copyto!(seqbuffer, doff, data, mark, p-mark)
end,
:exit_record => quote
sequence = LongSequence{A}(seqbuffer)
empty!(seqbuffer)
record = FastaRecord(header, sequence)
push!(records, record)
end,
);
We need to redefine parse_fasta
, now also containing the seqbuffer
:
@eval function parse_fasta(::Type{A}, data::Union{String,Vector{UInt8}}) where {A <: Alphabet}
mark = 0
records = FastaRecord{A}[]
header = ""
sequence = LongSequence{A}()
seqbuffer = UInt8[]
$(Automa.generate_init_code(context, machine))
p_end = p_eof = lastindex(data)
$(Automa.generate_exec_code(context, machine, actions))
iszero(cs) || error("failed to parse on byte ", p)
return records
end;
And we can confirm it now works for multiple sequences!
parse_fasta(DNAAlphabet{2}, ">hello\nTAG\nGGG\n>there\nTAA\nTAG\n")
2-element Array{FastaRecord{DNAAlphabet{2}},1}:
FastaRecord{DNAAlphabet{2}}("hello", TAGGGG)
FastaRecord{DNAAlphabet{2}}("there", TAATAG)
One of the greatest things about Automa is how we can parse quite complicated formats without needing to manually construct much code. For example, here, I change only the regular expressions:
\r\n
)re.space() \ (re"\r" | re"\n")
means all space characters except
machine = let
# Define FASTA syntax
newline = re.opt("\r") * re"\n"
hspace = re.space() \ (re"\r" | re"\n")
header = re">" * re.rep1((re.any() \ re.space())) * re.opt(hspace)
seqline = re"[acgturmkyswbdhvnACGTURMKYSWBDHVN\-*]+" * re.opt(hspace)
sequence = seqline * re.rep(newline * seqline)
record = header * newline * sequence * newline
trailing_record = header * newline * sequence * re.rep(newline * re.opt(hspace))
fasta = re.rep(newline) * re.rep(record) * re.opt(trailing_record)
# Define FASTA actions
header.actions[:enter] = [:enter]
header.actions[:exit] = [:exit_header]
seqline.actions[:enter] = [:enter]
seqline.actions[:exit] = [:exit_seqline]
record.actions[:exit] = [:exit_record]
trailing_record.actions[:exit] = [:exit_record]
Automa.compile(fasta)
end;
The machine is now a bit more complicated. But who cares, it's automatically generated. Look how easy it was to generate!
display_machine(machine)
Besides its complexity, building parsers with Automa has some pitfalls - notably those caused by Automa transitioning state for every input byte.
Unfortunately, it's very easy to accidentally create a machine that can't possibly figure out what action to take when looking only at one byte at a time. The following simple machine will not compile (on the master branch of Automa).
bad_machine = let
left = re"Turn left!"
right = re"Turn right!"
left_or_right = left | right
left.actions[:enter] = [:turn_left]
right.actions[:enter] = [:turn_right]
Automa.compile(left_or_right)
end
Ambiguous DFA: Input 0x54 can lead to actions [:turn_right] or [:turn_left]
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] validate_paths(::Array{Tuple{Union{Nothing, Automa.Edge},Array{Symbol,1}},1}) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:176
[3] validate_nfanodes(::Automa.StableSet{Automa.NFANode}) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:195
[4] nfa2dfa(::Automa.NFA, ::Bool) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:104
[5] compile(::Automa.RegExp.RE; optimize::Bool, unambiguous::Bool) at /home/jakob/.julia/packages/Automa/sfHZ6/src/machine.jl:55
[6] compile(::Automa.RegExp.RE) at /home/jakob/.julia/packages/Automa/sfHZ6/src/machine.jl:55
[7] top-level scope at In[27]:9
[8] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091
Automa refuse to create this FSM. Why? Well, the first byte it expects is a T
(or 0x54
) - but there is no way of knowing whether it's entering left
or right
when it sees that byte, and these two patters have conflicting actions! If they had the same actions, there would be no problem.
This situation is highly artificial, but the situation happens often in real life. Here's a very subtle change to one of the previous parsers:
bad_machine = let
# Define SimpleFasta syntax
header = re">[a-z]+"
seqline = re"[ACGT]+"
sequence = seqline * re.rep(re"\n" * seqline)
record = header * re"\n" * sequence
fasta = re.rep(record * re"\n")
# Define SimpleFasta actions
header.actions[:enter] = [:enter]
header.actions[:exit] = [:exit_header]
seqline.actions[:enter] = [:enter]
seqline.actions[:exit] = [:exit_seqline]
record.actions[:exit] = [:exit_record]
Automa.compile(fasta)
end;
Ambiguous DFA: Input 0x0a can lead to actions [:exit_seqline, :exit_record] or [:exit_seqline]
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] validate_paths(::Array{Tuple{Union{Nothing, Automa.Edge},Array{Symbol,1}},1}) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:176
[3] validate_nfanodes(::Automa.StableSet{Automa.NFANode}) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:195
[4] nfa2dfa(::Automa.NFA, ::Bool) at /home/jakob/.julia/packages/Automa/sfHZ6/src/dfa.jl:104
[5] compile(::Automa.RegExp.RE; optimize::Bool, unambiguous::Bool) at /home/jakob/.julia/packages/Automa/sfHZ6/src/machine.jl:55
[6] compile(::Automa.RegExp.RE) at /home/jakob/.julia/packages/Automa/sfHZ6/src/machine.jl:55
[7] top-level scope at In[28]:16
[8] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091
After a seqline
, when encountering a newline, there is no way of knowing whether this marks the end of a record or whether it continues at the next line. Automa can't look ahead, it has to make a decision at every byte.
Here, the solution is to move the newline from the definition of fasta
to that of sequence
. That way, the newline functions as a kind of one-byte lookahead - if the byte after the newline is a >
, it knows the record is complete.
In general, encountering situations like this is usually a sign that the parser is badly built - or that the format is. You can usually resolve it by using single-byte lookahead like above. If not, it is possible to optionally disable the check for ambiguous actions by passing a keyword to the Automa.parse
function. But beware that this may lead to absurd results.
Some formats are naturally recursive. The Newick format, for example, defines “subtree” in terms of “internal”, which is defined in terms of “branchset”, defined in terms of “branch” which is defined in terms of… “subtree”.
Something like that will not work Automa:
simple_newick = let
name = re"[a-z_]"
clade = name | (re"\(" * cladeset * ")")
cladeset = clade * re.rep(re"," * clade)
Automa.compile(cladeset * re";")
end
UndefVarError: cladeset not defined
Stacktrace:
[1] top-level scope at In[29]:3
[2] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091
Because, in general, you can't refer to objects in Julia that have not yet been defined. Worse, even if the syntactical challenge was solved, recursive patterns usually only make sense if you have lookahead. Newick files, for example, simply can't be parsed by FSMs, because every time you see an open parenthesis, you need to make sure there is a closing one further in the file, and this requires lookahead that can't be resolved by parsing one byte at a time.
Luckily, because Automa can execute arbitrary Julia code, we can have the FSM manipulate a stack and turn the FSM into a pushdown automaton, which can handle formats like Newick. It does require a little more manual fiddling, and is less elegant:
simple_newick = let
name = re"[a-z_]+"
cladestart = re"\("
cladestop = re"\)"
cladesep = re","
nodechange = cladestart | cladestop | cladesep
newick_element = nodechange * re.opt(name)
tree = re.opt(name) * re.rep(newick_element) * re";"
cladestart.actions[:enter] = [:push]
cladestop.actions[:enter] = [:pop]
cladesep.actions[:enter] = [:pop, :push]
Automa.compile(tree)
end
actions = Dict(
:push => quote
level -= 1
end,
:pop => quote
iszero(level) && error("Too many closing parentheses")
level += 1
end
);
@eval function parse_newick(data::Union{String,Vector{UInt8}})
level = 0
$(Automa.generate_init_code(context, simple_newick))
p_end = p_eof = lastindex(data)
$(Automa.generate_exec_code(context, simple_newick, actions))
iszero(cs) || error("failed to parse on byte ", p)
iszero(level) || error("Too many opening parentheses")
end;
And it sort of works:
println("Does this parse? ", parse_newick("(hello,hi);"))
println("Does this parse? ", parse_newick("(hello,hi)(;"))
Does this parse? true
Too many opening parentheses
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] parse_newick(::String) at ./In[31]:8
[3] top-level scope at In[32]:2
[4] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091
Here, I needed to manually keep track of the level of nesting. In fact, I would have to keep track of more stuff to disallow inputs like ();
, removing some of the point of Automa - namely, that it is automatic.
In the next part of this tutorial, I will show how to use Automa to create honest-to-God parsers which integrate with the BioJulia ecosystem and can work on streamed data.
Automa's parsers are pretty fast. To create hand-written parsers of comparable speeds, you need to microoptimize every operation, which is annoying. That being said, for actually loading parsed files to be fast, you need to optimize the user-defined actions in the actions dictionary, too.
How fast is our un-optimized Fasta parser already? Let's test it on some data and find out. I'll make 50k sequences with 2k base pairs each, for a total of 100 MB data.
Remember to re-run the block of code where the parse_fasta
function is defined since we changed the machine.
# Create 50k records with 2kbp each
function generate_data()
open("/tmp/fasta.fna", "w") do file
for seq in 1:50000
println(file, '>', join(rand('a':'z', 16)))
for line in 1:20
println(file, join(rand("ACGT", 100)))
end
end
end
end
generate_data()
function parsedata()
open("/tmp/fasta.fna") do file
parse_fasta(DNAAlphabet{2}, read(file))
end
end;
parsedata();
@time parsedata();
0.330489 seconds (200.04 k allocations: 138.861 MiB, 4.15% gc time)
Alright! It does about 300 MB/s on my laptop. Not bad! I'd say for most applications, this speed is more than sufficient.
But this is Julia. We want to be able to optimize the crap out of it. So let's optimize the actions:
actions = Dict(
:enter => :(mark = p),
# Create string with only one copy of the data.
:exit_header => :(header = unsafe_string(pointer(data, mark + 1), p - mark - 1)),
# Only resize buffer if it's too small, else just keep track of how many bytes are used.
:exit_seqline => quote
N = p - mark
length(seqbuffer) < filled + N && resize!(seqbuffer, filled + N)
copyto!(seqbuffer, filled+1, data, mark, N)
filled += N
end,
# Use `copyto!` to only encode data directly from data vector into sequence
# note this requires the latest versions of BioSequences.
:exit_record => quote
sequence = copyto!(LongSequence{A}(filled), 1, seqbuffer, 1, filled)
record = FastaRecord(header, sequence)
push!(records, record)
filled = 0
end,
);
If we want to optimize, we also need to use the fastest CodeGenContext
. We use the :goto
-generator. This creates machine code using @goto
-statements, which creates very long and completely unreadable code - but it's fast. Also, who cares if it's hard to read. It's machine-generated code. We disable the FSM's boundschecks when accessing the data
vector - since we don't manipulate the position p
manually, we are confident it will not go out of bounds. And we allow the code generator to unroll loops:
context = Automa.CodeGenContext(generator=:goto, checkbounds=false, loopunroll=4);
We also need to modify the parse
function because we use a new variable called filled
and need to initialize it.
@eval function parse_fasta(::Type{A}, data::Union{String,Vector{UInt8}}) where {A <: Alphabet}
mark = 0
records = FastaRecord{A}[]
header = ""
sequence = LongSequence{A}()
seqbuffer = UInt8[]
filled = 0
$(Automa.generate_init_code(context, machine))
p_end = p_eof = lastindex(data)
$(Automa.generate_exec_code(context, machine, actions))
iszero(cs) || error("failed to parse on byte ", p)
return records
end;
parsedata();
@time parsedata();
0.162113 seconds (150.04 k allocations: 133.520 MiB, 7.90% gc time)
ERROR: ParseError:
# Error @ string:1:9
0.162113 seconds (150.04 k allocations: 133.520 MiB, 7.90% gc time)
# └─────────────────────────────────────────────────────────┘ ── extra tokens after end of expression
Stacktrace:
Okay, we're at slightly above 600 MB/s. Profiling confirms nearly all time is spent encoding the DNA sequences to LongSequence
. We can make it faster still by not storing the data as a LongSequence
, and instead make it a "lazy" object that only constructs the sequence upon demand. Even further, we could avoid heap-allocating each sequence individually. But these optimizations do not pertain directly to Automa, and I will leave them here.