Home
Automa.jl is a package for generating finite-state machines (FSMs) and tokenizers in Julia.
The following code is an example of tokenizing various kinds of numeric literals in Julia.
import Automa
import Automa.RegExp: @re_str
const re = Automa.RegExp
# Describe regular expression patterns.
dec = re"[-+]?[0-9]+"
hex = re"0x[0-9A-Fa-f]+"
oct = re"0o[0-7]+"
prefloat = re"[-+]?([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)"
float = prefloat | re.cat(prefloat | re"[-+]?[0-9]+", re"[eE][-+]?[0-9]+")
number = dec | hex | oct | float
numbers = re.cat(re.opt(number), re.rep(re" +" * number), re" *")
# Register action names to regular expressions.
number.actions[:enter] = [:mark]
int.actions[:exit] = [:dec]
hex.actions[:exit] = [:hex]
oct.actions[:exit] = [:oct]
float.actions[:exit] = [:float]
# Compile a finite-state machine.
machine = Automa.compile(numbers)
#= This generates a SVG file to visualize the state machine.
write("numbers.dot", Automa.dfa2dot(machine.dfa))
run(`dot -Tsvg -o numbers.svg numbers.dot`)
=#
# Bind an action code for each action name.
actions = Dict(
:mark => :(mark = p),
:dec => :(emit(:dec)),
:hex => :(emit(:hex)),
:oct => :(emit(:oct)),
:float => :(emit(:float)),
)
# Generate a tokenizing function from the machine.
@eval function tokenize(data::String)
tokens = Tuple{Symbol,String}[]
mark = 0
$(Automa.generate_init_code(machine))
p_end = p_eof = endof(data)
emit(kind) = push!(tokens, (kind, data[mark:p-1]))
$(Automa.generate_exec_code(machine, actions=actions))
return tokens, cs == 0 ? :ok : cs < 0 ? :error : :incomplete
end
tokens, status = tokenize("1 0x0123BEEF 0o754 3.14 -1e4 +6.022045e23")
Finally, space-separated numbers are tokenized as follows:
julia> tokens
6-element Array{Tuple{Symbol,String},1}:
(:dec,"1")
(:hex,"0x0123BEEF")
(:oct,"0o754")
(:float,"3.14")
(:float,"1e-4")
(:float,"+6.022045e23")
julia> status
:ok
Overview
Automa.jl is composed of three elements: regular expressions, compilers, and code generators. Regular expressions are used to specify patterns that you want to match and bind actions to. A regular expression can be built using APIs provided from the Automa.RegExp
module. The regular expression with actions is then fed to a compiler function that creates a finite state machine and optimizes it to minimize the number of states. Finally, the machine object is used to generate Julia code that can be spliced into functions.
Machines are byte-oriented in a sense that input data fed into a machine is a sequence of bytes. The generated code of a machine reads input data byte by byte and updates a current state variable based on transition rules defined by regular expressions. If one or more actions are associated to a state transition they will be executed before reading a next byte. If no transition rule is found for a byte of a specific state the machine sets the current state to an error value, stops executing, and breaks from a loop.
Regular expressions
Regular expressions in Automa.jl is somewhat more restricted than usual regular expressions in Julia. Some features like lookahead or backreference are not provided. In Automa.jl, re"..."
is used instead of r"..."
because these are different regular expressions. However, the syntax of Automa.jl's regular expressions is a subset of Julia's ones and hence it would be already familiar. Some examples are shown below:
decimal = re"[-+]?[0-9]+"
keyword = re"if|else|while|end"
identifier = re"[A-Za-z_][0-9A-Za-z_]*"
An important feature of regular expressions is composition of (sub-) regular expressions. One or more regular expressions can be composed using following functions:
Function | Alias | Meaning |
---|---|---|
cat(re...) | * | concatenation |
alt(re1, re2...) | | | alternation |
rep(re) | zero or more repetition | |
rep1(re) | one or more repetition | |
opt(re) | zero or one repetition | |
isec(re1, re2) | & | intersection |
diff(re1, re2) | \ | difference (subtraction) |
neg(re) | ! | negation |
Actions can be bind to regular expressions. Currently, there are three kinds of actions: enter, exit, and final. Enter actions will be executed when it enters the regular expression. In contrast, exit actions will be executed when it exits from the regular expression. Final actions will be executed every time when it reaches a final (or accept) state. The following code and figure demonstrate transitions and actions between states.
using Automa
using Automa.RegExp
const re = Automa.RegExp
a = re"a*"
b = re"b"
ab = a * b
a.actions[:enter] = [:enter_a]
a.actions[:exit] = [:exit_a]
a.actions[:final] = [:final_a]
b.actions[:enter] = [:enter_b]
b.actions[:exit] = [:exit_b]
b.actions[:final] = [:final_b]
Compilers
After finished defining a regular expression with optional actions you can compile it into a finite-state machine using the compile
function. The Machine
type is defined as follows:
type Machine
states::UnitRange{Int}
start_state::Int
final_states::Set{Int}
transitions::Dict{Int,Dict{UInt8,Tuple{Int,Vector{Symbol}}}}
eof_actions::Dict{Int,Vector{Symbol}}
dfa::DFA
end
For the purpose of debugging, Automa.jl offers the execute
function, which emulates the machine execution and returns the last state with the action log. Let's execute a machine of re"a*b"
with actions used in the previous example.
julia> machine = compile(ab)
Automa.Machine(<states=1:3,start_state=1,final_states=Set([0,2])>)
julia> Automa.execute(machine, "b")
(2,Symbol[:enter_a,:exit_a,:enter_b,:final_b,:exit_b])
julia> Automa.execute(machine, "ab")
(2,Symbol[:enter_a,:final_a,:exit_a,:enter_b,:final_b,:exit_b])
julia> Automa.execute(machine, "aab")
(2,Symbol[:enter_a,:final_a,:final_a,:exit_a,:enter_b,:final_b,:exit_b])
The Tokenizer
type is also a useful tool built on top of Machine
:
type Tokenizer
machine::Machine
actions_code::Vector{Tuple{Symbol,Expr}}
end
A tokenizer can be created using the compile
function as well but the argument types are different. When defining a tokenizer, compile
takes a list of pattern and action pairs as follows:
tokenizer = compile(
re"if|else|while|end" => :(emit(:keyword)),
re"[A-Za-z_][0-9A-Za-z_]*" => :(emit(:identifier)),
re"[0-9]+" => :(emit(:decimal)),
re"=" => :(emit(:assign)),
re"(" => :(emit(:lparen)),
re")" => :(emit(:rparen)),
re"[-+*/]" => :(emit(:operator)),
re"[\n\t ]+" => :(),
)
A tokenizer tries to find the longest token that is available from the current reading position. When multiple patterns match a substring of the same length, higher priority token placed at a former position in the arguments list will be selected. For example, "else"
matches both :keyword
and :identifier
but the :keyword
action will be run because it is placed before :identifier
in the arguments list.
Code generators
Once a machine or a tokenizer is created it's ready to generate Julia code using metaprogramming techniques. Here is an example to count the number of words in a string:
using Automa
using Automa.RegExp
const re = Automa.RegExp
word = re"[A-Za-z]+"
words = re.cat(re.opt(word), re.rep(re" +" * word), re" *")
word.actions[:exit] = [:word]
machine = compile(words)
actions = Dict(:word => :(count += 1))
# Generate a function using @eval.
@eval function count_words(data)
# initialize a result variable
count = 0
# generate code to initialize variables used by FSM
$(generate_init_code(machine))
# set end and EOF positions of data buffer
p_end = p_eof = endof(data)
# generate code to execute FSM
$(generate_exec_code(machine, actions=actions))
# check if FSM properly finished
if cs != 0
error("failed to count words")
end
return count
end
This will work as we expect:
julia> count_words("")
0
julia> count_words("The")
1
julia> count_words("The quick")
2
julia> count_words("The quick brown")
3
julia> count_words("The quick brown fox")
4
julia> count_words("A!")
ERROR: failed to count words
in count_words(::String) at ./REPL[10]:16
There are two functions that generate Julia and splice Julia code into a function. The first function is generate_init_code
, which generates some code to declare and initialize local variables used by FSM.
julia> generate_init_code(machine)
quote # /Users/kenta/.julia/v0.5/Automa/src/codegen.jl, line 13:
p::Int = 1 # /Users/kenta/.julia/v0.5/Automa/src/codegen.jl, line 14:
p_end::Int = 0 # /Users/kenta/.julia/v0.5/Automa/src/codegen.jl, line 15:
p_eof::Int = -1 # /Users/kenta/.julia/v0.5/Automa/src/codegen.jl, line 16:
cs::Int = 1
end
The input byte sequence is stored in the data
variable, which, in this case, is passed as an argument. The variable p
points at the next byte position in data
. p_end
points at the end position of data available in data
. p_eof
is similar to p_end
but it points at the actual end of the input sequence. In the example above, p_end
and p_eof
are soon set to sizeof(data)
because these two values can be determined immediately. p_eof
would be undefined when data
is too long to store in memory. In such a case, p_eof
is set to a negative integer at the beginning and later set to a suitable position when the end of an input sequence is seen. The cs
variable stores the current state of a machine.
The second function is generate_exec_code
, which generates a loop to emulate the FSM by updating cs
(current state) while reading bytes from data
. You don't need to care about the details of generated code because it is often too complicated to read for human. In short, the generated code tries to read as many bytes as possible from data
and stops when it reaches p_end
or when it fails transition.
After finished execution, the value stored in cs
indicates whether the execution successfully finished or not. cs == 0
means the FSM read all data and finished successfully. cs < 0
means it failed somewhere. cs > 0
means it is still in the middle of execution and needs more input data if any. The following snippet is a pseudocode of the machine execution:
# start main loop
while p ≤ p_end && cs > 0
l = {{ read a byte of `data` at position `p` }}
if {{ transferable from `cs` with `l` }}
cs = {{ next state of `cs` with `l` }}
{{ execute actions if any }}
else
cs = -cs
end
p += 1 # increment the position variable
end
if p_eof ≥ 0 && p > p_eof && cs ∈ machine.final_states
{{ execute EOF actions if any }}
cs = 0
elseif cs < 0
p -= 1 # point at the last read byte
end