# Text validators
The simplest use of Automa is to match a regex. You are unlikely to want to use Automa for this instead of Julia's built-in regex engine PCRE, unless you need the extra performance that Automa brings over PCRE. Nonetheless, it serves as a good starting point to introduce Automa.
Suppose we have the FASTA regex from the regex page:
```julia
julia> fasta_regex = let
           header = re"[a-z]+"
           seqline = re"[ACGT]+"
           record = '>' * header * '\n' * rep1(seqline * '\n')
           rep(record)
       end;
```
## Buffer validator
Automa comes with a convenience function `generate_buffer_validator`. Given a regex (`RE`) like the one above, we can do:
```julia
julia> eval(generate_buffer_validator(:validate_fasta, fasta_regex));

julia> validate_fasta
validate_fasta (generic function with 1 method)
```
We now have a function that checks whether some data matches the regex:
```julia
julia> validate_fasta(">hello\nTAGAGA\nTAGAG") # missing trailing newline
0

julia> validate_fasta(">helloXXX") # error at byte index 7
7

julia> validate_fasta(">hello\nTAGAGA\nTAGAG\n") # nothing; it matches
```
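As a sketch of how these return values might be used in practice, the script below regenerates the validator from the regex above and wraps it in a hypothetical helper, `check_fasta`, that turns the three possible results (`nothing`, `0`, or a byte index) into a readable message. It assumes Automa is installed; `check_fasta` is not part of Automa.

```julia
using Automa

# The FASTA regex from above, repeated so this sketch is self-contained.
fasta_regex = let
    header = re"[a-z]+"
    seqline = re"[ACGT]+"
    record = '>' * header * '\n' * rep1(seqline * '\n')
    rep(record)
end

# Generate and define the validator function.
eval(generate_buffer_validator(:validate_fasta, fasta_regex))

# Hypothetical helper: map the validator's return value to a message.
function check_fasta(data)
    result = validate_fasta(data)
    result === nothing && return "valid FASTA"
    result == 0 && return "invalid: input ended unexpectedly"
    return "invalid byte at index $result"
end

check_fasta(">helloXXX")
```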
## IO validators
For large files, reading all the data into a buffer to validate it may not be feasible. Automa also supports creating IO validators with the `generate_io_validator` function. This works very similarly to `generate_buffer_validator`, but the generated function takes an `IO` and has a different return value:
- If the data matches, still return `nothing`
- Else, return `(byte, (line, column))` where `byte` is the first errant byte, and `(line, column)` the position of the byte. If the errant byte is a newline, `column` is 0. If the input reaches unexpected EOF, `byte` is `nothing`, and `(line, column)` points to the last line/column in the IO:
```julia
julia> eval(generate_io_validator(:validate_io, fasta_regex));

julia> validate_io(IOBuffer(">hello\nTAGAGA\n"))

julia> validate_io(IOBuffer(">helX"))
(0x58, (1, 5))

julia> validate_io(IOBuffer(">hello\n\n"))
(0x0a, (3, 0))

julia> validate_io(IOBuffer(">hello\nAC"))
(nothing, (2, 2))
```
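A minimal sketch of validating a file through an open handle rather than an in-memory buffer, assuming Automa and TranscodingStreams are installed (per the reference below, `generate_io_validator` requires TranscodingStreams to be loaded). The temporary-file plumbing here is just for the example:

```julia
using Automa
using TranscodingStreams  # generate_io_validator requires this to be loaded

# The FASTA regex from above, repeated so this sketch is self-contained.
fasta_regex = let
    header = re"[a-z]+"
    seqline = re"[ACGT]+"
    record = '>' * header * '\n' * rep1(seqline * '\n')
    rep(record)
end

eval(generate_io_validator(:validate_io, fasta_regex))

# Write a small FASTA file, then validate it by streaming from an open handle.
path, io = mktemp()
write(io, ">hello\nTAGAGA\n")
close(io)

result = open(validate_io, path)
result === nothing || error("invalid FASTA at (line, column) $(result[2])")
```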
## Reference
Automa.generate_buffer_validator — Function

```julia
generate_buffer_validator(name::Symbol, regexp::RE; goto=true, docstring=true)
```

Generate code that, when evaluated, defines a function named `name`, which takes a single argument `data`, interpreted as a sequence of bytes. The function returns `nothing` if `data` matches the `Machine`, else the index of the first invalid byte. If the machine reaches unexpected EOF, it returns `0`. If `goto`, the function uses the faster but more complicated `:goto` code. If `docstring`, automatically create a docstring for the generated function.
Automa.generate_io_validator — Function

```julia
generate_io_validator(funcname::Symbol, regex::RE; goto::Bool=false)
```

NOTE: This method requires TranscodingStreams to be loaded.

Create code that, when evaluated, defines a function named `funcname`. This function takes an `IO`, and checks if the data in the input conforms to the regex, without executing any actions. If the input conforms, return `nothing`. Else, return `(byte, (line, col))`, where `byte` is the first invalid byte, and `(line, col)` the 1-indexed position of that byte. If the invalid byte is a newline, `col` is 0 and the line number is incremented. If the input errors due to unexpected EOF, `byte` is `nothing`, and the line and column given are those of the last byte in the file. If `goto`, the function uses the faster but more complicated `:goto` code.
Automa.compile — Function

```julia
compile(re::RE; optimize::Bool=true, unambiguous::Bool=true)::Machine
```

Compile a finite state machine (FSM) from `re`. If `optimize`, attempt to minimize the number of states in the FSM. If `unambiguous`, disallow creation of FSMs where the actions are not deterministic.

Examples

```julia
machine = let
    name = re"[A-Z][a-z]+"
    first_last = name * re" " * name
    last_first = name * re", " * name
    compile(first_last | last_first)
end
```
```julia
compile(tokens::Vector{RE}; unambiguous=false)::TokenizerMachine
```

Compile the regex `tokens` to a tokenizer machine. The machine can be passed to `make_tokenizer`.

The keyword `unambiguous` decides which of multiple matching tokens is emitted: If `false` (default), the longest token is emitted. If multiple tokens have the same length, the one with the highest index is emitted. If `true`, `make_tokenizer` will error if any possible input text can be ambiguously broken down into tokens.
See also: `Tokenizer`, `make_tokenizer`, `tokenize`