Regex
Automa regex (of the type Automa.RE) are conceptually similar to Julia's built-in regex. They are made using the @re_str macro, like this: re"ABC[DEF]".
Automa regex match individual bytes, not characters. Hence, re"Æ" (with the UTF-8 encoding [0xc3, 0x86]) is equivalent to re"\xc3\x86", and is considered the concatenation of two independent input bytes.
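The byte-level view can be checked with plain Julia, independently of Automa. The following is a minimal sketch using only Base functions (the variable names are illustrative). It shows that "Æ" is one character but two UTF-8 bytes, and that the escaped spelling denotes the same bytes:

```julia
# "Æ" is a single character, but UTF-8 encodes it as two bytes.
s = "Æ"
n_chars = length(s)              # number of characters in the string
bytes = collect(codeunits(s))    # the raw UTF-8 bytes of the string

# The same two bytes written with escapes, as in re"\xc3\x86".
escaped = "\xc3\x86"
same = s == escaped
```

Since an Automa regex operates on the bytes, a pattern containing "Æ" behaves like a pattern that requires 0xc3 followed by 0x86.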
The @re_str macro supports the following content:
- Literal symbols, such as re"ABC", re"\xfe\xa2" or re"Ø"
- | for alternation, as in re"A|B", meaning "A or B".
- Byte sets with [], like re"[ABC]". This means any of the bytes in the brackets, e.g. re"[ABC]" is equivalent to re"A|B|C".
- Inverted byte sets, e.g. re"[^ABC]", meaning any byte except those in re"[ABC]".
- Repetition, with X* meaning zero or more repetitions of X
- +, where X+ means XX*, i.e. 1 or more repetitions of X
- ?, where X? means X | "", i.e. 0 or 1 occurrences of X. It applies to the last element of the regex
- Parentheses to group expressions, like in A(B|C)?
You can combine regex with the following operations:
- * for concatenation, with re"A" * re"B" being the same as re"AB". Regex can also be concatenated with Chars and Strings, which will cause the chars/strings to be converted to regex first.
- | for alternation, with re"A" | re"B" being the same as re"A|B"
- & for intersection of regex, i.e. for regex A and B, the set of inputs matching A & B is exactly the intersection of the inputs matching A and those matching B. As an example, re"A[AB]C+D?" & re"[ABC]+" is equivalent to re"A[AB]C+", since the second regex only excludes the optional trailing D.
- \ for difference, such that for regex A and B, A \ B creates a new regex matching all those inputs that match A but not B.
- ! for inversion, such that !re"[A-Z]" matches all strings other than those which match re"[A-Z]". Note that !re"a" also matches e.g. "aa", since this does not match re"a".
Finally, the functions opt, rep and rep1 are equivalent to the operators ?, * and +, so e.g. opt(re"a" * rep(re"b") * re"c") is equivalent to re"(ab*c)?".
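The operators and functions above can be tried out as in the following sketch. It assumes that Automa's generate_buffer_validator(name, regex) returns code defining a function which returns nothing when the whole input matches the regex. The variable and function names are hypothetical.

```julia
using Automa

# Each operator-built regex should accept the same inputs as the literal
# named in the trailing comment.
concatenated = re"A" * "B"                       # re"AB" (the String is converted)
alternated   = re"A" | re"B"                     # re"A|B"
optional     = opt(re"a" * rep(re"b") * re"c")   # re"(ab*c)?"
no_abc       = re"[a-z]+" \ re"abc"              # lowercase words except "abc"

# Assumption: the generated function returns `nothing` on a full match,
# and some non-`nothing` value otherwise.
eval(generate_buffer_validator(:matches_optional, optional))
eval(generate_buffer_validator(:matches_no_abc, no_abc))

ok_optional  = matches_optional("abbbc") === nothing   # "abbbc" fits (ab*c)?
ok_other     = matches_no_abc("abd") === nothing       # "abd" is lowercase, not "abc"
rejected_abc = matches_no_abc("abc") !== nothing       # "abc" was subtracted
```

Building the regex with operators rather than one long literal makes it possible to name and reuse the intermediate pieces.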
Example
Suppose we want to create a regex that matches a simplified version of the FASTA format. This "simple FASTA" format is defined like so:
- The format is a series of zero or more records, concatenated
- A record consists of the concatenation of:
  - A leading '>'
  - A header, composed of one or more letters in 'a-z'
  - A newline symbol '\n'
  - A series of one or more sequence lines
- A sequence line is the concatenation of:
  - One or more symbols from the alphabet [ACGT]
  - A newline
We can represent this concisely as a regex: re"(>[a-z]+\n([ACGT]+\n)+)*"
To make it easier to read, we typically construct regex incrementally, like so:
fasta_regex = let
    header = re"[a-z]+"
    seqline = re"[ACGT]+"
    record = '>' * header * '\n' * rep1(seqline * '\n')
    rep(record)
end
@assert fasta_regex isa RE
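To see the simple FASTA regex applied to data, the sketch below turns it into a validator function. It assumes that generate_buffer_validator(name, regex) returns code defining a function which returns nothing for input that matches the whole regex, and a non-nothing value otherwise. The name validate_fasta is chosen here for illustration.

```julia
using Automa

# The incrementally built regex from the example above.
fasta_regex = let
    header = re"[a-z]+"
    seqline = re"[ACGT]+"
    record = '>' * header * '\n' * rep1(seqline * '\n')
    rep(record)
end

# Assumption: the generated function returns `nothing` on a full match.
eval(generate_buffer_validator(:validate_fasta, fasta_regex))

# A complete record: header line plus two newline-terminated sequence lines.
valid = validate_fasta(">hello\nTAGAGA\nTAGAG\n") === nothing

# The last sequence line lacks its trailing newline, so the input is incomplete.
missing_newline = validate_fasta(">hello\nTAGAGA\nTAGAG") !== nothing

# Uppercase letters are not allowed in the header.
bad_header = validate_fasta(">helloXXX") !== nothing
```

The two rejected inputs each break one rule of the format definition: every sequence line must end with a newline, and the header may only contain letters in 'a-z'.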
Reference
Automa.RegExp.RE — Type
RE(s::AbstractString)
Automa regular expression (regex) that is used to match a sequence of input bytes. Regex should preferentially be constructed using the @re_str macro: re"ab+c?". Regex can be combined with other regex, strings or chars with *, |, & and \:
- a * b matches inputs that match first a, then b
- a | b matches inputs that match a or b
- a & b matches inputs that match a and b
- a \ b matches inputs that match a but not b
- !a matches all inputs that do not match a.
Set actions to regex with onenter!, onexit!, onall! and onfinal!, and preconditions with precond!.
Example
julia> regex = (re"a*b?" | opt('c')) * re"[a-z]+";
julia> regex = rep1((regex \ "aba") & !re"ca");
julia> regex isa RE
true
julia> compile(regex) isa Automa.Machine
true
See also: @re_str, @compile
Automa.RegExp.@re_str — Macro
@re_str -> RE
Construct an Automa regex of type RE from a string. Note that due to Julia's raw string escaping rules, re"\\" means a single backslash, and so does re"\\\\", while re"\\\\\"" means a backslash, then a quote character.
Examples:
julia> re"ab?c*[def][^ghi]+" isa RE
true
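The raw string escaping rules mentioned above can be inspected without Automa. The sketch below uses Julia's built-in raw"..." literal, which applies the same escaping as other custom string literals, to show the text a string macro receives. How Automa then interprets backslashes inside that text is not shown here.

```julia
# In raw-style literals, backslashes are only special when they precede
# a quote character.
one_backslash   = raw"\\"       # 2 backslashes + closing quote -> 1 backslash
two_backslashes = raw"\\\\"     # 4 backslashes + closing quote -> 2 backslashes
with_quote      = raw"\\\\\""   # 5 backslashes + quote -> 2 backslashes and a quote

lengths = (length(one_backslash), length(two_backslashes), length(with_quote))
```

Under this rule, the text passed for re"\\\\" consists of two backslashes, which is consistent with the note that it denotes a single escaped backslash in the regex.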
See also: RE