Regex
Automa regex (of the type Automa.RE) are conceptually similar to the Julia built-in regex. They are made using the @re_str macro, like this: re"ABC[DEF]".
Automa regex match individual bytes, not characters. Hence, re"Æ" (with the UTF-8 encoding [0xc3, 0x86]) is equivalent to re"\xc3\x86", and is considered the concatenation of two independent input bytes.
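The byte-level behaviour can be checked with a short sketch. It assumes Automa's generate_buffer_validator, whose generated function returns nothing when the whole input matches; the names validate_char and validate_bytes are hypothetical and chosen only for this illustration.

```julia
using Automa

# "Æ" is a single character, but it is encoded as two UTF-8 bytes.
@assert length("Æ") == 1
@assert collect(codeunits("Æ")) == [0xc3, 0x86]

# Generate one validator per spelling of the regex.
# `validate_char` and `validate_bytes` are hypothetical names.
@eval $(generate_buffer_validator(:validate_char, re"Æ"))
@eval $(generate_buffer_validator(:validate_bytes, re"\xc3\x86"))

# Both validators accept the same two-byte input.
@assert validate_char("Æ") === nothing
@assert validate_bytes("Æ") === nothing

# The first byte on its own is an incomplete match, so it is rejected.
@assert validate_char("\xc3") !== nothing
```

Because matching is per byte, a byte set such as re"[Æ]" would presumably be a set of two separate bytes rather than one character, so multi-byte characters are best kept out of byte sets.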
The @re_str macro supports the following content:
- Literal symbols, such as re"ABC", re"\xfe\xa2" or re"Ø"
- | for alternation, as in re"A|B", meaning "A or B".
- Byte sets with [], like re"[ABC]". This means any of the bytes in the brackets, e.g. re"[ABC]" is equivalent to re"A|B|C".
- Inverted byte sets, e.g. re"[^ABC]", meaning any byte, except those in re"[ABC]".
- Repetition, with X* meaning zero or more repetitions of X
- +, where X+ means XX*, i.e. 1 or more repetitions of X
- ?, where X? means X | "", i.e. 0 or 1 occurrences of X. It applies to the last element of the regex
- Parentheses to group expressions, like in A(B|C)?
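As an illustration of the syntax above, the following sketch checks a few inputs against two small patterns. It assumes Automa's generate_buffer_validator, whose generated function returns nothing on a full match; validate_group, validate_inverted and accepts are hypothetical names introduced for the example.

```julia
using Automa

# `validate_group` is a hypothetical name for the generated function.
@eval $(generate_buffer_validator(:validate_group, re"A(B|C)?"))

# Hypothetical helper: true when the whole input matches the regex.
accepts(s) = validate_group(s) === nothing

@assert accepts("A")       # the optional group is absent
@assert accepts("AB")      # the group matched its first alternative
@assert accepts("AC")      # the group matched its second alternative
@assert !accepts("ABC")    # the group may occur at most once
@assert !accepts("")       # the leading A is required

# Inverted byte set with repetition: one or more bytes other than A, B or C.
@eval $(generate_buffer_validator(:validate_inverted, re"[^ABC]+"))
@assert validate_inverted("xyz") === nothing
@assert validate_inverted("xAz") !== nothing
```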
You can combine regex with the following operations:
- * for concatenation, with re"A" * re"B" being the same as re"AB". Regex can also be concatenated with Chars and Strings, which will cause the chars/strings to be converted to regex first.
- | for alternation, with re"A" | re"B" being the same as re"A|B"
- & for intersection of regex, i.e. for regex A and B, the set of inputs matching A & B is exactly the intersection of the inputs matching A and those matching B. As an example, re"A[AB]C+D?" & re"[ABC]+" is equivalent to re"A[AB]C+", since the second regex only rules out the trailing D.
- \ for difference, such that for regex A and B, A \ B creates a new regex matching all those inputs that match A but not B.
- ! for inversion, such that !re"[A-Z]" matches all other strings than those which match re"[A-Z]". Note that !re"a" also matches e.g. "aa", since this does not match re"a".
Finally, the functions opt, rep and rep1 are equivalent to the operators ?, * and +, so e.g. opt(re"a" * rep(re"b") * re"c") is equivalent to re"(ab*c)?".
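The operators can be tried out with a small sketch like the one below. It assumes Automa's generate_buffer_validator, whose generated function returns nothing on a full match; all variable and validator names are hypothetical and exist only for this example.

```julia
using Automa

# Difference: lowercase words, except the exact word "end".
not_end = re"[a-z]+" \ re"end"
# Inversion: every input except the single byte "a".
not_a = !re"a"
# The function form and the literal form of the same pattern.
by_function = opt(re"a" * rep(re"b") * re"c")
by_literal = re"(ab*c)?"

# The validator names below are hypothetical.
@eval $(generate_buffer_validator(:validate_not_end, not_end))
@eval $(generate_buffer_validator(:validate_not_a, not_a))
@eval $(generate_buffer_validator(:validate_function, by_function))
@eval $(generate_buffer_validator(:validate_literal, by_literal))

@assert validate_not_end("begin") === nothing
@assert validate_not_end("ends") === nothing
@assert validate_not_end("end") !== nothing   # removed by the difference

@assert validate_not_a("aa") === nothing      # "aa" does not match re"a"
@assert validate_not_a("a") !== nothing

# Both spellings accept and reject the same sample inputs.
for input in ["", "ac", "abbc", "ab", "c"]
    @assert (validate_function(input) === nothing) ==
            (validate_literal(input) === nothing)
end
```

Agreement on a handful of inputs does not prove that two regex are equivalent, but it is a cheap sanity check when rewriting a pattern from one form to the other.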
Example
Suppose we want to create a regex that matches a simplified version of the FASTA format. This "simple FASTA" format is defined like so:
- The format is a series of zero or more records, concatenated
- A record consists of the concatenation of:
    - A leading '>'
    - A header, composed of one or more letters in 'a-z',
    - A newline symbol '\n'
    - A series of one or more sequence lines
- A sequence line is the concatenation of:
    - One or more symbols from the alphabet [ACGT]
    - A newline
We can represent this concisely as a regex: re"(>[a-z]+\n([ACGT]+\n)+)*". To make it easier to read, we typically construct regex incrementally, like so:
fasta_regex = let
    header = re"[a-z]+"
    seqline = re"[ACGT]+"
    record = '>' * header * '\n' * rep1(seqline * '\n')
    rep(record)
end
@assert fasta_regex isa RE

Reference
Automa.RegExp.RE — Type
RE(s::AbstractString)
Automa regular expression (regex) that is used to match a sequence of input bytes. Regex should preferably be constructed using the @re_str macro: re"ab+c?". Regex can be combined with other regex, strings or chars with *, |, & and \:
- a * b matches inputs that match first a, then b
- a | b matches inputs that match a or b
- a & b matches inputs that match a and b
- a \ b matches inputs that match a but not b
- !a matches all inputs that do not match a.
Set actions to regex with onenter!, onexit!, onall! and onfinal!, and preconditions with precond!.
Example
julia> regex = (re"a*b?" | opt('c')) * re"[a-z]+";
julia> regex = rep1((regex \ "aba") & !re"ca");
julia> regex isa RE
true
julia> compile(regex) isa Automa.Machine
true
See also: @re_str, compile
Automa.RegExp.@re_str — Macro
@re_str -> RE
Construct an Automa regex of type RE from a string. Note that due to Julia's raw string escaping rules, re"\\" means a single backslash, and so does re"\\\\", while re"\\\\\"" means a backslash, then a quote character.
Examples:
julia> re"ab?c*[def][^ghi]+" isa RE
true
See also: RE
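The escaping rules mentioned in the @re_str docstring come from Julia itself, so they can be inspected without Automa. The sketch below uses Base's raw"..." literal, which follows the same rules as other non-standard string literals, to show the characters a string macro receives. How the regex parser then interprets those characters is described in the docstring, not demonstrated here.

```julia
# Backslashes directly before a quote are halved by the raw string rules.
@assert raw"\\" == "\\"            # two backslashes before the closing quote become one
@assert raw"\\\\" == "\\\\"        # four become two
@assert raw"\\\\\"" == "\\\\\""    # five, then a quote: two backslashes and a literal quote

# Backslashes that are not followed by a quote are passed through unchanged.
@assert raw"a\\b" == "a\\\\b"
@assert length(raw"a\\b") == 4
```

Read together with the docstring, this suggests that re"\\" hands the parser one backslash while re"\\\\" hands it two, which the parser treats as an escaped backslash, so both denote a single backslash byte.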