Tokenizers (lexers)
A tokenizer, or lexer, is a program that breaks an input text down into smaller chunks and classifies each chunk as one of several tokens. For example, consider an imaginary format that consists only of nested tuples of strings containing letters, like this:
(("ABC", "v"),(("x", ("pj",(("a", "k")), ("L")))))
Any text of this format can be broken down into a sequence of the following tokens:
- Left parenthesis: re"\("
- Right parenthesis: re"\)"
- Comma: re","
- Quote: re"\""
- Spaces: re" +"
- Letters: re"[A-Za-z]+"
Such that e.g. ("XY", "A") can be represented as the token sequence lparens, quote, XY, quote, comma, space, quote, A, quote, rparens.
Breaking the text down into its tokens is called tokenization, or lexing. Note that lexing by itself is not sufficient to parse the format: lexing is context-unaware and doesn't understand syntax, so e.g. the text "((A can be perfectly well tokenized to quote, lparens, lparens, A, even though it's invalid syntax.
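Using the UInt32 tokenizer built in the next section, that looks like this (a sketch; in that definition, token 4 is the quote, 1 the left parenthesis, and 6 the letters token):

julia> collect(tokenize(UInt32, "\"((A"))
4-element Vector{Tuple{Int64, Int32, UInt32}}:
 (1, 1, 0x00000004)
 (2, 1, 0x00000001)
 (3, 1, 0x00000001)
 (4, 1, 0x00000006)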
The purpose of tokenization is to make subsequent parsing easier, because each part of the text has already been classified. That makes it easier to, for example, search for letters in the input: instead of having to muck around with regex to find the letters, you use regex once to classify all the text.
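For instance, once the tokenizer from the next section is defined, pulling out every run of letters is a simple filter over the token stream (a sketch; token 6 is the letters token in that definition):

julia> data = """("XY", "A")""";

julia> [data[i:i+len-1] for (i, len, tok) in tokenize(UInt32, data) if tok == 6]
2-element Vector{String}:
 "XY"
 "A"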
Making and using a tokenizer
Let's use the example above to create a tokenizer. The most basic tokenizer defaults to using UInt32 as tokens: you pass in a list of regexes, one matching each token, then evaluate the resulting code:
julia> make_tokenizer(
           [re"\(", re"\)", re",", re"\"", re" +", re"[a-zA-Z]+"]
       ) |> eval
The make_tokenizer function creates Julia code (an Expr object) that, when evaluated, defines Base.iterate for the Tokenizer type. The code above defined Base.iterate(::Tokenizer{UInt32, D, 1}) where D - we'll get back to the different type parameters of Tokenizer later.
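Since nothing happens until the returned expression is evaluated, you can inspect it first. A small sanity check (a sketch; the exact content of the generated expression is an implementation detail):

julia> expr = make_tokenizer([re"[a-zA-Z]+"]);

julia> expr isa Expr  # only eval(expr) actually defines Base.iterate
true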
Tokenizers are most easily created with the tokenize function. To create a Tokenizer{UInt32}, we can call tokenize(UInt32, data):
julia> iterator = tokenize(UInt32, """("XY", "A")"""); typeof(iterator)
Tokenizer{UInt32, String, 1}
Meaning: a Tokenizer emitting UInt32 tokens over String data, of version 1. Since we used make_tokenizer above to define iteration for this kind of tokenizer (one with UInt32 tokens), we can iterate it.
When we iterate, we get Tuple{Int64, Int32, UInt32} elements, with each element being:
- The start index of the token
- The length of the token
- The token itself - in this example, UInt32(1) for '(', UInt32(2) for ')', etc.:
julia> collect(iterator)
10-element Vector{Tuple{Int64, Int32, UInt32}}:
(1, 1, 0x00000001)
(2, 1, 0x00000004)
(3, 2, 0x00000006)
(5, 1, 0x00000004)
(6, 1, 0x00000003)
(7, 1, 0x00000005)
(8, 1, 0x00000004)
(9, 1, 0x00000006)
(10, 1, 0x00000004)
(11, 1, 0x00000002)
The type of the last element in each tuple comes from the Tokenizer type parameter: we specified UInt32, so we get UInt32 tokens.
Any data which could not be tokenized is given the error token UInt32(0):
julia> collect(tokenize(UInt32, "XY!!)"))
3-element Vector{Tuple{Int64, Int32, UInt32}}:
(1, 2, 0x00000006)
(3, 2, 0x00000000)
(5, 1, 0x00000002)
Both tokenize and make_tokenizer take an optional argument version, which is 1 by default. This sets the last parameter of the Tokenizer struct - for example, make_tokenizer(tokens::Vector{RE}; version=5) defines Base.iterate for Tokenizer{UInt32, D, 5}. Letting the user freely choose the value of the last type parameter makes it possible to create multiple different tokenizers with the same element type.
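For example, we could define a second tokenizer over UInt32 tokens by picking a different version number (a sketch; the regexes here are made up for illustration):

julia> make_tokenizer([re"[0-9]+", re" +"]; version=2) |> eval

julia> collect(tokenize(UInt32, "42 7", 2))
3-element Vector{Tuple{Int64, Int32, UInt32}}:
 (1, 2, 0x00000001)
 (3, 1, 0x00000002)
 (4, 1, 0x00000001)

Both tokenizers now coexist: version 1 iterates with the six tuple-format tokens defined earlier, version 2 with these two.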
Using enums as tokens
Using UInt32 as tokens is not very convenient - so it's possible to use enums to create the tokenizer instead:
julia> @enum Token error lparens rparens comma quot space letters
julia> make_tokenizer((error, [
           lparens => re"\(",
           rparens => re"\)",
           comma => re",",
           quot => re"\"",
           space => re" +",
           letters => re"[a-zA-Z]+"
       ])) |> eval
julia> collect(tokenize(Token, "XY!!)"))
3-element Vector{Tuple{Int64, Int32, Token}}:
(1, 2, letters)
(3, 2, error)
(5, 1, rparens)
To make it even easier, you can define the enum and the tokenizer in one go:
tokens = [
    :lparens => re"\(",
    :rparens => re"\)",
    :comma => re",",
    :quot => re"\"",
    :space => re" +",
    :letters => re"[a-zA-Z]+"
]
@eval @enum Token error $(first.(tokens)...)
make_tokenizer((error,
    [Token(i) => j for (i, j) in enumerate(last.(tokens))]
)) |> eval
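With either definition in place, tokenizing the example from earlier now yields readable token names. A sketch of the expected output, mirroring the UInt32 example above:

julia> collect(tokenize(Token, """("XY", "A")"""))
10-element Vector{Tuple{Int64, Int32, Token}}:
 (1, 1, lparens)
 (2, 1, quot)
 (3, 2, letters)
 (5, 1, quot)
 (6, 1, comma)
 (7, 1, space)
 (8, 1, quot)
 (9, 1, letters)
 (10, 1, quot)
 (11, 1, rparens)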
Token disambiguation
It's possible to create a tokenizer where the different token regexes overlap:
julia> make_tokenizer([re"[ab]+", re"ab*", re"ab"]) |> eval
In this case, an input like ab will match all three regexes. Which tokens are emitted is determined by two rules:
First, the emitted tokens will be as long as possible. So, the input aa could be emitted as one token of the regex re"[ab]+", as two tokens of that same regex, or as two tokens of the regex re"ab*". In this case, it will be emitted as a single token of re"[ab]+", since that makes the first token as long as possible (2 bytes), whereas the other options would make it only 1 byte long.
Second, tokens with a higher index in the input array beat previous tokens. So, a will be emitted as re"ab*", since its index of 2 beats the index 1 of the previous regex re"[ab]+", and ab will match the third regex.
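A sketch of both rules in action, using the tokenizer defined just above (the outputs follow directly from the rules):

julia> collect(tokenize(UInt32, "aa"))  # rule 1: longest match, so re"[ab]+"
1-element Vector{Tuple{Int64, Int32, UInt32}}:
 (1, 2, 0x00000001)

julia> collect(tokenize(UInt32, "a"))  # rule 2: re"ab*" (index 2) beats index 1
1-element Vector{Tuple{Int64, Int32, UInt32}}:
 (1, 1, 0x00000002)

julia> collect(tokenize(UInt32, "ab"))  # rule 2: re"ab" (index 3) beats both
1-element Vector{Tuple{Int64, Int32, UInt32}}:
 (1, 2, 0x00000003)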
If you don't want emitted tokens to depend on these priority rules, you can set the optional keyword unambiguous=true in the make_tokenizer function, in which case make_tokenizer will error if any input text could be broken down into different tokens. Note, however, that this may cause most tokenizers to error when being built, as most tokenization processes are ambiguous.
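For instance, building an unambiguous tokenizer from the overlapping regexes above should fail (a sketch; the exact exception and message are implementation details):

# Expected to throw during construction, since e.g. the input "ab"
# can be broken down into several different tokens
make_tokenizer([re"[ab]+", re"ab*", re"ab"]; unambiguous=true)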
Reference
Automa.Tokenizer — Type

Tokenizer{E, D, C}

Lazy iterator of tokens of type E over data of type D. Tokenizers are usually created with the tokenize function, and their iteration behaviour is defined by make_tokenizer.
Tokenizer works on any buffer-like object that defines pointer and sizeof. When iterated, it will return a Tuple{Integer, Integer, E}:
- The first value in the tuple is the 1-based starting index of the token in the buffer
- The second is the length of the token in bytes
- The third is the token itself.
Un-tokenizable data will be emitted as the "error token", which must also be of type E.
The Int parameter C allows multiple tokenizers to be created with otherwise identical type parameters.
See also: make_tokenizer
Automa.tokenize — Function

tokenize(::Type{E}, data, version=1) -> Tokenizer

Create a Tokenizer{E, typeof(data), version}, iterating tokens of type E over data.

See also: Tokenizer, make_tokenizer, compile
Examples
julia> tokenize(UInt32, "hello")
Tokenizer{UInt32, String, 1}("hello")
julia> tokenize(Int8, [1, 2, 3], 3)
Tokenizer{Int8, Vector{Int64}, 3}([1, 2, 3])
Automa.make_tokenizer — Function

make_tokenizer(
    machine::TokenizerMachine;
    tokens::Tuple{E, AbstractVector{E}}= [ integers ],
    goto=true, version=1
) where E

Create code which, when evaluated, defines Base.iterate(::Tokenizer{E, D, $version}). tokens is a tuple of:
- the error token, which will be emitted for data that cannot be tokenized, and
- a vector of non-error tokens of length machine.n_tokens.

Most users should instead use the more convenient method make_tokenizer(tokens).
Example usage
julia> machine = compile([re"a", re"b"]);
julia> make_tokenizer(machine; tokens=(0x00, [0x01, 0x02])) |> eval
julia> iter = tokenize(UInt8, "abxxxba"); typeof(iter)
Tokenizer{UInt8, String, 1}
julia> collect(iter)
5-element Vector{Tuple{Int64, Int32, UInt8}}:
(1, 1, 0x01)
(2, 1, 0x02)
(3, 3, 0x00)
(6, 1, 0x02)
(7, 1, 0x01)
Any actions inside the input regexes will be ignored. If goto is set (the default), the faster, but more complex, goto code generator is used. The version number sets the last parameter of the Tokenizer, which allows you to create different tokenizers for the same element type.
make_tokenizer(
    tokens::Union{
        AbstractVector{RE},
        Tuple{E, AbstractVector{Pair{E, RE}}}
    };
    goto::Bool=true,
    version::Int=1,
    unambiguous=false
) where E
Convenience function for compiling a TokenizerMachine and then running make_tokenizer on it. In other words, this function returns code that, when run, defines iterate for Tokenizer{$E, D, $version} where D.
If tokens is a tuple, the first element is the error token, and the second element is a vector of token => regex pairs. If tokens is an AbstractVector{RE} v, the token type defaults to UInt32, and it behaves as if it was (UInt32(0), [UInt32(i) => r for (i, r) in pairs(v)]), i.e. UInt32(0) is the error token.
Example
julia> make_tokenizer([re"abc", re"def"]) |> eval
julia> collect(tokenize(UInt32, "abcxyzdef123"))
4-element Vector{Tuple{Int64, Int32, UInt32}}:
(1, 3, 0x00000001)
(4, 3, 0x00000000)
(7, 3, 0x00000002)
(10, 3, 0x00000000)
julia> make_tokenizer((0, collect(pairs([re"x", re"y"]))); version=5) |> eval
julia> iter = tokenize(Int, "xyaby", 5)
Tokenizer{Int64, String, 5}("xyaby")
julia> collect(iter)
4-element Vector{Tuple{Int64, Int32, Int64}}:
(1, 1, 1)
(2, 1, 2)
(3, 2, 0)
(5, 1, 2)