Lookahead parsing
Lookahead parsing is a core Spicy concept. Leveraging lookahead makes it possible to build concise grammars which remain comprehensible and maintainable as the grammar grows.
Deep dive: Parsing of vector of unknown size
We have already seen how we can use lookahead parsing to dynamically detect the length of a vector.
type X = unit {
: (b"A")[]; # Extract unknown number of literal 'A' bytes.
x: uint8;
};
We can view the generated parser by requesting grammar debug output from Spicy's
spicyc
compiler.
$ spicyc -D grammar x.spicy -o /dev/null -p
# ~~~~~~~~~~ ~~~~~~~ ~~~~~~~~~~~~ ~~
# | | | |
# | | | - emit generated IR
# | | |
# | | - redirect output of generated code to /dev/null
# | |
# | - compile file 'x.spicy'
# |
# - emit 'grammar' debug stream to stderr
[debug/grammar] === Grammar foo::X
[debug/grammar] Epsilon: <epsilon> -> ()
[debug/grammar] While: anon -> while(<look-ahead-found>): anon_2 [field: anon (*)] [item-type: vector<bytes>] [parse-type: vector<bytes>]
[debug/grammar] Ctor: anon_2 -> b"A" (bytes) (container 'anon') [field: anon_2 (*)] [item-type: bytes] [parse-type: bytes]
[debug/grammar] LookAhead: anon_l1 -> {uint<8> (not a literal)}: <epsilon> | {b"A" (bytes) (id 1)}: anon_l2
[debug/grammar] Sequence: anon_l2 -> anon_2 anon_l1
[debug/grammar] (*) Unit: foo_X -> anon x
[debug/grammar] Variable: x -> uint<8> [field: x (*)] [item-type: uint<8>] [parse-type: uint<8>]
[debug/grammar]
[debug/grammar] -- Epsilon:
[debug/grammar] anon = true
[debug/grammar] anon_l1 = true
[debug/grammar] anon_l2 = false
[debug/grammar] foo_X = false
[debug/grammar]
[debug/grammar] -- First_1:
[debug/grammar] anon = { anon_2 }
[debug/grammar] anon_l1 = { anon_2 }
[debug/grammar] anon_l2 = { anon_2 }
[debug/grammar] foo_X = { anon_2, x }
[debug/grammar]
[debug/grammar] -- Follow:
[debug/grammar] anon = { x }
[debug/grammar] anon_l1 = { x }
[debug/grammar] anon_l2 = { x }
[debug/grammar] foo_X = { }
[debug/grammar]
In above debug output the entry point of the grammar is marked (*)
.
- parsing a unit consists of parsing the
anon
field (corresponding to the anonymous vector), andx
- to parse the vector lookahead is used.
- lookahead inspects a
uint8
(as epsilon) or literalb"A"
Types for lookahead
In addition to literals, lookahead also works with units which start with a literal. Spicy transparently detects such units and will use them for lookahead if possible.
Confirm this yourself by wrapping the literal in above unit in its own unit, and
validating by parsing an input like AAAAA\x01
. Are there any major differences
in the generated grammar?
Using lookahead for conditional parsing
We have seen previously how we can use unit switch
for conditional parsing.
Another instance of conditional parsing occurs when a protocol message
holds one of multiple possible sub-messages (a union). The sub-messages often
contain a tag to denote what kind of sub-message is transmitted.
With a unit switch
statement we could model this like so.
public type X = unit {
tag: uint8;
switch (self.tag) {
1 -> m1: Message1;
2 -> m2: Message2;
* -> : skip bytes &eod; # For unknown message types simply consume all data.
};
};
type Message1 = unit {
payload: bytes &eod;
};
type Message2 = unit {
payload: bytes &eod;
};
The unit switch
statement has a form without control variable which instead
uses lookahead. With this we can push parsing of the tag
variable into the
units concerned with the particular messages so we keep all pieces related to a
particular message together.
public type X = unit {
switch {
-> m1: Message1;
-> m2: Message2;
-> : skip bytes &eod; # For unknown message types, simply consume all data.
};
};
type Message1 = unit {
: skip uint8(1);
payload: bytes &eod;
};
type Message2 = unit {
: skip uint8(2);
payload: bytes &eod;
};