Error recovery
Even with a grammar perfectly modelling a specification, parsing of real data can fail due to e.g.,
- endpoints not conforming to spec, or
- gaps in the input data due to capture loss.
Instead of altogether aborting parsing we would like to gracefully recover from parse errors, i.e., when the parser encounters a parse error we would like skip input until it can parse again.
Spicy includes support for expressing such recovery with the following model:
-
To resynchronize the input potential synchronization points are annotated, e.g., to synchronize input at the sequence
b"POST"
the grammar might contain a field: b"POST" &synchronize;
All constructs supporting lookahead parsing can be synchronization points, e.g., literals or fields with
unit
type with a literal at a fixed offset. -
On a parse error the unit enters a synchronization trial mode.
Once the input could be synchronized a
%synced
hook is invoked. The implementation of the hook can examine the data up to the&synchronize
field, and eitherconfirm
it to leave trial mode and continue normal parsing, orreject
it to look for a later synchronization point.
Exercises
Let's assume we are parsing a protocol where valid messages are always the
sequence AB
, i.e., a the byte sequence b"AB"
. We will use the following
contrived grammar:
module foo;
public type Messages = unit {
: Message[];
};
type Message = unit {
a: b"A";
b: b"B";
};
on Message::%done { print self; }
-
Validate that this grammar can parse the input
ABABAB
$ printf ABABAB | spicy-driver % [$a=b"A", $b=b"B"] [$a=b"A", $b=b"B"] [$a=b"A", $b=b"B"]
-
What do you see if you pass misspelled input, like with the second
A
changed to1
, i.e., the inputAB1BAB
Why is this particular source range shown as error location?
Solution
[$a=b"A", $b=b"B"] [error] terminating with uncaught exception of type spicy::rt::ParseError: no expected look-ahead token found (foo.spicy:3:30-4:17)
We first the result of parsing for the first
Message
fromAB
, and encounter an error for the second element.The error corresponds to parsing the vector inside
Messages
. The grammar expects eitherA
to start a newMessage
, or end of data to signal the end of the input;1
matches neither so lookahead parsing fails. -
What are the potential synchronization points in this grammar we could use so we can extract the remaining data?
Solution
In this case parsing failed at the first field of
Message
,Message::a
. We coulda. synchronize on
Message::b
by changing it tob: b"B" &synchronize;
b. Synchronize on
Message::a
in the next message, i.e., abandon parsing the remaining fields inMessage
and start over. For that we would synchronize on the vector elements inMessages
,: (Message &synchronize)[];
A slight modification of this grammar seems to fail to synchronize and run into an edge case, https://github.com/zeek/spicy/issues/1594.
-
If you had to choose one, which one would you pick? What are the trade-offs?
Solution
-
If we synchronize on
Message::b
it would seem that we should be able to recover at its data.This however does not work since the vector uses lookahead parsing, so we would fail already in
Messages
before we could recover inMessage
. -
We need to synchronize on the next vector element.
In larger units synchronizing high up (e.g., on a vector in the top-level unit) allows recovering from more general errors at the cost of not extracting some data, e.g., we would be able to also handle misspelled
B
s in this example.
-
-
Add a single
&synchronized
attribute to the grammar so you can handle all possible misspellings. Also add a%synced
hook to confirm the synchronization result (on which unit?). Can you parse inputs like these?ABABAB AB1BAB A11BAB
You can enable the
spicy-verbose
debug stream to show parsing progress.printf AB1BAB | HILTI_DEBUG=spicy-verbose spicy-driver -d foo.spicy
Solution
module foo; public type Messages = unit { : (Message &synchronize)[]; }; type Message = unit { a: b"A"; b: b"B"; }; on Message::%done { print self; } on Messages::%synced { confirm; }