Error recovery
Even with a grammar perfectly modelling a specification, parsing of real data can fail due to e.g.,
- endpoints not conforming to spec, or
- gaps in the input data due to capture loss.
Instead of altogether aborting parsing we would like to gracefully recover from parse errors, i.e., when the parser encounters a parse error we would like skip input until it can parse again.
Spicy includes support for expressing such recovery with the following model:
-
To resynchronize the input potential synchronization points are annotated, e.g., to synchronize input at the sequence
b"POST"the grammar might contain a field: b"POST" &synchronize;All constructs supporting lookahead parsing can be synchronization points, e.g., literals or fields with
unittype with a literal at a fixed offset. -
On a parse error the unit enters a synchronization trial mode.
Once the input could be synchronized a
%syncedhook is invoked. The implementation of the hook can examine the data up to the&synchronizefield, and eitherconfirmit to leave trial mode and continue normal parsing, orrejectit to look for a later synchronization point.
Exercises
Let's assume we are parsing a protocol where valid messages are always the
sequence AB, i.e., a the byte sequence b"AB". We will use the following
contrived grammar:
module foo;
public type Messages = unit {
: Message[];
};
type Message = unit {
a: b"A";
b: b"B";
};
on Message::%done { print self; }
-
Validate that this grammar can parse the input
ABABAB$ printf ABABAB | spicy-driver % [$a=b"A", $b=b"B"] [$a=b"A", $b=b"B"] [$a=b"A", $b=b"B"] -
What do you see if you pass misspelled input, like with the second
Achanged to1, i.e., the inputAB1BABWhy is this particular source range shown as error location?
Solution
[$a=b"A", $b=b"B"] [error] terminating with uncaught exception of type spicy::rt::ParseError: no expected look-ahead token found (foo.spicy:3:30-4:17)We first the result of parsing for the first
MessagefromAB, and encounter an error for the second element.The error corresponds to parsing the vector inside
Messages. The grammar expects eitherAto start a newMessage, or end of data to signal the end of the input;1matches neither so lookahead parsing fails. -
What are the potential synchronization points in this grammar we could use so we can extract the remaining data?
Solution
In this case parsing failed at the first field of
Message,Message::a. We coulda. synchronize on
Message::bby changing it tob: b"B" &synchronize;b. Synchronize on
Message::ain the next message, i.e., abandon parsing the remaining fields inMessageand start over. For that we would synchronize on the vector elements inMessages,: (Message &synchronize)[];A slight modification of this grammar seems to fail to synchronize and run into an edge case, https://github.com/zeek/spicy/issues/1594.
-
If you had to choose one, which one would you pick? What are the trade-offs?
Solution
-
If we synchronize on
Message::bit would seem that we should be able to recover at its data.This however does not work since the vector uses lookahead parsing, so we would fail already in
Messagesbefore we could recover inMessage. -
We need to synchronize on the next vector element.
In larger units synchronizing high up (e.g., on a vector in the top-level unit) allows recovering from more general errors at the cost of not extracting some data, e.g., we would be able to also handle misspelled
Bs in this example.
-
-
Add a single
&synchronizedattribute to the grammar so you can handle all possible misspellings. Also add a%syncedhook to confirm the synchronization result (on which unit?). Can you parse inputs like these?ABABAB AB1BAB A11BABYou can enable the
spicy-verbosedebug stream to show parsing progress.printf AB1BAB | HILTI_DEBUG=spicy-verbose spicy-driver -d foo.spicySolution
module foo; public type Messages = unit { : (Message &synchronize)[]; }; type Message = unit { a: b"A"; b: b"B"; }; on Message::%done { print self; } on Messages::%synced { confirm; }