https://soft-dev.org/pubs/html/diekmann_tratt__dont_panic/
logo Software Development Team
* About
* Events
* People
* Projects
* Publications
* Software
* Talks
Cite this paper as: Lukas Diekmann, Laurence Tratt, Don't Panic!
Better, Fewer, Syntax Errors for LR Parsers, ECOOP 2020, July 2020.
Also available as: PDF.
See also: Experiment.
Don't Panic! Better, Fewer, Syntax Errors for LR Parsers
Lukas Diekmann and Laurence Tratt
July, 2020
Abstract Syntax errors are generally easy to fix for humans, but not
for parsers in general nor LR parsers in particular. Traditional
'panic mode' error recovery, though easy to implement and applicable
to any grammar, often leads to a cascading chain of errors that drown
out the original. More advanced error recovery techniques suffer less
from this problem but have seen little practical use because their
typical performance was seen as poor, their worst case unbounded, and
the repairs they reported arbitrary. In this paper we introduce the
CPCT+ algorithm, and an implementation of that algorithm, that
address these issues. First, CPCT+ reports the complete set of
minimum cost repair sequences for a given location, allowing
programmers to select the one that best fits their intention. Second,
on a corpus of 200,000 real-world syntactically invalid Java
programs, CPCT+ is able to repair 98.37%+-0.017% of files within a
timeout of 0.5s. Finally, CPCT+ uses the complete set of minimum cost
repair sequences to reduce the cascading error problem, where
incorrect error recovery causes further spurious syntax errors to be
identified. Across the test corpus, CPCT+ reports 435,812+-473 error
locations to the user, reducing the cascading error problem
substantially relative to the 981,628+-0 error locations reported by
panic mode.
1 Introduction
Programming is a humbling job which requires acknowledging that we
will make untold errors in our quest to perfect a program. Most
troubling are semantic errors, where we intended the program to do
one thing, but it does another. Less troubling, but often no less
irritating, are syntax errors, which are generally minor deviances
from the exacting syntax required by a compiler. So common are syntax
errors that parsers in modern compilers are designed to cope with us
making several: rather than stop on the first syntax error, they
attempt to recover from it. This allows them to report, and us to
fix, all our syntax errors in one go.
When error recovery works well, it is a useful productivity gain.
Unfortunately, most current error recovery approaches are simplistic.
The most common grammar-neutral approach to error recovery are those
algorithms described as 'panic mode' (e.g. [13, p. 348]) which skip
input until the parser finds something it is able to parse. A more
grammar-specific variation of this idea is to skip input until a
pre-determined synchronisation token (e.g. ';' in Java) is reached [8
, p. 3], or to try inserting a single synchronisation token. Such
strategies are often unsuccessful, leading to a cascade of spurious
syntax errors (see Figure 1 for an example). Programmers quickly
learn that only the location of the first error in a file - not the
reported repair, nor the location of subsequent errors - can be
relied upon to be accurate.
---------------------------------------------------------------------
(a) class C {
int x y;
}
Parsing error at line 2 col 9. Repair
(c) sequences found:
1: Delete y
2: Insert ,
3: Insert =
C.java:2: error: ';' expected
int x y;
(b) ^
C.java:2: error: expected
int x y;
^
Figure 1: An example of a simple, common Java syntax error (a) and
the problems traditional error recovery has in dealing with it. javac
(b) spots the error when it encounters 'y'. Its error recovery
algorithm then repairs the input by inserting a semicolon before 'y'
(i.e. making the input equivalent to 'int x; y;'). This then causes a
spurious parsing error, since 'y' on its own is not a valid
statement. The CPCT+ error recovery algorithm we introduce in this
paper produces the output shown in (c): after spotting an error when
parsing encounters 'y', it uses the Java grammar to find the complete
set of minimum cost repair sequences (unlike previous approaches
which non-deterministically find one minimum cost repair sequence).
In this case three repair sequences are reported to the user: one can
delete 'y' entirely ('int x;'), or insert a comma ('int x, y;'), or
insert an equals sign ('int x = y;').
---------------------------------------------------------------------
It is possible to hand-craft error recovery algorithms for a specific
language. These generally allow better recovery from errors, but are
challenging to create. For example, the Java error recovery approach
in the Eclipse IDE is 5KLoC long, making it only slightly smaller
than a modern version of Berkeley Yacc -- a complete parsing system!
Unsurprisingly, few real-world parsers contain effective hand-written
error recovery algorithms.
Most of us are so used to these trade-offs (cheap generic algorithms
and poor recovery vs. expensive hand-written algorithms and
reasonable recovery) that we assume them to be inevitable. However,
there is a long line of work on more advanced generic error recovery
algorithms. Probably the earliest such algorithm is Aho and Peterson
[1], which, upon encountering an error, creates on-the-fly an
alternative (possibly ambiguous) grammar which allows the parser to
recover. This algorithm has fallen out of favour in programming
language circles, probably because of its implementation complexity
and the difficulty of explaining to users what recovery has been
used. A simpler family of algorithms, which trace their roots to
Fischer et al. [11], instead try to find a single minimum cost repair
sequence of token insertions and deletions which allow the parser to
recover. Algorithms in this family are much better at recovering from
errors than naive approaches and can communicate the repairs they
find in a way that humans can easily replicate. However, such
algorithms have seen little practical use because their typical
performance is seen as poor and their worst case unbounded [17,
p. 14]. We add a further complaint: such approaches only report a
single repair sequence to users. In general - and especially in
syntactically rich languages - there are multiple reasonable repair
sequences for a given error location, and the algorithm has no way of
knowing which best matches the user's intentions.
In this paper we introduce a new error recovery algorithm in the
Fischer et al. family, CPCT+. This takes the approach of Corchuelo et
al. [5] as a base, corrects it, expands it, and optimises its
implementation. CPCT+ is simple to implement (under 500 lines of Rust
code), is able to repair nearly all errors in reasonable time,
reports the complete set of minimum cost repair sequences to users,
and does so in less time than Corchuelo et al..
We validate CPCT+ on a corpus of 200,000 real, syntactically
incorrect, Java programs (Section 6). CPCT+ is able to recover 98.37%
+-0.017% of files within a 0.5s timeout and does so while reporting
fewer than half the error locations as a traditional panic mode
algorithm: in other words, CPCT+ substantially reduces the cascading
error problem. We also show - for, as far as we know, the first time
- that advanced error recovery can be naturally added to a Yacc-esque
system, allowing users to make fine-grained decisions about what to
do when error recovery has been applied to an input (Section 7). We
believe that this shows that algorithms such as CPCT+ are ready for
wider usage, either on their own, or as part of a multi-phase
recovery system.
1.1 Defining the problem
Formally speaking, we first test the following hypothesis:
H1
The complete set of minimum cost repair sequences can be found in
acceptable time.
The only work we are aware of with a similar concept of 'acceptable
time' is [6], who define it as the total time spent in error recovery
per file, with a threshold of 1s. We use that definition with one
change: Since many compilers are able to fully execute in less than
1s, we felt that a tighter threshold is more appropriate: we use 0.5s
since we think that even the most demanding users will tolerate such
a delay. We strongly validate this hypothesis.
The complete set of minimum cost repair sequences makes it more
likely that the programmer will see a repair sequence that matches
their original intention (see Figure 1 for an example; Appendix A
contains further examples in Java, Lua, and PHP). It also opens up a
new opportunity. Previous error recovery algorithms find a single
repair sequence, apply that to the input, and then continue parsing.
While that repair sequence may have been a reasonable local choice,
it may cause cascading errors later. Since we have the complete set
of minimum cost repair sequences available, we can select from that
set a repair sequence which causes fewer cascading errors. We thus
rank repair sequences by how far they allow parsing to continue
successfully (up to a threshold -- parsing the whole file would, in
general, be too costly), and choose from the subset that gets
furthest (note that the time required to do this is included in the
0.5s timeout). We thus also test a second hypothesis:
H2
Ranking the complete set of minimum cost repair sequences by how
far they allow parsing to continue locally reduces the global
cascading error problem.
We also strongly validate this hypothesis. We do this by comparing
'normal' CPCT+ with a simple variant CPCT+rev which reverses the
ranking process, always selecting from amongst the worst performing
minimum cost repair sequence. CPCT+rev models the worst case of
previous approaches in the Fischer et al. family, which
non-deterministically select a single minimum cost repair sequence.
CPCT+rev leads to 31.93%+-0.289% more errors being reported (i.e. it
substantially worsens the cascading error problem).
This paper is structured as follows. We describe the Corchuelo et al.
algorithm (Section 4), filling in missing details from the original
description and correcting its definition. We then expand the
algorithm into CPCT+ (Section 5). Finally, we validate CPCT+ on a
corpus of 200,000 real, syntactically incorrect, Java programs
comparing it to implementations of panic mode and Corchuelo et al.
(Section 6). To emphasise that our algorithms are grammar-neutral, we
show examples of error recovery on different grammars in Appendix A.
2 Background
We assume a high-level understanding of the mechanics of parsing in
this paper, but in this section we provide a handful of definitions,
and a brief refresher of relevant low-level details, needed to
understand the rest of this paper. Although the parsing tool we
created for this paper is written in Rust, we appreciate that this is
still an unfamiliar language to many readers: algorithms are
therefore given in Python which, we hope, is familiar to most.
Although there are many flavours of parsing, the Fischer et al.
family of error recovery algorithms are designed to be used with LR(k
) parsers [16]. LR parsing remains one of the most widely used
parsing approaches due to the ubiquity of Yacc [14] and its
descendants (which include the Rust parsing tool we created for this
paper). We use Yacc syntax throughout this paper so that examples can
easily be tested in Yacc-compatible parsing tools.
Yacc-like tools take in a Context-Free Grammar (CFG) and produce a
parser from it. The CFG has one or more rules; each rule has a name
and one or more productions (often called 'alternatives'); each
production contains one or more symbols; and a symbol references
either a token type or a grammar rule. One rule is designated the
start rule. The resulting parser takes as input a stream of tokens,
each of which has a type (e.g. INT) and a value (e.g. 123) - we
assume the existence of a Lex-like tool which can split a string into
a stream of tokens. Strictly speaking, parsing is the act of
determining whether a stream of tokens is correct with respect to the
underlying grammar. Since this is rarely useful on its own, Yacc-like
tools allow grammars to specify 'semantic actions' which are executed
when a production in the grammar is successfully matched. Except
where stated otherwise, we assume that the semantic actions build a
parse tree, ordering the tokens into a tree of nonterminal nodes
(which can have children) and terminal nodes (which cannot have
children).
---------------------------------------------------------------------
Figure 2: An example grammar (top left), its corresponding stategraph
(right), and statetable (split into separate action and goto tables;
bottom left). Productions in the grammar are labelled (I) to (VI). In
the stategraph: S(x) means 'shift to state x'; R(x) means 'reduce
production x from the grammar' (e.g. action(3, '+') returns R(IV)
which references the production 'Factor: Term;'). Each item within a
state [N : a [?] b] references one of rule N's productions; a and b
each represent zero or more symbols; with the dot ([?]) representing
how much of the production must have been matched (a) if parsing has
reached that state, and how much remains (b).
---------------------------------------------------------------------
The CFG is first transformed into a stategraph, a statemachine where
each node contains one or more items (describing the valid parse
states at that point) and edges are labelled with terminals or
nonterminals. Since even on a modern machine, a canonical
(i.e. unmerged) LR stategraph can take several seconds to build, and
a surprising amount of memory to store, we use the algorithm of [21]
to merge together compatible states^1 . The effect of this is
significant, reducing the Java grammar we use later from 8908 to 1148
states. The stategraph is then transformed into a statetable with one
row per state. Each row has a possibly empty action (shift, reduce,
or accept) for each terminal and a possibly empty goto state for each
nonterminal. Figure 2 shows an example grammar, its stategraph, and
statetable.
The statetable allows us to define a simple, efficient, parsing
process. We first define two functions relative to the statetable:
action(s,t) returns the action for the state s and token t or error
if no such action exists; and goto(s,N) returns the goto state for
the state s and the nonterminal N or error if no such goto state
exists. We then define a reduction relation -LR for (parsing stack,
token list) pairs with two reduction rules as shown in Figure 3. A
full LR parse -*LR repeatedly applies the two -LR rules until neither
applies, which means that action(s[n],t[0]) is either: accept
(i.e. the input has been fully parsed); or error(i.e. an error has
been detected at the terminal t[0]). A full parse takes a starting
pair of ([0], [t[0 ] ... t[n],$]), where state 0 is expected to
represent the entry point into the stategraph, t[0 ] ... t[n] is the
sequence of input tokens, and '$' is the special End-Of-File (EOF)
token.
---------------------------------------------------------------------
action(sn,t0)= shift s'----------------'--------- LR Shift([s0...sn],
[t0...tn]) -LR ([s0...sn,s],[t1...tn])
'(action(sn,t0) =-reduce N-: a-)[?](goto(sn-|a|,N-) =-s-)LR Reduce
([s0...sn],[t0...tn]) -LR ([s0...sn- |a|,s'],[t0 ...tn])
Figure 3: Reduction rules for -LR, which operate on (parsing stack,
token list) pairs. LR Shift advances the input by one token and grows
the parsing stack, while LR Reduce unwinds ('reduces') the parsing
stack when a production is complete before moving to a new ('goto')
state.
---------------------------------------------------------------------
3 Panic mode
Error recovery algorithms are invoked by a parser when it has yet to
finish but there is no apparent way to continue parsing (i.e. when
action(s[n],t[0]) = error). Error recovery algorithms are thus called
with a parsing stack and a sequence of remaining input (which we
represent as a list of tokens): they can modify either or both of the
parsing stack and the input in their quest to get parsing back on
track. The differences between algorithms are thus in what
modifications they can carry out (e.g. altering the parse stack;
deleting input; inserting input), and how they carry such
modifications out.
The simplest grammar-neutral error recovery algorithms are widely
called 'panic mode' algorithms (the origin of this family of
algorithms seems lost in time). While there are several members of
this family for LL parsing, there is only one fundamental way of
creating a grammar-neutral panic mode algorithm for LR parsing. We
take our formulation from Holub [13, p. 348]^2 . The algorithm works
by popping elements off the parsing stack to see if an earlier part
of the stack is able to parse the next input symbol. If no element in
the stack is capable of parsing the next input symbol, the next input
symbol is skipped, the stack restored, and the process repeated. At
worst, this algorithm guarantees to find a match at the EOF token.
Figure 4 shows a more formal version of this algorithm.
---------------------------------------------------------------------
1def holub(pstack, toks):
2 while len(toks) > 0:
3 npstack = pstack.copy()
4 while len(npstack) > 0:
5 if action(npstack[-1], toks[0]) != error:
6 return (npstack, toks)
7 npstack.pop()
8 del toks[0]
9 return None
Figure 4: Our version of the Holub [13] algorithm. This panic mode
algorithm takes in a (parsing stack, token list) pair and returns: a
(parsing stack, token list) pair if it managed to recover; or None if
it failed to recover. The algorithm tries to find an element in the
stack that has a non-error action for the next token in the input
(lines 4-7). If it fails to find such an element, the input is
advanced by one element (line 8) and the stack restored (line 3).
---------------------------------------------------------------------
The advantage of this algorithm is its simplicity and speed. For
example, consider the grammar from Figure 2 and the input '2 + + 3'.
The parser encounters an error on the second '+' token, leaving it
with a parsing stack of [0, 2, 7] and the input '+ 3' remaining. The
error recovery algorithm now starts. It first tries action(7, '+')
which (by definition, since it is the place the parser encountered an
error) returns error; it then pops the top element from the parsing
stack and tries action(2, '+'), which returns shift. This is enough
for the error recovery algorithm to complete, and parsing resumes
with a stack [0, 2].
The fundamental problem with error recovery can be seen from the
above example: by popping from the parsing stack, it implicitly
deletes input from before the error location (in this case the first
'+') in order to find a way of parsing input after the error
location. This often leads to panic mode throwing away huge portions
of the input in its quest to find a repair. Not only can the
resulting recovery appear as a Deus ex machina, but the more input
that is skipped, the more likely that a cascade of further parsing
errors ensues (as we will see later in Section 6.2).
4 Corchuelo et al.
There have been many attempts to create better LR error recovery
algorithms than panic mode. Most numerous are those error recovery
algorithms in what we call the Fischer et al. family. Indeed, there
are far too many members of this family of algorithms to cover in one
paper. We therefore start with one of the most recent - Corchuelo et
al. [5]. We first explain the original algorithm (Section 4.1),
although we use different notation than the original, fill in several
missing details, and provide a more formal definition. We then make
two correctness fixes to ensure that the algorithm always finds
minimum cost repair sequences (Section 4.2). Since the original gives
few details as to how the algorithm might best be implemented, we
then explain our approach to making a fast implementation (Section
4.3).
4.1 The original algorithm
Intuitively, the Corchuelo et al. algorithm starts at the error state
and tries to find a minimum cost repair sequence consisting of:
insert T ('insert a token of type T'), delete ('delete the token at
the current offset'), or shift ('parse the token at the current
offset'). The algorithm completes: successfully if it reaches an
accept state or shifts 'enough' tokens (N[shifts], set at 3 in
Corchuelo et al.); or unsuccessfully if a repair sequence contains
too many delete or insert repairs (set at 3 and 4 respectively in
Corchuelo et al.) or spans 'too much' input (N[total], set at 10 in
Corchuelo et al.). Repair sequences are reported back to users with
trailing shift repairs pruned i.e. [insert x, shift y, delete z,
shift a, shift b, shift c] is reported as [insert x, shift y, delete
z].
In order to find repair sequences, the algorithm keeps a
breadth-first queue of configurations, each of which represents a
different search state; configurations are queried for their
neighbours which are put into the queue; and the search terminates
when a successful configuration is found. The cost of a configuration
is the sum of the repair costs in its repair sequence. By definition,
a configuration's neighbours have the same, or greater, cost to it.
---------------------------------------------------------------------
action(sn,t)/=error[?] t/= $[?]([s0...sn],[t,t0 ...tn])- *LR ([s'...s'm
],[t0...tn])------------------'---'-----------0--------------CR
Insert([s0...sn],[t0...tn])-CR ([s0...sm ],[t0...tn],[insert t])
t /=$--------0------------------------CR Delete([s0...sn],
[t0,t1...tn])-CR ([s0...sn],[t1...tn],[delete])
* ' '([s0...sn],[t0...tn]) -LR ([s0.'..sm],[tj...tn])[?] 0< j <=
Nshiftsj=Nshifts[?]action(sm,tj)[?]-{accept,error}--------- CR Shift 1
([s0...sn],[t0...tn]) -CR ([s'0...s'm],[tj...tn],[s*hift.*.*.shift*])
j
Figure 5: The repair-creating reduction rules for Corchuelo et al..
CR Insert finds all terminals reachable from the current state and
creates insert repairs for them (other than the EOF token '$'). CR
Delete creates deletion repairs if user defined input remains. CR
Shift 1 parses at least 1 and at most N[shifts] tokens; if it reaches
an accept or error state, or parses exactly N[shifts] tokens, then a
shift repair per token shifted is created.
---------------------------------------------------------------------
---------------------------------------------------------------------
1def corchueloetal(pstack, toks):
2 todo = [[(pstack, toks, [])]]
3 cur_cst = 0
4 while cur_cst < len(todo):
5 if len(todo[cur_cst]) == 0:
6 cur_cst += 1
7 continue
8 n = todo[cur_cst].pop()
9 if action(n[0][-1], n[1][0]) == accept \
10 or ends_in_N_shifts(n[2]):
11 return n
12 elif len(n[1]) - len(toks) == N_total:
13 continue
14 for nbr in all_cr_star(n[0], n[1]):
15 if len(n[2]) > 0 and n[2][-1] == delete \
16 and nbr[2][-1] == insert:
17 continue
18 cst = cur_cst + rprs_cst(nbr[2])
19 for _ in range(len(todo), cst):
20 todo.push([])
21 todo[cst].append((nbr[0], nbr[1], \
22 n[2] + nbr[2]))
23 return None
24
25def rprs_cst(rprs):
26 c = 0
27 for r in rprs:
28 if r == shift: continue
29 c += 1
30 return c
31
32def all_cr_star(pstack, toks):
33 # Exhaustively apply the -* CR relation to
34 # (pstack, toks) and return the resulting
35 # list of (pstack, toks, repair) triples.
Figure 6: Our version of the Corchuelo et al. algorithm. The main
function corchueloetal takes in a (parsing stack, token list) pair
and returns: a (parsing stack, token list, repair sequence) triple
where repair sequence is guaranteed to be a minimum cost repair
sequence; or None if it failed to find a repair sequence.
The algorithm maintains a todo list of lists: the first sub-list
contains configurations of cost 0, the second sub-list configurations
of cost 1, and so on. The todo list is initialised with the error
parsing stack, remaining tokens, and an empty repair sequence (line
2). If there are todo items left, a lowest cost configuration n is
picked (lines 4-8). If n represents an accept state (line 9) or if
the last N[shifts] repairs are shifts (line 10), then n represents a
minimum cost repair sequence and the algorithm terminates
successfully (line 11). If n has already consumed N[total] tokens,
then it is discarded (lines 12, 13). Otherwise, n's neighbours are
gathered using the -CR relation (lines 14, 32-35). To avoid duplicate
repairs, delete repairs never follow insert repairs (lines 15-17).
Each neighbour has its repairs costed (line 18) and is then assigned
to the correct todo sub-list (lines 21-22).
The rprs_cst function returns the cost of a repair sequence. Inserts
and deletes cost 1, shifts 0.
---------------------------------------------------------------------
As with the original, we explain the approach in two parts. First is
a new reduction relation -CR which defines a configuration's
neighbours (Figure 5). Second is an algorithm which makes use of the
-CR relation to generate neighbours, and determines when a successful
configuration has been found or if error recovery has failed (Figure
6). As well as several changes for clarity, the biggest difference is
that Figure 6 captures semi-formally what Corchuelo et al. explain in
prose (spread amongst several topics over several pages): perhaps
inevitably we have had to fill in several missing details. For
example, Corchuelo et al. do not define what the cost of repairs is:
for simplicities sake, we define the cost of insert and delete as 1,
and shift as 0.^3
4.2 Ensuring that minimum cost repair sequences aren't missed
CR Shift 1 (see Figure 5) has two flaws which prevent it from
generating all possible minimum cost repair sequences.
---------------------------------------------------------------------
([s0...sn],[t0...tn])- *LR ([s'0...s'm ],[tj...tn]) [?]0<= j <= Nshifts(j
=0[?][s0...sn]/= [s'0...s'm ])[?] j = Nshifts[?] action(s'm,tj)[?]
{accept,error}([s...s-],[t-...t])----([s'...s'-],[t-...t-],
[shift...shift])-------CR Shift 20 n 0 n CR 0 m j n *---**---* j
* ' ((j[s =0.0.[?].sn[s],[.t..0s...]t/=n])[s-'.L.R. (s[s0].[?].R.s=m],
[][)tj[?]..(j.tn=]1)[?][?]R0<== j[s <=hi1ft])
---------0---n----0---m-----'---'------------------CR Shift 3 ([s0
...sn],[t0...tn])-CR ([s0...sm ],[tj...tn],R )
Figure 7: CR Shift 1 always consumes input, when sometimes performing
one or more reduction/gotos without consuming input would be better.
CR Shift 2 addresses this issue. Both CR Shift 1 and CR Shift 2
generate multiple shift repairs in one go, which causes them to skip
'intermediate' (and sometimes important) configurations. CR Shift 3
generates at most one shift, exploring all intermediate
configurations.
---------------------------------------------------------------------
---------------------------------------------------------------------
Delete 3, Delete +
(a) Delete 3, Shift +, Insert Int (b) Insert *, Shift 3, Delete +
Insert +, Shift 3, Shift +, Insert Int Insert +, Shift 3, Delete +
Insert *, Shift 3, Shift +, Insert Int
Figure 8: Given the input '2 3 +' and the grammar from Figure 2, CR
Shift 1 is unable to find any repair sequences because it does not
perform the reductions/gotos necessary after the final insert or
delete repairs to reach an accept state. (a) CR Shift 2 can find 4
minimum cost repair sequences. (b) CR Shift 3 can find a further 2
minimum cost repair sequences on top those found by CR Shift 2 (i.e.
6 in total).
---------------------------------------------------------------------
First, CR Shift 1 always consumes input, missing intermediate
configurations (including accept states!) that only require
reductions/gotos to be performed. CR Shift 2 in Figure 7 shows the
two-phase fix which addresses this problem. We first change the
condition 0 < j <=N[shifts] to 0 <=j <=N[shifts] so that the parser can
make progress without consuming input. However, this opens the
possibility of an infinite loop, so we then add a condition that if
no input is consumed, the parsing stack must have changed. In other
words, we require progress to be made, whether or not that progress
involved consuming input.
Second, CR Shift 1 and CR Shift 2 generate multiple shifts at a time.
This causes them to skip intermediate configurations from which
minimum cost repair sequences may be found. The solution^4 is simple:
at most one shift can be generated at any one time. CR Shift 3 in
Figure 7 (as well as incorporating the fix from CR Shift 2) generates
at most one shift repair at a time. Relative to CR Shift 1, it is
simpler, though it also inevitably slows down the search, as more
configurations are generated.
The problems with CR Shift 1, in particular, can be severe. Figure 8
shows an example input where CR Shift 1 is unable to find any repair
sequences, CR Shift 2 some, and CR Shift 3 all minimum cost repair
sequences.
4.3 Implementation considerations
The definitions we have given thus far do not obviously lead to an
efficient implementation and Corchuelo et al. give few useful hints.
We found that two techniques were both effective at improving
performance while being simple to implement.
First, Corchuelo et al. suggest using a breadth-first search but give
no further details. It was clear to us that the most natural way to
model the search is as an instance of Dijkstra's algorithm. However,
rather than use a general queue data-structure (probably based on a
tree) to discover which element to search next, we use a similar
queue data-structure to [4, p. 25]. This consists of one sub-list per
cost (i.e. the first sub-list contains configurations of cost 0, the
second sub-list configurations of cost 1 and so on). Since we always
know what cost we are currently investigating, finding the next todo
element requires only a single pop (line 8 of Figure 6). Similarly,
adding elements requires only an append to the relevant sub-list
(lines 18, 21, 22). This data-structure is a good fit because costs
in our setting are always small (double digits is unusual for
real-world grammars) and each neighbour generated from a
configuration with cost c has a cost >= c.
Second, since error recovery frequently adjusts and resets parsing
stacks and repair sequences, during error recovery we do not
represent these as lists (as is the case during normal parsing). We
found that lists consume noticeably more memory, and are slightly
less efficient, than using parent pointer trees (often called
'cactuses'). Every node in such a tree has a reference to a single
parent (or null for the root node) but no references to child nodes.
Since our implementation is written in Rust - a language without
garbage collection - nodes are reference counted (i.e. a parent is
only freed when it is not in a todo list and no children point to
it). When the error recovery algorithm starts, it converts the main
parsing stack (a list) into a parent pointer tree; and repair
sequences start as empty parent pointer trees. The -CR part of our
implementation thus operates exclusively on parent pointer trees.
Although this does mean that neighbouring configurations are
scattered throughout memory, the memory sharing involved seems to
more than compensate for poor cache behaviour; it also seems to be a
good fit with modern malloc implementations, which are particularly
efficient when allocating and freeing objects of the same size. This
representation is likely to be a reasonable choice in most contexts,
although it is difficult to know from our experience whether it will
always be the best choice (e.g for garbage collected languages).
One seemingly obvious improvement is to split the search into
parallel threads. However, we found that the nature of the problem
means that parallelisation is more tricky, and less productive, than
might be expected. There are two related problems: we cannot tell in
advance if a given configuration will have huge numbers of successors
or none at all; and configurations are, in general, searched for
successors extremely quickly. Thus if we attempt to seed threads with
initial sets of configurations, some threads quickly run out of work
whilst others have ever growing queues. If, alternatively, we have a
single global queue then significant amounts of time can be spent
adding or removing configurations in a thread-safe manner. A work
stealing algorithm might solve this problem but, as we shall see in
Section 6, CPCT+ runs fast enough that the additional complexity of
such an approach is not, in our opinion, justified.
5 CPCT+
In this section, we extend the Corchuelo et al. algorithm to become
what we call CPCT+. First we enhance the algorithm to find the
complete set of minimum cost repair sequences (Section 5.1). Since
this slows down the search, we optimise by merging together
compatible configurations (Section 5.2). The complete set of minimum
cost repair sequences allows us to make an algorithm less susceptible
to the cascading error problem (Section 5.3). We then change the
criteria for terminating error recovery (Section 5.4).
5.1 Finding the complete set of minimum cost repair sequences
The basic Corchuelo et al. algorithm non-deterministically completes
as soon as it has found a single minimum cost repair sequence. This
is confusing in two different ways: the successful repair sequence
found can vary from run to run; and the successful repair sequence
might not match the user's intention.
We therefore introduce the idea of the complete set of minimum cost
repair sequences: that is, all equivalently good repair sequences.
Although we will refine the concept of 'equivalently good' in
Section 5.3, at this stage we consider all successful repair
sequences with minimum cost c to be equivalently good. In other
words, as soon as we find the first successful repair sequence, its
cost c defines the minimum cost.
An algorithm to generate this set is then simple: when a repair
sequence of cost c is found to be successful, we discard all repair
sequences with cost > c, and continue exploring configurations in
cost c (including, transitively, all neighbours that are also of cost
c; those with cost > c are immediately discarded). Each successful
configuration is recorded and, when all configurations in c have been
explored, the set of successful configurations is returned. One of
these successful configurations is then non-deterministically chosen,
applied to the input, and parsing continued.
5.2 Merging compatible configurations
Relative to finding a single solution, finding the complete set of
repair sequences can be extremely expensive because there may be many
remaining configurations in c, which may, transitively, have many
neighbours. Our solution to this performance problem is to merge
together compatible configurations on-the-fly, preserving their
distinct repair sequences while reducing the search space. Two
configurations are compatible if:
1.
their parsing stacks are identical,
2.
they both have an identical amount of input remaining,
3.
and their repair sequences are compatible.
Two repair sequences are compatible:
1.
if they both end in the same number (n >=0) of shifts,
2.
and, if one repair sequence ends in a delete, the other repair
sequence also ends in a delete.
The first of these conditions is a direct consequence of the fact
that a configuration is deemed successful if it ends in N[shifts]
shift repairs. When we merge configurations, one part of the merge is
'dominant' (i.e. checked for N[shifts]) and the other 'subsumed': we
have to maintain symmetry between the two to prevent the dominant
part accidentally preventing the subsumed part from being recorded as
successful. In other words, if the dominant part of the merge had
fewer shifts at the end of its repair sequence than the subsumed
part, then the N[shifts] check (line 10, Figure 6) would fail, even
though reversing the dominant and subsumed parts may have lead to
success. It is therefore only safe to merge repair sequences which
end in the same number of shifts.
The second condition relates to the weak form of compatible merging
inherited from [5, p. 8]: delete repairs are never followed by an
insert (see Figure 6) since [delete, insert x] always leads to the
same configuration as [insert x, delete]. Although we get much of the
same effect through compatible configuration merging, we keep it as a
separate optimisation because: it is such a frequent case; our use of
the todo list means that we would not catch every case; the duplicate
repair sequences are uninteresting from a user perspective, so we
would have to filter them out later anyway; and each additional merge
costs memory. We thus have to make sure that merged repair sequences
don't accidentally suppress insert repairs because one part of the
repair sequence ends in a delete while the other does not. The
simplest way of solving this problem is thus to forbid merging repair
sequences if one sequence ends in a delete and the other does not.
Fortunately, implementing compatible configuration merging is simple.
We first modify the todo data-structure to be a
list-of-ordered-hashsets^5. This has near-identical append / pop
performance to a normal list, but filters out duplicates with
near-identical performance to an unordered hashset. We then make use
of a simple property of hashsets: an object's hashing behaviour need
only be a non-strict subset of its equality behaviour. Configuration
hashing is based solely on a configuration's parsing stack and
remaining input, giving us a fast way of finding configurations that
are compatible under conditions #1 (identical parsing stacks) and #2
(identical input remaining). As well as checking those two
conditions, configuration equality also checks condition #3
(compatible repair sequences).
Conceptually, merging two configurations together is simple: each
configuration needs to store a set of repair sequences, each of which
is updated as further repairs are found. However, this is an
extremely inefficient representation as the sets involved need to be
copied and extended as each new repair is found. Instead, we reuse
the idea of graph-structured stacks from GLR parsing [28, p. 4] which
allows us to avoid copying whenever possible. The basic idea is that
configurations no longer reference a parent pointer tree of repairs
directly, but instead a parent pointer tree of repair merges. A
repair merge is a pair (repair, merged) where repair is a plain
repair and merged is a (possibly null) set of repair merge sequences.
This structure has two advantages. First, the N[shifts] check can be
performed solely using the first element of repair merge pairs.
Second, we avoid allocating memory for configurations which have not
yet been subject to a merge. The small downside to this scheme is
that expanding configurations into repair sequences requires
recursively expanding both the normal parent pointer tree of the
first element as well as the merged parent pointer trees of the
second element.
Compatible configuration merging is particularly effective in complex
cases, even though it can only merge configurations in the todo list
(i.e. we cannot detect all possible compatible merges). An example of
compatible configuration merging can be seen in Figure 9.
5.3 Ranking repair sequences
---------------------------------------------------------------------
PIC
Figure 9: An elided visualisation of a real run of CPCT+ with the
input '2 3 +' and the grammar from Figure 2. The left hand side of
the tree shows the 'normal' parser at work, which hits an error as
soon as it has shifted the token '2': at this point, CPCT+ starts
operating. As this shows, the search encounters various dead ends, as
well as successful routes. As shown in Figure 8, this input has 6
minimum cost repair sequences, but the search only has 5 success
configurations, because two configurations were merged together.
---------------------------------------------------------------------
---------------------------------------------------------------------
class C { Parsing Error at line 3 col 7. Repair
(a) T x = 2 + (b) sequences found:
T y = 3; 1: Insert ,
}
Figure 10: An example showing how the ranking of repair sequences can
lessen the cascading error problem. The Java example (a) leads to a
parsing error on line 3 at 'y', with three minimum cost repair
sequences found: [insert ,], [insert ?], and [insert (]. These repair
sequences are then ranked by how far they allow parsing to continue
successfully. [insert ,] leads to the rest of the file being parsed
without further error. [insert ?] causes a cascading error at ';'
which must then be resolved by completing the ternary expression
started by '?' (e.g. changing line 3 to 'T ? y : this;'). Similarly,
[insert (] causes a cascading error at ';' which must then be
resolved by inserting a ')'. Since [insert ,] is ranked more highly
than the other repair sequences, the latter are discarded, leading to
the parsing output shown in (b). javac in contrast attempts to insert
';' before 'y' causing a cascading error on the next token.
---------------------------------------------------------------------
In nearly all cases, members of the complete set of minimum cost
repair sequences end with N[shifts ] (the only exception being error
locations near the end of an input where recovery leads to an accept
state). Thus while the repair sequences we find are all equivalently
good within the range of N[shifts ] , some, but not others, may
perform poorly beyond that range. This problem is exacerbated by the
fact that N[shifts] has to be a fairly small integer (we use 3, the
value suggested by Corchuelo et al.) since each additional token
searched exponentially increases the search space. Thus while all
repair sequences found may be locally equivalent, when considered in
the context of the entire input, some may be better than others.
While it is, in general, impractically slow to determine which repair
sequences are the global best, we can quickly determine which are
better under a wider definition of 'local'.
We thus rank configurations which represent the complete set of
minimum cost repair sequences by how far they allow parsing to
continue, up to a limit of N[try] tokens (which we somewhat
arbitrarily set at 250). Taking the furthest-parse point as our top
rank, we then discard all configurations which parsed less input than
this. The reason why we rank the configurations, and not the repair
sequences, is that we only need to rank one of the repair sequences
from each merged configuration, a small but useful optimisation. We
then expand the top ranked configurations into repair sequences and
remove shifts from the end of those repair sequences. Since the
earlier merging of compatible configurations is imprecise (it misses
configurations that have already been processed), there can be some
remaining duplicate repair sequences: we thus perform a final purge
of duplicate repair sequences. Figure 9 shows a visualisation of CPCT
+ in action.
Particularly on real-world grammars, selecting the top-ranked repair
sequences substantially decreases cascading errors (see Figure 10 for
an example). It also does so for very little additional computational
cost, as the complete set of minimum cost repair sequences typically
contains only a small number of items. However, it cannot entirely
reduce the cascading error problem. Since, from our perspective, each
member of the top-ranked set is equivalently good, we
non-deterministically select one of its members to repair the input
and allow parsing to continue. This can mean that we select a repair
sequence which performs less well beyond N[try] tokens than other
repair sequences in the top-ranked set.
5.4 Timeout
The final part of CPCT+ relates to the use of N[total] in Corchuelo
et al.. As with all members of the Fischer et al. family, CPCT+ is
not only unbounded in time [17, p. 14], but also unbounded in memory.
In an attempt to combat this, Corchuelo et al. limits repair
sequences to a maximum of 3 deletes and 4 inserts and a span of at
most 10 tokens, attempting to stop the search from going too far.
Unfortunately it is impossible to find good values for these
constants, as 'too far' is entirely dependent on the grammar and
erroneous input: Java's grammar, for example, is large with a
commensurately large search space (requiring smaller constants) while
Lua's grammar is small with a commensurately small search space
(which can cope with larger constants).
This problem can be easily seen on inputs with unbalanced brackets
(e.g. expressions such as 'x = f(();'): each additional unmatched
bracket exponentially increases the search space. On a modern machine
with a Java 7 grammar, CPCT+ takes about 0.3s to find the complete
set of minimum cost repair sequences for 3 unmatched brackets, 3s for
4 unmatched brackets, and 6 unmatched brackets caused our 32GiB test
machine to run out of RAM.
The only sensible alternative is a timeout: up to several seconds is
safe in our experience. We thus remove N[total] from CPCT+ and rely
entirely on a timeout which, in this paper, is defined to be 0.5s.
6 Experiment
In order to understand the performance of CPCT+, we conducted a large
experiment on real-world Java code. In this section we outline our
methodology (Section 6.1) and results (Section 6.2). Our experiment
is fully repeatable and downloadable from https://archive.org/
download/error_recovery_experiment/0.4/. The results from our
particular run of the experiment can also be downloaded from the same
location.
6.1 Methodology
In order to evaluate error recovery implementations, we need a
concrete implementation. We created a new Yacc-compatible parsing
system grmtools in Rust which we use for our experiments. Including
associated libraries for LR table generation and so on, grmtools is
around 13KLoC. Although intended as a production library, it has
accidentally played a part as a flexible test bed for experimenting
with, and understanding, error recovery algorithms. We added a simple
front-end nimbleparse which produces the output seen in e.g. Figure 1
.
There are two standard problems when evaluating error recovery
algorithms: how to determine if a good job has been done on an
individual example; and how to obtain sufficient examples to get a
wide perspective on an algorithm's performance. Unfortunately,
solutions to these problems are mutually exclusive, since the only
way to tell if a good job has been done on a particular is to
manually evaluate it, which means that it is only practical to use a
small set of input programs. Most papers we are aware of use at most
200 source files (e.g. [5]), with one using a single source file with
minor variants [15]. [4] was the first to use a large-scale corpus of
approximately 60,000 Java source files. Early in the development of
our methodology, we performed some rough experiments which suggested
that statistics only start to stabilise once a corpus exceeds 10,000
source files. We therefore prefer to use a much larger corpus than
most previous studies. We are fortunate to have access to the
Blackbox project [3], an opt-in data collection facility for the
BlueJ editor, which records major editing events (e.g. compiling a
file) and sends them to a central repository. Crucially, one can see
the source code associated with each event. What makes Blackbox most
appealing as a data source is its scale and diversity: it has
hundreds of thousands of users, and a huge collection of source code.
We first obtained a Java 1.5 Yacc grammar and updated it to support
Java 1.7.^6 We then randomly selected source files from Blackbox's
database (following the lead of [27], we selected data from
Blackbox's beginning until the end of 2017-12-31). We then ran such
source files through our Java 1.7 lexer. We immediately rejected
files which didn't lex, to ensure we were dealing solely with parsing
errors^7 (see Section 7.4). We then parsed candidate files with our
Java grammar and rejected any which did parse successfully, since
there is little point running an error recovery algorithm on correct
input. The final corpus consists of 200,000 source files
(collectively a total of 401MiB). Since Blackbox, quite reasonably,
requires each person with access to the source files to register with
them, we cannot distribute the source files directly; instead, we
distribute the (inherently anonymised) identifiers necessary to
extract the source files for those who register with Blackbox.
The size of our corpus means that we cannot manually evaluate repairs
for quality. Instead, we report several other metrics, of which the
number of error locations is perhaps the closest proxy for perceived
quality. However, this number has to be treated with caution for two
reasons. First, it is affected by differences in the failure rate: if
a particular error recovery algorithm cannot repair an entire file
then it may not have had time to find all the 'true' error locations.
Second, the number of error locations only allows relative
comparisons. Although we know that the corpus contains at least
200,000 manually created errors (i.e. at least one per file), we
cannot know if, or how many, files contain more than one error. Since
we cannot know the true number of error locations, we are unable to
evaluate algorithms in an absolute sense.
In order to test hypothesis H1 we ran each error recovery algorithm
against the entire Java corpus, collecting for each file: the time
spent in recovery (in seconds); whether error recovery on the file
succeeded or failed (where failure is due to either the timeout being
exceeded or no repair sequences being found for an error location);
the number of error locations; the cost of repair sequences at each
error location; and the proportion of tokens skipped by error
recovery (i.e. how many delete repairs were applied). We measure the
time spent in error recovery with a monotonic wall-clock timer,
covering the time from when the main parser first invokes error
recovery until an updated parsing stack and parsing index are
returned along with minimum cost repair sequences. The timer is
suspended when normal parsing restarts and resumed if error recovery
is needed again (i.e. the timeout applies to the file as a whole).
We evaluate three main error recovery algorithms: Corchuelo et al.,
CPCT+, and panic mode. Our implementation of Corchuelo et al. is to
some extent a 'best effort' since we have had to fill in several
implementation details ourselves. As per the description, we: use the
same limits on repair sequences (repair sequences can contain at most
3 delete or 4 insert repairs, and cannot span more than 10 tokens in
the input); complete as soon as a single successful repair sequence
is found; and, when no available repair sequence is found, fall back
on panic mode. In addition, we impose the same 0.5s timeout on this
algorithm, as it is otherwise unbounded in length, and sometimes
exhausts available RAM. Panic mode implements the algorithm from
Section 3. We do not report the average cost size for panic mode or
Corchuelo et al. (which falls back on panic mode) since they do not
(always) report repair sequences (see Section 3). Although panic mode
can implicitly delete input from before the error location, we only
include the input it explicitly skips in the proportion of tokens
skipped.
In order to test hypothesis H2, we created a variant of CPCT+ called
CPCT+rev. Instead of selecting from the minimum cost repair sequences
which allow parsing to continue furthest, CPCT+rev selects from those
which allow parsing to continue the least far. This models the worst
case for other members of the Fischer et al. family which
non-deterministically select a single minimum cost repair sequence.
In other words, it allows us to understand how many more errors could
be reported to users of other members of the Fischer et al. family
compared to CPCT+ .
Configuration merging (see Section 5.2) is the most complex part of
CPCT+. To understand whether this complexity leads to better
performance, we created another variant of CPCT+ called CPCT+ DM
which disables configuration merging.
We bootstrap [10] our results 10,000 times to produce 99% confidence
intervals. However, as Figure 12 shows, our distribution is
heavy-tailed, so we cannot bootstrap naively. Instead, we ran each
error recovery algorithm 30 times on each source file; when
bootstrapping we randomly sample one of the 30 values collected
(i.e. our bootstrapped data contains an entry for every file in the
experiment; that entry is one of the 30 values collected for that
file).
All experiments were run on an otherwise unloaded Intel Xeon E3-1240
v6 with 32GiB RAM running Debian 10. We disabled hyperthreading and
turbo boost and ran experiments serially. Our experiments took
approximately 15 days to complete. We used Rust 1.43.1 to compile
grmtools (the Cargo.lock file necessary to reproduce the build is
included in our experimental repository).
6.2 Results
---------------------------------------------------------------------
Mean Median Cost Failure Tokens Error
time (s) time (s) size rate skipped locations
(#) (%) (%) (#)
Corchuelo et 0.042367 0.000335 - 5.54 0.61 374,731
al.
+-0.0000057 +-0.0000007 +-0.004 +-<0.001 +-26
CPCT+ 0.013643 0.000251 1.67 1.63 0.31 435,812
+-0.0000822 +-0.0000003 +-0.001 +-0.017 +-0.001 +-473
Panic mode 0.000002 0.000001 - 0.00 3.72 981,628
+- +- +-<0.001 +-<0.001 +-0
<0.0000001 <0.0000001
CPCT+ DM 0.026127 0.000258 1.63 3.63 0.28 421,897
+-0.0001077 +-0.0000003 +-0.001 +-0.025 +-0.001 +-358
CPCT+rev 0.018374 0.000314 1.77 2.34 0.41 574,979
+-0.0001109 +-0.0000006 +-0.001 +-0.023 +-0.002 +-1104
Figure 11: Summary statistics from running various error recovery
algorithms over a corpus of 200,000 Java files (for all measures,
lower is better). Mean and median times report how much time was
spent in error recovery per file: both figures include files which
exceeded the recovery timeout, so they represent the 'real' times
that users would experience, whether or not all errors are repaired
or not. Cost size reports the mean cost (i.e. the number of insert
and delete repairs) of each error location repaired (this number is
meaningless for Corchuelo et al.- which falls back on panic mode -
and panic mode, since panic mode does not report repair sequences).
The failure rate is the percentage of files which could not be fully
repaired within the timeout. Tokens skipped is the proportion of
input skipped (because of a delete repair). CPCT+rev models the worst
case of non-deterministic Fischer et al. algorithms by reversing the
order of repair ranking (see Section 5.3). CPCT+ DM shows the
performance of CPCT+ if configuration merging is disabled (see
Section 5.2).
---------------------------------------------------------------------
---------------------------------------------------------------------
PIC
Figure 12: A histogram of the time spent in error recovery by CPCT+
for files in our corpus. The x axis shows time (up to the timeout of
0.5s) and the y axis is a logarithmic scale for the number of files.
Error bars represent 99% confidence intervals. As this clearly shows,
the vast majority of files fit in the histogram's first bin; there is
then a gradual decrease until around 0.15s, with a broadly flat
distribution from then until the pronounced peak at the timeout of
0.5s. Figure 17 in the Appendix shows how extending the timeout
increases the number of files which can be successfully recovered.
---------------------------------------------------------------------
Figure 11 shows a summary of the results of our experiment. Comparing
the different algorithms requires care as a higher failure rate tends
to lower the cost size, tokens skipped, and number of error locations
simply because files are not completely parsed. For example, although
Corchuelo et al. reports fewer error locations than CPCT+, that is
probably due to Corchuelo et al.'s higher failure rate; however, as
we shall see in Section 6.3, panic mode's much higher number of error
locations relative to CPCT+ might better be explained by other
factors.
With that caution in mind, the overall conclusions are fairly clear.
CPCT+ is able to repair nearly all input files within the 0.5s
timeout. While panic mode is able to repair every file within the
0.5s timeout, it reports well over twice as many error locations as
CPCT+- in other words, panic mode substantially worsens the cascading
error problem. As well as producing more detailed and accurate
output, CPCT+ has a lower failure rate, median, and mean time than
Corchuelo et al.. The fact that the median recovery time for CPCT+ is
two orders of magnitude lower than its mean recovery time suggests
that only a small number of outliers cause error recovery to take
long enough to be perceptible to humans; this is confirmed by the
histogram in Figure 12. These results strongly validate Hypothesis
H1.
Corchuelo et al.'s poor performance may be surprising, as it produces
at most one (possibly non-minimum cost) repair sequence whereas CPCT+
produces the complete set of minimum cost repair sequences - in other
words, CPCT+ is doing more work, more accurately, and in less time
than Corchuelo et al.. There are three main reasons for Corchuelo et
al.'s poor performance. First, the use of CR Shift 1 causes the
search to miss intermediate nodes that would lead to success being
detected earlier. Second, the heuristics used to stop the search from
going too far (e.g. limiting a repair sequence's number of inserts
and deletes) are not well-suited to a large grammar such as Java's:
the main part of the search often exceeds the timeout, leaving no
time for the fallback mechanism of panic mode to be used. Finally,
Corchuelo et al. lacks configuration merging, causing it to perform
needless duplicate work.
---------------------------------------------------------------------
PIC
Figure 13: A histogram of the number of files with 1-50 error
locations for CPCT+ and CPCT+rev . Note that we exclude: any files
for which an error recovery algorithms did not find a single error
location (either because the timeout was exceeded or no repair
sequences could be found); and a handful of outliers which distort
the full histogram found in Figure 18 in the Appendix. The x axis
shows the number of error locations in a file and the y axis is a
logarithmic scale for the number of files. Error bars represent 99%
confidence intervals. As this histogram shows, the entire
distribution is skewed slightly rightwards by CPCT+rev, showing that
CPCT+rev makes error recovery slightly worse in a number of files
(rather than making error recovery in a few files a lot worse).
---------------------------------------------------------------------
CPCT+ ranks the complete set of minimum cost repair sequences by how
far parsing can continue and chooses from those which allow parsing
to continue furthest. CPCT+rev, in contrast, selects from those which
allow parsing to continue the least far. CPCT+rev shows that the
ranking technique used in CPCT+ substantially reduces the potential
for cascading errors: CPCT+rev leads to 31.93%+-0.289% more error
locations being reported to users relative to CPCT+. We visualise
this in the histogram of Figure 13 which shows all files with 1-50
error locations (a complete histogram can be found in Figure 18 in
the Appendix). Note that files where error recovery did not complete
and no error locations were found (which happens occasionally with
CPCT+rev) are excluded from this histogram (since we know that every
file in the corpus has at least one error), but files where error
recovery did not complete but some error locations were found are
included (since this gives us, at least, a lower bound on the number
of error locations). As Figure 13 shows, the distribution of error
locations in CPCT+ and CPCT+rev is similar, with the latter simply
shifted slightly to the right. In other words, CPCT+rev makes error
recovery slightly worse in a number of files (rather than making
error recovery in a small number of files a lot worse). This strongly
validates Hypothesis H2.
Interestingly, and despite its higher failure rate, CPCT+rev has a
noticeably higher mean cost of repair sequences relative to CPCT+. In
other words, CPCT+rev not only causes more error locations to be
reported, but those additional error locations have longer repair
sequences. This suggests that there is a double whammy from cascading
errors: not only are more error locations reported, but the poorer
quality repair sequences chosen make subsequent error locations
disproportionately harder for the error recovery algorithm to recover
from.
CPCT+ DM shows that configuration merging has a significant effect on
the failure rate, in our opinion justifying both its conceptual
complexity and the less than 100LoC Rust code taken to implement it.
The slowdown in the mean and median time for CPCT+ DM suggests that
configuration merging is particularly effective on files with complex
or numerous errors.
6.3 The impact of skipping input
The number of error locations reported by panic mode is well over
twice that of CPCT+; even given CPCT+ 's higher failure rate relative
to panic mode, this seemed hard to explain. We thus made an
additional hypothesis:
H3
The greater the proportion of tokens that are skipped, the
greater the number of error locations.
The intuition underlying this hypothesis is that, in general, the
user's input is very close to being correct and that the more input
that error recovery skips, the less likely it is to get back to a
successful parse. We added the ability to record the proportion of
tokens skipped as the result of delete repairs during error recovery.
The results in Figure 11 show a general correlation between the
proportion of tokens skipped and the number of error locations (e.g.
CPCT+ skips very little of the user's input; CPCT+rev skips a little
more; and panic mode skips an order of magnitude more). However,
Corchuelo et al. does not obviously follow this pattern: relative to
the other algorithms, its number of error locations does not
correlate with the proportion of input skipped. This is mostly
explained by its high mean time and high failure rate: Corchuelo et
al. tends to fail on files with large numbers of error locations,
underreporting the 'true' number of error locations simply because it
cannot make it all of the way through such files before the timeout.
However, this outlier means that we consider Hypothesis H3 to be only
weakly validated.
7 Using error recovery in practice
Although several members of the Fischer et al. family were
implemented in parsing tools of the day, to the best of our knowledge
none of those implementations have survived. Equally, we are not
aware of any member of the Fischer et al. family which explains how
error recovery should be used or, indeed, if it has any implications
for users at all.
We are therefore forced to treat the following as an open question:
can one sensibly use error recovery in the Fischer et al. family in
practice? In particular, given that the most common way to use LR
grammars is to execute semantic actions as each production is
reduced, what should semantic actions do when parts of the input have
been altered by error recovery? This latter question is important for
real-world systems (e.g. compilers) which can still perform useful
computations (e.g. running a type checker) in the face of syntax
errors.
While different languages are likely to lend themselves to different
solutions, in this section we show that grmtools allows sensible
integration of error recovery in a Rust context. Readers who prefer
to avoid Rust-specific details may wish to move immediately to
Section 8.
7.1 A basic solution
---------------------------------------------------------------------
1%start Expr
2%%
3Expr -> u64:
4 Factor "+" Expr { $1 + $3 }
5 | Factor { $1 }
6 ;
7
8Factor -> u64:
9 Term "*" Factor { $1 * $3 }
10 | Term { $1 }
11 ;
12
13Term -> u64:
14 "(" Expr ")" { $2 }
15 | "INT"
16 {
17 let n = $lexer.span_str($1.unwrap()
18 .span());
19 match s.parse::() {
20 Ok(val) => val as u64,
21 Err(_) => panic!(
22 "{} cannot be represented as a u64",
23 s)
24 }
25 }
26 ;
Figure 14: A naive version of the calculator grammar with semantic
actions on each production. The traditional Yacc %union declaration
is unwieldy in Rust. Instead, grmtools allows each rule to have a
Rust return type associated with it (between '->' and ':'): the
actions of each production in that rule must return values of that
type. $n variables reference the nth symbol in a production (where $1
references the first symbol). If that symbol is a reference to a
rule, a value of that rule's Rust type will be stored in that
variable. If that symbol is a token then the user's input can be
obtained by $lexer.span_str($1.unwrap().span()) (line 17-18). Note
that while this grammar's semantic actions work as expected for
inputs such as '2 + 3 * 4', they will panic if too large a number is
passed (lines 21-24), or if an integer is inserted by error recovery.
Figure 15 shows how to avoid both problems.
---------------------------------------------------------------------
Figure 14 shows a naive grmtools version of the grammar from Figure 2
that can evaluate numeric results as parsing occurs (i.e. given the
input 2 + 3 *4 it returns 14). This grammar should mostly be familiar
to Yacc users: each production has a semantic action (i.e. Rust code
that is executed when the production is reduced); and symbols in the
production are available to the semantic action as pseudo-variables
named $n (a production of n symbols has n pseudo-variables with the
first symbol connected to $1 and so on). A minor difference from
traditional Yacc is that grmtools allows rules to specify a different
return type, an approach shared with other modern parsers such as
ANTLR [22].
A more significant difference relates to the $n pseudo-variables: if
they reference a rule R, then their type is R's return type; if they
reference a token T, then their type is (slightly simplified) Result
. We will explain the reasons for this shortly, but
at this stage it suffices to note that, unless a token was inserted
by error recovery, we can extract tokens by calling $1.unwrap(), and
obtain the actual string the user passed by using the globally
available $lexer.span_str function.
7.2 Can semantic action execution continue in the face of error
recovery?
---------------------------------------------------------------------
1%start Expr
2%avoid_insert "INT"
3%%
4Expr -> Result>:
5 Factor "+" Expr { Ok($1? + $3?) }
6 | Factor { $1 }
7 ;
8
9Factor -> Result>:
10 Term "*" Factor { Ok($1? * $3?) }
11 | Term { $1 }
12 ;
13
14Term -> Result>:
15 '(' Expr ')' { $2 }
16 | 'INT'
17 {
18 let t = $1.map_err(|_|
19 "")?;
20 let n = $lexer.span_str(t.span());
21 match s.parse::() {
22 Ok(val) => Ok(val as u64),
23 Err(_) => {
24 Err(Box::from(format!(
25 "{} cannot be represented as a u64",
26 s
27 )))
28 }
29 }
30 }
31 ;
Figure 15: A more sophisticated version of the grammar from Figure 14
. Each rule now returns a Result type. If an integer is inserted by
error recovery, the Term rule stops evaluation by percolating the Err
value upwards using the '?' operator (which, if the Result-returning
expression it is attached to evaluates to an Err, immediately returns
that error; otherwise it unwraps the Ok); all other rules percolate
such errors upwards similiarly. As a convenience for the user, the
contents of the 'Err' value are changed from a lexeme to a string
explaining why the calculator has not produced a value (lines 18-19).
Note that other token types are unaffected: if error recovery inserts
a bracket, for example, evaluation of the expression continues.
---------------------------------------------------------------------
In Yacc, semantic actions can assume that each symbol in the
production has 'normal' data attached to it (either a rule's value or
the string matching a token; Yacc's error recovery is implicitly
expected to maintain this guarantee) whereas, in our setting,
inserted tokens have a type but no value. Given the input '(2 + 3',
the inserted close bracket is not hugely important, and our
calculator returns the value 5. However, given the input '2 +', CPCT+
finds a single repair sequence [Insert Int]: what should a calculator
do with an inserted integer? Our naive calculator simply panics
(which is roughly equivalent to 'raises an exception and then exits')
in such a situation (the unwrap in Figure 14 on line 17). However,
there are two alternatives to this rather extreme outcome: the
semantic action can assume a default value or stop further execution
of semantic values while allowing parsing to continue. Determining
which is the right action in the face of inserted tokens is
inherently situation specific. We therefore need a pragmatic way for
users to control what happens in such cases.
The approach we take is to allow users to easily differentiate normal
vs. inserted tokens in a semantic action. Pseudo-variables that
reference tokens have (slightly simplified) the Rust type Result
. Rust's Result type^8 is a sum type which represents
success (Ok(...)) or error (Err(...)) conditions. We use the Ok case to
represent 'normal' tokens created from user input and the Err case to
represent tokens inserted by error recovery. Since the Result type is
widely used in Rust code, users can avail themselves of standard
idioms.
For example, we can then alter our calculator grammar to continue
parsing, but stop executing meaningful semantic action code, when an
inserted integer is encountered. We change grammar rules from
returning type T to Result> (where Box
is roughly equivalent to 'can return any type of error'). It is then,
deliberately, fairly easy to use with the Result
type: for tokens whose value we absolutely require, we use Rust's '?'
operator (which passes Ok values through unchanged but returns Err
values to the function caller) to percolate our unwillingness to
continue evaluation upwards. While Box is slightly
verbose, it is a widely understood Rust idiom. Figure 15 shows that
changing the grammar to make use of this idiom requires relatively
little extra code.
7.3 Avoiding insert repairs when possible
Although we now have a reasonable mechanism for dealing with inserted
tokens, there are cases where we can bypass them entirely. For
example, consider the input '2 + + 3', which has two repair sequences
[Delete +], [Insert Int]: evaluation of the expression can continue
with the former repair sequence, but not the latter. However, as
presented thus far, these repair sequences are ranked equally and one
non-deterministically selected.
We therefore added an optional declaration %avoid_insert to grmtools
which allows users to specify those tokens which, if inserted by
error recovery, are likely to prevent semantic actions from
continuing execution. In practise, this is synonymous with those
tokens whose values (and not just their types) are important. In the
calculator grammar only the INT token satisfies this criteria, so we
add %avoid_insert "INT" to the grammar. We then make a simple change
to the repair sequence ranking of Section 5.3 such that the final
list of repair sequences is sorted with inserts of such tokens at the
bottom of the list. In our case, this means that we always select
Delete + as the repair sequence to apply to the input '2 + + 3'
(i.e. the Insert Int repair sequence is always presented as the
second option).
7.4 Turning lexing errors into parsing errors
---------------------------------------------------------------------
(a) . "ERRORTOKEN"
(b) ErrorRule -> ():
"ERRORTOKEN" { }
;
Parsing error at line 1 column 3. Repair sequences found:
(c) 1: Insert +, Delete @
2: Insert *, Delete @
3: Delete @, Delete 3
Result: 9
Figure 16: A simple way of turning lexing errors into parsing errors
in grmtools. First, we add a ERRORTOKEN token type, which matches
otherwise invalid input, to the end of the Lex file (a). Second, we
add a rule ErrorRule to the Yacc grammar referencing ERRORTOKEN (b).
Note that ErrorRule must not be referenced by any other rule in the
Yacc grammar. With those two steps complete, input with lexing errors
such as '2 @ 3 + 4' invokes normal error recovery (c).
---------------------------------------------------------------------
In most traditional parsing systems, lexing errors are distinct from
parsing errors: only files which can be fully lexed can be parsed.
This is confusing for users, who are often unaware of the distinction
between these two phases. To avoid this, we lightly adapt the idea of
error tokens from incremental parsing [31, p. 99]. In essence, any
input which cannot be lexed is put into a token whose type is not
referenced in the normal grammar. This guarantees that all possible
input lexes without error and, when the parser encounters an error
token, normal error recovery commences^9 .
A basic, but effective, version of this requires no special support
from grmtools (see Figure 16). First, we add a new toke type to the
lexer that matches each otherwise invalid input character. Second,
since grmtools requires that all tokens defined in the lexer are
referenced in the grammar, we add a dummy rule to the grammar that
references the token (making sure not to reference this rule
elsewhere in the grammar). These two steps are sufficient to ensure
that users always see the same style of error messages, and the same
style of error recovery, no matter whether they make a lexing or a
parsing error.
8 Threats to validity
Although it might not be obvious at first, CPCT+ is
non-deterministic, which can lead to different results from one run
to the next. The root cause of this problem is that multiple repair
sequences may have identical effects up to N[try] tokens, but cause
different effects after that value. By running each file through each
error recovery multiple times and reporting confidence intervals, we
are able to give a good - though inevitably imperfect - sense of the
likely variance induced by this non-determinism.
Our implementation of Corchuelo et al. is a 'best effort'. The
description in the paper is incomplete in places and, to the best of
our knowledge, the accompanying source code is no longer available.
We thus may not have faithfully implemented the intended algorithm.
Blackbox contains an astonishingly large amount of source code but
has two inherent limitations. First, it only contains Java source
code. This means that our main experiment is limited to one grammar:
it is possible that our techniques do not generalise beyond the Java
grammar (though, as Appendix A suggests, our techniques do appear to
work well on other grammars). Although [4, p. 109] suggests that
different grammars make relatively little difference to the
performance of such error recovery algorithms, we are not aware of an
equivalent repository for other language's source code. One solution
is to mutate correct source files (e.g. randomly deleting tokens),
thus obtaining incorrect inputs which we can later test: however, it
is difficult to uncover and then emulate the numerous, sometimes
surprising, ways that humans make syntax errors, particularly as some
are language specific (though there is some early work in this area [
7]). Second, Blackbox's data comes largely from students, who are
more likely than average to be somewhat novice programmers. It is
clear that novice programmers make some different syntax errors -
and, probably, make some syntax errors more often - relative to
advanced programmers. For example, many of the files with the
greatest number of syntax errors are caused by erroneous fragments
repeated with variants (i.e. it is likely that the programmer wrote a
line of code, copy and pasted it, edited it, and repeated that
multiple times before deciding to test for syntactic validity). It is
thus possible that a corpus consisting solely of programs from
advanced programmers would lead to different results. We consider
this a minor worry, partly because a good error recovery algorithm
should aim to perform well with inputs from users of different
experience levels.
Our corpus was parsed using a Java 1.7 grammar, but some members of
the corpus were almost certainly written using Java 1.8 or later
features. Many - though not all - post-1.7 Java features require a
new keyword: such candidate source files would thus have failed our
initial lexing test and not been included in our corpus. However,
some Java 1.8 files will have made it through our checks. Arguably
these are still a valid test of our error recovery algorithms. It is
even likely that they may be a little more challenging on average,
since they are likely to be further away from being valid syntax than
files intended for Java 1.7.
9 Related work
Error recovery techniques are so numerous that there is no definitive
reference or overview of them. However, [8] contains an overall
historical analysis and [4] an excellent overview of much of the
Fischer et al. family. Both must be supplemented with more recent
works.
The biggest limitation of error recovery algorithms in the Fischer et
al. family (including CPCT+ ) is that they find repairs at the point
that an error is discovered, which may be later in the file than the
cause of the error. Thus even when they successfully recover from an
error, the repair sequence reported may be very different from the
fix the user considers appropriate (note that this is distinct from
the cascading error problem, which our ranking of repair sequences in
Section 5.3 partly addresses). A common, frustrating, example of this
is a missing '}' character in C/Java-like languages. Some approaches
are able to backtrack from the source of the error in order to try
and find more appropriate repairs. However, there are two challenges
to this: first, the cost of maintaining the necessary state to
backtrack slows down normal parsing (e.g. [6] only stores the
relevant state at each line encountered to reduce this cost), whereas
we add no overhead at all to normal parsing; second, the search-space
is so hugely increased that it can be harder to find any repairs at
all [8].
One approach to global error recovery is to use machine learning to
train a system on syntactically correct programs [27]: when a syntax
error is encountered, the resulting model is used to suggest
appropriate global fixes. Although [27] also use data from Blackbox,
their experimental methodology is both stricter - aiming to find
exactly the same repair as the human user applied - and looser - they
only consider errors which can be fixed by a single token, discarding
42% of the data [27, p. 8]) whereas we attempt to fix errors which
span multiple tokens. It is thus difficult to directly compare our
results to theirs. However, by the high bar they have set themselves,
they are able to repair 52% of single-token errors.
As CPCT+rev emphasises, choosing an inappropriate repair sequence
during error recovery leads to cascading errors. The noncorrecting
error recovery approach proposed by [26] explicitly addresses this
weakness, eschewing repairs entirely. When a syntax error is
discovered, noncorrecting error recovery attempts to discover all
further syntax errors by checking whether the remaining input (after
the point an error is detected) is a valid suffix in the language.
This is achieved by creating a recogniser that can identify all valid
suffixes in the language. Any errors identified in the suffix parse
are guaranteed to be genuine syntax errors because they are
uninfluenced by errors in the (discarded) prefix (though this does
mean that some genuine syntax errors are missed that would not have
been valid suffixes at that point in the user's input had the
original syntax error not been present). There seem to be two main
reasons why noncorrecting error recovery has not been adopted. First,
building an appropriate recogniser is surprisingly tricky and we are
not currently aware of one that can handle the full class of LR
grammars (though the full class of LL grammars has been tackled [30
]), though we doubt that this problem is insoluble. Second, as soon
as a syntax error is encountered, noncorrecting error recovery is
unable to execute semantic actions, since it lacks the execution
context they need.
Although one of our paper's aims is to find the complete set of
minimum cost repair sequences, it is unclear how best to present them
to users, leading to questions such as: should they be simplified?
should a subset be presented? and so on. Although rare, there are
some surprising edge cases. For example, the (incorrect) Java 1.7
expression 'x = f(""a""b);' leads to 23,067 minimum cost repair
sequences being found, due to the large number of Java keywords that
are valid in several parts of this expression leading to a
combinatorial explosion: even the most diligent user is unlikely to
find such a volume of information valuable. In a different vein, the
success condition of 'reached an accept' state is encountered rarely
enough that we sometimes forgot that it could happen and were
confused by an apparently unexplained difference in the repair
sequences reported for the same syntax chunk when it was moved from
the end to the middle of a file. There is a body of work which has
tried to understand how best to structure compiler error messages
(normally in the context of those learning to program). However, the
results are hard to interpret: some studies find that more complex
error messages are not useful [20], while others suggest they are [24
]. It is unclear to us what the right approach might be, or how it
could be applied in our context.
The approach of [17] is similar to Corchuelo et al., although the
former cannot incorporate shift repairs. It tries harder than CPCT+
to prune out pointless search configurations [17, p. 12], such as
cycles in the parsing stack, although this leads to some minimum cost
repairs being skipped [2]. A number of interlocking, sophisticated
pruning mechanisms which build on this are described in [4]. These
are significantly more complex than our merging of compatible
configurations: since this gives us acceptable performance in
practise, we have not investigated other pruning mechanisms.
The most radical member of the Fischer et al. family is that of [15]^
10 . This generates repair sequences in the vein of Corchuelo et al.
using the A* algorithm and a precomputed distance table. [15] works
exclusively on the stategraph, assuming that it is unambiguous.
However, Yacc systems allow ambiguous stategraphs and provide a means
for resolving those ambiguities when creating the statetable. Many
real-world grammars (e.g. Lua, PHP) make use of ambiguity resolution.
In an earlier online draft, we created MF, a statetable-based
algorithm which extends CPCT+ with ideas from [15] at the cost of
significant additional complexity. With the benefit of hindsight, we
do not consider MF's relatively small benefits (e.g. reducing the
failure rate by approximately an additional 0.5%) to be worth that
additional complexity.
CPCT+ takes only the grammar and token types into account. However,
it is possible to use additional information, such as nesting
(e.g. taking into account curly brackets) and indentation when
recovering from errors. This has two aims: reducing the size of the
search space (i.e. speeding up error recovery); and making it more
likely that the repairs reported matched the user's intentions. The
most sophisticated approach in this vein we are aware of is that of [
6]. At its core, this approach uses GLR parsing: after a grammar is
suitably annotated by the user, it is then transformed into a
'permissive' grammar which can parse likely erroneous inputs; strings
which match the permissive parts of the grammar can then be
transformed into a non-permissive counterpart. In all practical
cases, the transformed grammar will be ambiguous, hence the need for
generalised parsing. Our use of parent-pointer trees in configuration
merging gives that part of our algorithm a similar feel to GLR
parsing (even though we do not generate ambiguous strings). However,
there are major differences: LR parsers are much simpler than GLR
parsers; and the Fischer et al. family of algorithms do not require
manually annotating, or increasing the size of, the grammar.
A different approach to error recovery is that taken by [23]: rather
than try and recover from errors directly, it reports in natural
language how the user's input caused the parser to reach an error
state (e.g. "I read an open bracket followed by an expression, so I
was expecting a close bracket here"), and possible routes out of the
error (e.g. "A function or variable declaration is valid here"). This
involves significant manual work, as every parser state (1148 in the
Java grammar we use) in which an error can occur needs to be manually
marked up, though the approach has various techniques to lessen the
problem of maintaining messages as a grammar evolves.
Many compilers and editors have hand-written parsers with
hand-written error recovery. Though generally ad-hoc in their
approach, it is possible, with sufficient effort, to make them
perform well. However, this comes at a cost. For example, the
hand-written error recovery routines in the Eclipse IDE are
approximately 5KLoC and are solely for use with Java code: CPCT+ is
approximately 500LoC and can be applied to any LR grammar.
Although error recovery approaches have, historically, been mostly LR
based, there are several non-LR approaches. A full overview is
impractical, though a few pointers are useful. When LL parsers
encounter an error, they generally skip input until a token in the
follow set is encountered (an early example is [29]). Although this
outperforms the simple panic mode of Section 3, it will, in general,
clearly skip more input than CPCT+, which is undesirable. LL parsers
do, however, make it somewhat easier to express grammar-specific
error recovery rules. The most advanced LL approach that we are aware
of is IntelliJ's Grammar-Kit, which allows users to annotate their
grammars for error recovery. Perhaps the most interesting annotation
is that certain rules can be considered as fully matched even if only
a prefix is matched (e.g. a partially completed function is parsed as
if it was complete). It might be possible to add similar ideas to a
successor of CPCT+, though this is more awkward to express in an LR
approach. Error recovery for PEG grammars is much more challenging,
because the non-LL parts of the grammar mean that there is not always
a clearly defined point at which an error is determined to have
occurred. PEG error recovery has thus traditionally required
extensive manual annotations in order to achieve good quality
recovery. [18] tackles this problem by automatically adding many
(though not necessarily all) of the annotations needed for good PEG
error recovery. However, deciding when to add, and when not to add,
annotations is a difficult task and the two algorithms presented have
different trade-offs: the Standard algorithm adds more annotations,
leading to better quality error recovery, but can change the input
language accepted; the Unique algorithm adds fewer annotations,
leading to poorer quality error recovery, but does not affect the
language accepted. The quality of error recovery of the Unique
algorithm, in particular, is heavily dependent on the input grammar:
it works well on some (e.g. Pascal) but less well on others
(e.g. Java). In cases where it performs less well, it can lead to
parsers which skip large portions (sometimes the remainder) of the
input.
While the field of programming languages has largely forgotten the
approach of [1], there are a number of successor works (e.g. [25]).
These improve the time complexity, though none that we are aware of
address the issue of how to present to the user what has been done.
We are not aware of any error recovery algorithms that are formally
verified. Indeed, as shown in this paper, some have serious flaws. We
are only aware of two works which have begun to consider what
correctness for such algorithms might mean: [32] provides a brief
philosophical justification of the need and [12] provides an outline
of an approach. Until such time as someone verifies a full error
recovery algorithm, it is difficult to estimate the effort involved,
or what issues may be uncovered.
10 Conclusions
In this paper we have shown that error recovery algorithms in the
Fischer et al. family can run fast enough to be usable in the real
world. Extending such algorithms to produce the complete set of
minimum cost repair sequences allows parsers to provide better
feedback to users, as well as significantly reducing the cascading
error problem. The CPCT+ algorithm is simple to implement (less than
500LoC in our Rust system) and still has good performance.
Looking to the future, we, perhaps immodestly, suggest that CPCT+
might be 'good enough' to serve as a common representative of the
Fischer et al. family. However, we do not think that it is the
perfect solution. We suspect that, in the future, multi-phase
solutions will be developed. For example, one may use noncorrecting
error recovery (e.g. [26]) to identify syntax errors, and then use a
combination of machine-learning (e.g. [27]) and CPCT+ to discover
those repair sequences that do not lead to additional error locations
being encountered.
Acknowledgements: We are grateful to the Blackbox developers for
allowing us access to their data, and particularly to Neil Brown for
help in extracting a relevant corpus. We thank Edd Barrett for
helping to set up our benchmarking machine and for comments on the
paper. We also thank Carl Friedrich Bolz-Tereick, Sergio Queiroz de
Medeiros, Sarah Mount, Francois Pottier, Christian Urban, and
Naveneetha Vasudevan for comments. This research was funded by the
EPSRC Lecture (EP/L02344X/1) Fellowship.
References
1 Alfred Aho and Thomas G. Peterson. A minimum distance
error-correcting parser for context-free languages. 1:305-312,
December 1972.
2 Eberhard Bertsch and Mark-Jan Nederhof. On failure of the pruning
technique in "error repair in shift-reduce parsers". TOPLAS, 21
(1):1-10, January 1999.
3 Neil C. C. Brown, Michael Kolling, Davin McCall, and Ian Utting.
Blackbox: A large scale repository of novice programmers activity. In
SIGCSE, March 2014.
4 Carl Cerecke. Locally least-cost error repair in LR parsers. PhD
thesis, University of Canterbury, June 2003.
5 Rafael Corchuelo, Jose Antonio Perez, Antonio Ruiz-Cortes, and
Miguel Toro. Repairing syntax errors in LR parsers. TOPLAS,
24:698-710, November 2002.
6 Maartje de Jonge, Lennart C. L. Kats, Eelco Visser, and Emma
Soderberg. Natural and flexible error recovery for generated modular
language environments. TOPLAS, 34(4):15:1-15:50, December 2012.
7 Maartje de Jonge and Eelco Visser. Automated evaluation of syntax
error recovery. In ASE, pages 322-325, September 2012.
8 Pierpaolo Degano and Corrado Priami. Comparison of syntactic
error handling in LR parsers. SPE, 25(6):657-679, June 1995.
9 Joel E. Denny and Brian A. Malloy. The IELR(1) algorithm for
generating minimal LR(1) parser tables for non-LR(1) grammars with
conflict resolution. SCICO, 75(11):943-979, November 2010.
10 Bradley Efron. Bootstrap methods: Another look at the jackknife.
Ann. Statist., 7(1):1-26, January 1979.
11 C. N. Fischer, B. A. Dion, and J. Mauney. A locally least-cost
lr-error corrector. Technical Report 363, University of Wisconsin,
August 1979.
12 Carlos Gomez-Rodriguez, Miguel A. Alonso, and Manuel Vilares.
Error-repair parsing schemata. TCS, 411(7):1121 - 1139, February
2010.
13 Allen I. Holub. Compiler Design in C. Prentice-Hall, 1990.
14 S. C. Johnson. YACC: Yet Another Compiler-Compiler. Technical
Report Comp. Sci. 32, Bell Labs, July 1975.
15 Ik-Soon Kim and Kwangkeun Yi. LR error repair using the A*
algorithm. Acta Inf., 47:179-207, May 2010.
16 Donald Knuth. On the translation of languages from left to
right. Information and Control, 8(6):607-639, December 1965.
17 Bruce J. McKenzie, Corey Yeatman, and Lorraine de Vere. Error
repair in shift-reduce parsers. TOPLAS, 17(4):672-689, July 1995.
18 Sergio Medeiros, Gilney de Azevedo Alvez Junior, and Fabio
Mascarenhas. Syntax error recovery in parsing expression grammars.
May 2019.
19 Sergio Medeiros and Fabio Mascarenhas. Syntax error recovery in
parsing expression grammars. In SAC, pages 1195-1202, April 2018.
20 Marie-Helene Nienaltowski, Michela Pedroni, and Bertrand Meyer.
Compiler error messages: What can help novices? In SIGCSE, pages
168-172, March 2008.
21 David Pager. A practical general method for constructing LR(k)
parsers. Acta Inf., 7(3):249-268, September 1977.
22 Terence Parr. The definitive ANTLR 4 reference. Pragmatic
Bookshelf, January 2013.
23 Francois Pottier. Reachability and error diagnosis in lr(1)
parsers. In CC, pages 88-98, March 2016.
24 James Prather, Raymond Pettit, Kayla Holcomb McMurry, Alani
Peters, John Homer, Nevan Simone, and Maxine Cohen. On novices'
interaction with compiler error messages: A human factors approach.
In ICER, pages 74-82, August 2017.
25 Sanguthevar Rajasekaran and Marius Nicolae. An error correcting
parser for context free grammars that takes less than cubic time. In
LATA, February 2016.
26 Helmut Richter. Noncorrecting syntax error recovery. TOPLAS, 7
(3):478-489, July 1985.
27 Eddie Antonio Santos, Joshua Charles Campbell, Dhvani Patel,
Abram Hindle, and Jose Nelson Amaral. Syntax and sensibility: Using
language models to detect and correct syntax errors. In SANER, pages
1-11, March 2018.
28 Masaru Tomita. An efficient augmented-context-free parsing
algorithm. Comput. Linguist., 13(1-2):31-46, January 1987.
29 D. A. Turner. Error diagnosis and recovery in one pass
compilers. IPL, 6(4):113-115, August 1977.
30 Arthur van Deudekom, Dick Grune, and Peter Kooiman. Initial
experience with noncorrecting syntax error handling" number =
"ir-339" institution = "vrije universiteit, amsterdam" pages = "11.
Technical report, November 1993.
31 Tim A. Wagner. Practical Algorithms for Incremental Software
Development Environments. PhD thesis, University of California,
Berkeley, March 1998.
32 Vadim Zaytsev. Formal foundations for semi-parsing. In SMR,
pages 313-317, February 2014.
Appendix (not peer-reviewed)
A Curated examples
In this section we show several examples of error recovery using CPCT
+ in different languages, to give some idea of what error recovery
looks like in practise, and to emphasise that the algorithms in this
paper are grammar neutral. All of these examples use the output from
the nimbleparse tool that is part of grmtools.
A.1 Java 7
Example 1 input:
1class C {
2 int x y;
3}
Example 1 output:
Parsing error at line 2 column 9. Repair sequences found:
1: Delete y
2: Insert ,
3: Insert =
Example 2 input:
1class C {
2 void f() {
3 if true {
4 }
5}
Example 2 output:
Parsing error at line 3 column 8. Repair sequences found:
1: Insert (, Shift true, Insert )
Parsing error at line 5 column 2. Repair sequences found:
1: Insert }
Example 3 (taken from [6, p. 10]) input:
1class C {
2 void f() {
3 if (temp.greaterThan(MAX) // missing )
4 fridge.startCooling();
5 }
6}
Example 3 output:
Parsing error at line 4 column 7. Repair sequences found:
1: Insert )
Example 4 (taken from [6, p. 16]) input:
1class C {
2 void methodX() {
3 if (true)
4 foo();
5 }
6 int i = 0;
7 while (i < 8)
8 i=bar(i);
9 }
10 }
11}
Example 4 output:
Parsing error at line 7 column 5. Repair sequences found:
1: Insert {
Parsing error at line 11 column 1. Repair sequences found:
1: Delete }
Example 5 (taken from [19, p. 2]):
1public class Example {
2 public static void main(String[] args) {
3 int n = 5;
4 int f = 1;
5 while(0 < n) {
6 f = f * n;
7 n = n - 1
8 };
9 System.out.println(f);
10 }
11}
Example 5:
Parsing error at line 8 column 5. Repair sequences found:
1: Delete }
2: Insert ;
Parsing error at line 11 column 2. Repair sequences found:
1: Insert }
Example 6:
1class C {
2 void f() {
3 x((((((
4 }
5}
Example 6 output, showing the timeout being exceeded and error
recovery unable to complete:
Parsing error at line 4 column 3. No repair sequences found.
A.2 Lua 5.3
Example 1 input:
1print("Hello World"
Example 1 output:
Parsing error at line 1 column 20. Repair sequences found:
1: Insert )
Example 2 input. Note that '=' in Lua is the assignment operator,
which is not valid in conditionals; and that if/then/else blocks must
be terminated by 'end'.
1function fact (n)
2 if n = 0 then
3 return 1
4 else
5 return n * fact(n-1)
6end
Example 2 output:
Parsing error at line 2 column 8. Repair sequences found:
1: Delete =, Delete 0
2: Insert or, Delete =
3: Insert and, Delete =
4: Insert ==, Delete =
5: Insert ~=, Delete =
6: Insert >=, Delete =
7: Insert <=, Delete =
8: Insert >, Delete =
9: Insert <, Delete =
10: Insert |, Delete =
11: Insert ~, Delete =
12: Insert &, Delete =
13: Insert >>, Delete =
14: Insert <<, Delete =
15: Insert -, Delete =
16: Insert +, Delete =
17: Insert .., Delete =
18: Insert %, Delete =
19: Insert //, Delete =
20: Insert /, Delete =
21: Insert *, Delete =
22: Insert ^, Delete =
Parsing error at line 6 column 4. Repair sequences found:
1: Insert end
Examples 3 and 4 (both derived from the Lua 5.3 reference manual)
show that CPCT+ naturally deals with an inherent ambiguity in Lua's
Yacc grammar involving function calls and assignments (which,
following the Lua specification, is resolved by Yacc in favour of
function calls). Example 3 shows the 'unambiguous' case (i.e. if Lua
forced users to use ';' everywhere, the grammar would have no
ambiguities):
1a = b + c;
2(print or io.write)'done')
Example 3 output:
Parsing error at line 2 column 26. Repair sequences found:
1: Delete )
2: Insert (
Example 4 shows what happens in the 'ambiguous' case (which Lua's
grammar resolves in favour of viewing the code below as a function
call to c):
1a = b + c
2(print or io.write)'done')
Example 4 output:
Parsing error at line 2 column 26. Repair sequences found:
1: Delete )
Example 5 (taken from [19, p. 7]):
1if then print("that") end
Example 5 output:
Parsing error at line 1 column 4. Repair sequences found:
1: Insert [=[]=]
2: Insert ""
3: Insert ...
4: Insert
5: Insert true
6: Insert false
7: Insert nil
8: Insert
A.3 PHP 7.3
Example 1 input:
1function n() {
2 $x = 1
3}
Example 1 output:
Parsing error at line 3 column 1. Repair sequences found:
1: Insert ;
Example 2 input:
1$a = array("foo", "bar");
2$a[0;
Example 2 output:
Parsing error at line 2 column 5. Repair sequences found:
1: Insert ]
Example 3 input:
1class X {
2 function a($x) {
3 if $x {
4 }
5}
Example 3 output:
Parsing error at line 3 column 12. Repair sequences found:
1: Insert (, Shift $x, Insert )
Parsing error at line 5 column 2. Repair sequences found:
1: Insert }
---------------------------------------------------------------------
PIC
Figure 17: A histogram showing the effect of increasing the timeout
from 0.5s to 2s (see Figure 12 for the histogram showing a timeout of
0.5s). Increasing the timeout from 0.5s to 2s lowers the failure rate
from 1.63%+-0.017% to 1.02%+-0.016%, with a slowly decreasing number of
files succeeding as the timeout increases. Although we used a timeout
of 0.5s on the basis that we felt most users would tolerate such a
delay, others may wish to pick a shorter or longer timeout depending
on their perception of their users tolerance of delay vs. tolerance
of failed error correction.
---------------------------------------------------------------------
---------------------------------------------------------------------
PIC
Figure 18: The full histogram of the number of error locations. The
small number of outliers obscures the main bulk of the data - see
Figure 13 for the truncated version.
---------------------------------------------------------------------
^1[21] can over-merge states when conflicts occur [9, p. 3] (i.e.
when Yacc uses precedence rules to turn an ambiguous grammar into an
unambiguous LR parser). Since our error recovery approach operates
purely on the statetable, it should work correctly with other merging
approaches such as that of [9].
^2Note that step 2 in Holub causes valid repairs to be missed: while
it is safe to ignore the top element of the parsing stack on the
first iteration of the algorithm, as soon as one token is skipped,
one must check all elements of the parsing stack. Our description
simply drops step 2 entirely.
^3It is trivial to extend this to variable token costs if desired,
and our implementation supports this. However, it is unclear whether
non-uniform token costs are useful in practise [4, p.96].
^4The problem, and the basis of a fix, derive from [15, p. 12],
though their suggestion suffers from the same problem as CR Shift 1.
^5An ordered hashset preserves insertion order, and thus allows
list-like integer indexing as well as hash-based lookups.
^6Unfortunately, changes to the method calling syntax in Java 1.8
mean that it is an awkward, though not impossible, fit for an LR(1)
formalism such as Yacc, requiring substantial changes to the current
Java Yacc grammar. We consider the work involved beyond that useful
for this paper.
^7Happily, this also excludes files which can't possibly be Java
source code. Some odd things are pasted into text editors.
^8Equivalents are found in several other languages: Haskell's Either;
O'Caml's result; or Scala's Either.
^9Note that this is an opt-in feature: it was not enabled for the
experiments in Section 6.
^10In an earlier online draft of this paper we stated that this
algorithm has a fundamental flaw. We now believe this was due to us
incorrectly assuming that the 'delete' optimisation of Corchuelo et
al. applied to [15]. We apologise to the authors for this mistake.
Internal[mail] info@soft-dev.org