[HN Gopher] Don't Panic! Better, Fewer, Syntax Errors for LR Par...
       ___________________________________________________________________
        
       Don't Panic! Better, Fewer, Syntax Errors for LR Parsers
        
       Author : matt_d
       Score  : 53 points
       Date   : 2020-07-15 19:51 UTC (3 hours ago)
        
 (HTM) web link (soft-dev.org)
 (TXT) w3m dump (soft-dev.org)
        
       | bollu wrote:
       | Awesome. I love good research into parsing, since parsing is
       | famously "the solved problem that isn't":
       | https://tratt.net/laurie/blog/entries/parsing_the_solved_pro...
       | 
       | I hope to study this paper well (which is by the author of the
       | above post) and understand what their algorithm is. I feel that
       | good error messages make or break a language.
       | 
       | Here is the arxiv link of the same paper:
       | https://arxiv.org/pdf/1804.07133.pdf
        
       | nickmqb wrote:
       | I have a somewhat controversial opinion on this.
       | 
       | This paper, like some others that I've seen, treats parsing as an
       | academic/algorithmic topic. Given some token stream we'd like to
       | minimize some error criteria. However, these papers ignore the
       | fact that people don't write token streams: they write code that
       | is formatted using whitespace. I don't know any programmer that
       | writes their source code on a single line. On the contrary,
       | pretty much every programmer formats their source code to some
       | reasonable (if not complete) degree. In other words: parsing
        | error recovery is not primarily an algorithmic problem: it's a
       | usability problem.
       | 
       | Newlines and indentation are a source of extra information that
       | we can use to infer what the programmer meant. Why throw all that
       | away during tokenization? That's crazy! We can totally use it for
       | error recovery/error message generation.
       | 
        | I decided to play around with this idea in the design of my
       | programming language Muon [1]. The language uses redundant
       | significant whitespace for parser error recovery. This simple
       | approach by itself has turned out to work surprisingly well! In
       | my admittedly biased opinion, it frequently surpasses the error
       | recovery quality in mature compilers such as the C# compiler,
       | which has had tons of effort poured into error recovery.
       | 
       | Of course, a purely whitespace based recovery scheme is not
       | perfect: there are rough edges, like having to deal with tabs vs
       | spaces, and recovering inside a line is not usually possible. But
       | the fact that such a simple approach has led to good results
       | makes me think this would be a great area for future research,
       | that can perhaps combine the best of both worlds here.
       | 
       | [1] https://github.com/nickmqb/muon
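
The whitespace-as-recovery-signal idea can be sketched in a few lines of Python. This is a hypothetical illustration, not Muon's actual implementation: every non-blank line starting at column zero is treated as a safe point to resume parsing, so one malformed top-level statement yields one error instead of a cascade.

```python
def recovery_points(source):
    """Indices of lines that start at column zero: in
    whitespace-formatted code these are safe places to resume
    parsing after a syntax error."""
    points = []
    for i, line in enumerate(source.splitlines()):
        if line and not line[0].isspace():  # non-blank, column zero
            points.append(i)
    return points

def parse_with_recovery(source, parse_stmt):
    """Parse each top-level chunk independently; a syntax error in
    one chunk is reported once and parsing resumes at the next
    top-level line, so errors do not cascade."""
    lines = source.splitlines()
    starts = recovery_points(source)
    parsed, errors = [], []
    for n, start in enumerate(starts):
        end = starts[n + 1] if n + 1 < len(starts) else len(lines)
        chunk = "\n".join(lines[start:end])
        try:
            parsed.append(parse_stmt(chunk))
        except SyntaxError as e:
            errors.append((start + 1, str(e)))  # 1-based line number
    return parsed, errors
```

Here `parse_stmt` stands in for any single-statement parser; the point is only that the recovery boundaries come from layout, not from the grammar.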
        
         | estebank wrote:
          | I think you hit the nail right on the head, but I would say
          | that whitespace is only _one_ source of signal. Lately I've
         | become convinced that languages _need_ redundancy in their
         | grammar to help with error recovery. Having a more flexible
          | syntax actually _hurts_ usability because valid but
          | semantically incorrect code can't be caught early enough,
          | which makes it much harder to give friendly errors. rustc uses
         | whitespace for error recovery in several places, although the
         | most prevalent is checking whether two different elements are
          | on different lines, such as detecting a `:` where a
          | statement-ending `;` was actually meant[1].
         | 
         | [1]: https://play.rust-
         | lang.org/?version=stable&mode=debug&editio...
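
A hypothetical sketch of the line-based heuristic described above (not rustc's actual code): given a token stream annotated with line numbers, a `:` whose following token starts a new line is flagged as a probable mistyped statement-ending `;`.

```python
def suggest_semicolon(tokens):
    """tokens: list of (text, line) pairs. Flag each ':' that is
    probably a mistyped ';'. The heuristic: the token after a real
    type-ascription ':' normally sits on the same line, so a ':'
    whose successor starts a new line likely ends a statement."""
    suggestions = []
    for (tok, line), (_nxt, nxt_line) in zip(tokens, tokens[1:]):
        if tok == ":" and nxt_line > line:
            suggestions.append((line, "':' here may be a mistyped ';'"))
    return suggestions
```

Note how the same token is treated differently purely based on layout: `x : u32` on one line is left alone, while a trailing `:` at end of line is flagged.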
        
           | pjungwir wrote:
           | > languages need redundancy in their grammar
           | 
           | IMO extra redundancy makes Greek much easier to read than
           | Latin. The upfront learning is a bit more effort, but then
           | _reading_ it is much easier. Unlike Latin it has articles
            | (which are declined and agree with the nouns/adjectives),
           | and the endings on verbs & nouns have way less ambiguity. I
           | can certainly see how the same would apply to programming
           | languages.
        
         | thechao wrote:
          | Pseudo-regular parsing combined with hand-written Pratt
          | parsers is a nice sweet spot: easy to read, _very robust_,
          | and produces great error messages. The technique was, I am
          | told, moderately
         | popular in the 80s and 90s. (My familiarity with them comes
         | from a language called Spad.)
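
For readers unfamiliar with the technique, a Pratt parser is small enough to show in full. A minimal sketch for arithmetic expressions, where each operator's binding power drives precedence and associativity:

```python
import re

# Binding powers: higher binds tighter (so * beats +).
BP = {"+": 10, "-": 10, "*": 20, "/": 20}

def tokenize(src):
    return re.findall(r"\d+|[+\-*/()]", src)

def parse(tokens, min_bp=0):
    """Pratt parsing: parse a prefix, then loop, consuming infix
    operators only while they bind at least as tightly as min_bp."""
    tok = tokens.pop(0)
    if tok == "(":
        left = parse(tokens, 0)
        assert tokens.pop(0) == ")", "expected ')'"
    else:
        left = int(tok)
    while tokens and tokens[0] in BP and BP[tokens[0]] >= min_bp:
        op = tokens.pop(0)
        right = parse(tokens, BP[op] + 1)  # +1 makes ops left-associative
        left = (op, left, right)           # build a nested-tuple AST
    return left
```

Because the grammar lives in ordinary code rather than a table, each failure point (the `assert` above, for instance) is a natural place to hand-craft an error message, which is a big part of why the approach is robust.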
        
         | dan-robertson wrote:
         | I sort-of do write source code in one long line: I use an
         | autoformatter so when I want to insert something, I usually
         | just write it in one long line and hit save. If I have a syntax
         | error the line won't get formatted and then I have to stare at
         | it.
         | 
         | That said, I think whitespace-based error recovery is a good
         | idea and would probably still help in my case.
        
       | cornstalks wrote:
       | I've thought about doing something like this when using
       | matching/parsing with derivatives[1]. When the graph collapses to
        | ∅ (indicating an error), you could back up one character, then
       | ask the graph "what are the next valid token(s)?" This would give
       | you a list of possible things to insert (`,` and `=` for the
       | example given in the article, though naively implementing this
       | would also suggest stuff like `;`).
       | 
       | But this article takes it a step further and suggests deletions,
       | too. That's actually really cool. My biggest gripe with people
       | making DSLs is the crappy error messages you get when you have a
       | syntax error (GCL, I'm looking at you). Hopefully research like
       | this will improve the lives of those of us that have to use DSLs.
       | 
       | [1]: http://matt.might.net/articles/parsing-with-derivatives/
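
The "back up and ask what's valid next" idea falls out of derivatives naturally, since the derivative of a language with respect to a token is itself a language. A minimal sketch over token-level regular expressions (a simplification; full parsing with derivatives handles context-free grammars): on an error, probe each token in the alphabet and keep those whose derivative is non-empty.

```python
# Regular expressions over tokens, as nested tuples:
#   ("eps",)       matches the empty string
#   ("tok", t)     matches the single token t
#   ("alt", a, b)  a | b
#   ("seq", a, b)  a then b
#   ("star", a)    a*
# None denotes the empty language -- a dead end, i.e. a parse error.

def nullable(r):
    """Does r match the empty string?"""
    if r is None: return False
    kind = r[0]
    if kind == "eps": return True
    if kind == "tok": return False
    if kind == "alt": return nullable(r[1]) or nullable(r[2])
    if kind == "seq": return nullable(r[1]) and nullable(r[2])
    return True  # star

def deriv(r, t):
    """Brzozowski derivative: the language of suffixes after consuming t."""
    if r is None: return None
    kind = r[0]
    if kind == "eps": return None
    if kind == "tok": return ("eps",) if r[1] == t else None
    if kind == "alt":
        a, b = deriv(r[1], t), deriv(r[2], t)
        if a is None: return b
        if b is None: return a
        return ("alt", a, b)
    if kind == "seq":
        first = deriv(r[1], t)
        head = None if first is None else ("seq", first, r[2])
        if not nullable(r[1]): return head
        tail = deriv(r[2], t)
        if head is None: return tail
        if tail is None: return head
        return ("alt", head, tail)
    d = deriv(r[1], t)  # star
    return None if d is None else ("seq", d, r)

def valid_next(r, alphabet):
    """On an error, ask the pre-error state which tokens it accepts."""
    return [t for t in alphabet if deriv(r, t) is not None]
```

With a toy declaration language `int id (, id)* ;`, feeding in `int x y` dies at `y`, and probing the pre-error state suggests `,` and `;` as insertions, matching the naive behaviour described above.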
        
       | estebank wrote:
       | > A more grammar-specific variation of this idea is to skip input
       | until a pre-determined synchronisation token (e.g. ';' in Java)
       | is reached [8 , p. 3], or to try inserting a single
       | synchronisation token. Such strategies are often unsuccessful,
       | leading to a cascade of spurious syntax errors (see Figure 1 for
       | an example). Programmers quickly learn that only the location of
       | the first error in a file - not the reported repair, nor the
       | location of subsequent errors - can be relied upon to be
        | accurate.
        | 
        |     C.java:2: error: ';' expected
        |         int x y;
        |               ^
        |     C.java:2: error: <identifier> expected
        |         int x y;
        |                ^
       | 
       | This is a common problem, but the solution here for the example
       | shown is for the parser _not_ to attempt to recover that
       | statement and _ignore_ everything until it reaches the
       | "synchronization point" (I call them "landmarks"), which sounds
        | like what they were arguing for at first. Doing that would
        | mark the AST with a node along the lines of "some int named
       | x with errors", and would completely _skip_ everything between
       | the `x` and the `;`. Because an error has been reported, there
       | will be no output binary, but the second error will _not_ be
       | emitted.
       | 
       | Beyond the parser, the AST needs to keep around a lot of extra
       | information for error recovery that would otherwise not be
       | needed. An example from rustc[1] (what I'm familiar with) is
       | structs and struct variants that have had some parse error. We
       | recover the parse and keep a node for the variant in the AST,
       | which means that later passes of the compiler can type and borrow
        | check, but if one of them finds a _use_ that is missing fields
        | or _adding_ extra ones, we _do not_ emit an error for it, under
        | the assumption that the parse error caused the field being used
        | to not be captured in the AST node for the variant. This can
        | have a quite dramatic impact on the number of errors emitted.
       | 
       | [1]: https://github.com/rust-
       | lang/rust/blob/7e11379f3b4c376fbb9a6...
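
The pattern described above can be sketched as follows (a hypothetical illustration, not rustc's actual data structures): AST nodes produced by error recovery carry an error flag, and later passes suppress checks whose failure is plausibly an artifact of the tokens lost during recovery.

```python
from dataclasses import dataclass

@dataclass
class StructDef:
    name: str
    fields: set
    has_parse_errors: bool = False  # set when recovery kicked in

def check_field_use(struct, used_fields):
    """Check a struct literal against its definition. If the
    definition was recovered from a parse error, missing/extra
    fields are *not* reported: they may simply have been lost
    during recovery, so any error would likely be spurious."""
    if struct.has_parse_errors:
        return []  # suppress cascading errors from the bad parse
    errors = []
    for f in sorted(used_fields - struct.fields):
        errors.append(f"unknown field `{f}` on `{struct.name}`")
    for f in sorted(struct.fields - used_fields):
        errors.append(f"missing field `{f}` on `{struct.name}`")
    return errors
```

The cost is exactly what the comment says: the AST has to carry extra recovery metadata that a parser for well-formed input would never need.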
        
       | emmanueloga_ wrote:
       | Most grammars define a very specific starting point for
       | languages. This is fine of course, but what if we want to parse,
       | say, only an expression, only a function body, etc?
       | 
       | I feel most parsing APIs tend to expose a single method,
        | something like:
        | 
        |     parser.parse(source)
       | 
       | ... which is kind of a limited API to do the things I ask about
       | above.
       | 
        | With a more flexible API, a structure-aware editor [1] could
       | keep different areas of the buffer separated, and know if the
       | user is typing a function body, a variable definition, etc, and
       | only parse that part w/o breaking the rest of the parser progress
       | (is this a good idea? I think so but I don't think most editors
       | work like this, so maybe not :-)
       | 
       | --
       | 
       | An area of parser APIs that I've recently seen trouble with is
       | lack of metadata on the productions. I'd like to be able to do
        | something like:
        | 
        |     production.meta() => { :line 23, :column 10 }
       | 
        | ... or something like that. This is very useful for error
        | reporting, for instance. I was trying to find a parser for RDF
       | Turtle that would do this, but couldn't find any!
       | 
       | 1: https://en.wikipedia.org/wiki/Structure_editor
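
A sketch of what such an API could look like (all names hypothetical): a parser exposing several entry points instead of a single `parse(source)`, with every node carrying line/column metadata retrievable via `meta()`.

```python
import re
from dataclasses import dataclass

@dataclass
class Node:
    kind: str
    value: str
    line: int    # 1-based
    column: int  # 0-based

    def meta(self):
        return {"line": self.line, "column": self.column}

class Parser:
    """Toy parser with multiple entry points: callers pick the start
    symbol they need rather than always parsing a whole file."""
    def __init__(self, source):
        self.source = source

    def _tokens(self):
        for m in re.finditer(r"\S+", self.source):
            line = self.source.count("\n", 0, m.start()) + 1
            col = m.start() - (self.source.rfind("\n", 0, m.start()) + 1)
            yield m.group(), line, col

    def parse_expression(self):
        text, line, col = next(self._tokens())
        return Node("expr", text, line, col)

    def parse_variable_definition(self):
        toks = list(self._tokens())
        assert len(toks) >= 3 and toks[1][0] == "=", "expected `name = value`"
        name, line, col = toks[0]
        return Node("vardef", name, line, col)
```

An editor could then call `parse_variable_definition` on just the region the user is typing in, leaving the rest of the buffer's parse untouched.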
        
       | tzs wrote:
       | In the first figure, it gives "int x y;" and says that the new
       | algorithm gives the complete set of minimal cost repair
       | sequences, which is in this case "Delete y", "Insert ,", and
       | "Insert =".
       | 
       | I haven't read the article, so maybe this is covered later, but
       | why isn't "Delete x" included?
        
         | laszlokorte wrote:
          | Just a guess: "int x" could already be consumed so there is no
          | need to change anything about it. It's only the next token
          | that cannot be processed.
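
That guess can be illustrated with a toy breadth-first search over insert/delete repairs, much simplified relative to the paper's algorithm: edits are only tried at or after the error index, since everything before it has already been consumed; if the guess is right, that is why "Delete x" is never offered.

```python
import re
from collections import deque

TOKENS = ["int", "id", ",", "=", ";"]

def accepts(toks):
    """Toy Java-ish declaration grammar: int id (, id)* (= id)? ;"""
    return re.fullmatch(r"int id(?: , id)*(?: = id)? ;",
                        " ".join(toks)) is not None

def minimal_repairs(toks, err, max_cost=2):
    """BFS over edit sequences applied at or after index `err`,
    returning all repairs of minimal cost (shortest edit count)."""
    results, found_cost = [], None
    queue = deque([(toks, err, [])])  # (tokens, edit position, repair log)
    seen = {(tuple(toks), err)}
    while queue:
        cur, pos, log = queue.popleft()
        if found_cost is not None and len(log) > found_cost:
            break  # BFS: all minimal-cost repairs already found
        if accepts(cur):
            found_cost = len(log)
            results.append(log)
            continue
        if len(log) == max_cost:
            continue
        for i in range(pos, len(cur) + 1):
            if i < len(cur):  # try deleting token i
                nxt = cur[:i] + cur[i + 1:]
                if (tuple(nxt), i) not in seen:
                    seen.add((tuple(nxt), i))
                    queue.append((nxt, i, log + [f"Delete {cur[i]}"]))
            for t in TOKENS:  # try inserting t before position i
                nxt = cur[:i] + [t] + cur[i:]
                if (tuple(nxt), i + 1) not in seen:
                    seen.add((tuple(nxt), i + 1))
                    queue.append((nxt, i + 1, log + [f"Insert {t}"]))
    return sorted(", ".join(log) for log in results)
```

Running this on `int x y ;` (tokens `["int", "id", "id", ";"]`, error at index 2) yields the figure's three repairs and no "Delete x", because index 1 is never a candidate edit site.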
        
         | [deleted]
        
       | ltratt wrote:
       | Co-author here. If you want to quickly play with this on the
       | command-line with your favourite Yacc grammar, a simple way is to
       | use nimbleparse https://crates.io/crates/nimbleparse ('cargo
       | install nimbleparse' should do the trick; though note that you
       | will probably need to munge your lexer a bit). You can see this
       | in use in the paper itself: the examples at the end are direct
       | output from nimbleparse
       | https://github.com/softdevteam/error_recovery_paper/tree/mas....
       | Some example grammars are at
       | https://github.com/softdevteam/error_recovery_paper/tree/mas...
       | if you want to get going quickly.
       | 
       | If instead you want to use the full set of Rust libraries, a good
       | place to start is
       | https://softdevteam.github.io/grmtools/master/book/quickstar....
        
       | choeger wrote:
       | That is a very well-written paper and an interesting topic! Kudos
       | to the authors! I'd really like this to turn into a thesis and
       | further research.
       | 
        | Edit: it also follows the de facto standard for the naming of
        | CS papers, which demands a pun followed by a quick explanation.
        | Brilliant!
       | 
       | I also just saw that it might be too late for the thesis part. A
       | pity.
        
         | MaxBarraclough wrote:
         | > the de-facto standard for the naming of cs-papers, which
         | demands a pun followed by a quick explanation
         | 
         | I'm reminded of the following wordplay. (I'm ashamed to say I
         | was rather slow to catch the pun when I first heard it.)
         | 
         |  _The aim of the semantic web is to save the world._
        
       ___________________________________________________________________
       (page generated 2020-07-15 23:00 UTC)