[HN Gopher] Don't Panic! Better, Fewer, Syntax Errors for LR Par... ___________________________________________________________________ Don't Panic! Better, Fewer, Syntax Errors for LR Parsers Author : matt_d Score : 53 points Date : 2020-07-15 19:51 UTC (3 hours ago) (HTM) web link (soft-dev.org) (TXT) w3m dump (soft-dev.org) | bollu wrote: | Awesome. I love good research into parsing, since parsing is | famously "the solved problem that isn't": | https://tratt.net/laurie/blog/entries/parsing_the_solved_pro... | | I hope to study this paper well (which is by the author of the | above post) and understand what their algorithm is. I feel that | good error messages make or break a language. | | Here is the arxiv link of the same paper: | https://arxiv.org/pdf/1804.07133.pdf | nickmqb wrote: | I have a somewhat controversial opinion on this. | | This paper, like some others that I've seen, treats parsing as an | academic/algorithmic topic. Given some token stream we'd like to | minimize some error criteria. However, these papers ignore the | fact that people don't write token streams: they write code that | is formatted using whitespace. I don't know any programmer that | writes their source code on a single line. On the contrary, | pretty much every programmer formats their source code to some | reasonable (if not complete) degree. In other words: parsing | error recovery is not primarly an algorithmic problem: it's a | usability problem. | | Newlines and indentation are a source of extra information that | we can use to infer what the programmer meant. Why throw all that | away during tokenization? That's crazy! We can totally use it for | error recovery/error message generation. | | I decided to play around with this idea in the design my | programming language Muon [1]. The language uses redundant | significant whitespace for parser error recovery. This simple | approach by itself has turned out to work surprisingly well! In | my admittedly biased opinion, it frequently surpasses the error | recovery quality in mature compilers such as the C# compiler, | which has had tons of effort poured into error recovery. | | Of course, a purely whitespace based recovery scheme is not | perfect: there are rough edges, like having to deal with tabs vs | spaces, and recovering inside a line is not usually possible. But | the fact that such a simple approach has led to good results | makes me think this would be a great area for future research, | that can perhaps combine the best of both worlds here. | | [1] https://github.com/nickmqb/muon | estebank wrote: | I think you hit the nail right in the head, but I would say | that whitespace is only _one_ source of signal. Lately I 've | become convinced that languages _need_ redundancy in their | grammar to help with error recovery. Having a more flexible | syntax actually _hurts_ usability because valid but | semantically incorrect code can 't be caught early enough and | makes it much harder to give friendly errors. rustc uses | whitespace for error recovery in several places, although the | most prevalent is checking whether two different elements are | on different lines, like when detecting a `:` when a statement | ending `;` was actually meant[1]. | | [1]: https://play.rust- | lang.org/?version=stable&mode=debug&editio... | pjungwir wrote: | > languages need redundancy in their grammar | | IMO extra redundancy makes Greek much easier to read than | Latin. The upfront learning is a bit more effort, but then | _reading_ it is much easier. Unlike Latin it has articles | (which are declined and agree with the nouns /adjectives), | and the endings on verbs & nouns have way less ambiguity. I | can certainly see how the same would apply to programming | languages. | thechao wrote: | Pseudo-regular parsing combined with hand-written pratt-parsers | is a nice sweet-spot: easy to read, _very robust_ , and produce | great error messages. The technique was, I am told, moderately | popular in the 80s and 90s. (My familiarity with them comes | from a language called Spad.) | dan-robertson wrote: | I sort-of do write source code in one long line: I use an | autoformatter so when I want to insert something, I usually | just write it in one long line and hit save. If I have a syntax | error the line won't get formatted and then I have to stare at | it. | | That said, I think whitespace-based error recovery is a good | idea and would probably still help in my case. | cornstalks wrote: | I've thought about doing something like this when using | matching/parsing with derivatives[1]. When the graph collapses to | [?] (indicating an error), you could back up one character, then | ask the graph "what are the next valid token(s)?" This would give | you a list of possible things to insert (`,` and `=` for the | example given in the article, though naively implementing this | would also suggest stuff like `;`). | | But this article takes it a step further and suggests deletions, | too. That's actually really cool. My biggest gripe with people | making DSLs is the crappy error messages you get when you have a | syntax error (GCL, I'm looking at you). Hopefully research like | this will improve the lives of those of us that have to use DSLs. | | [1]: http://matt.might.net/articles/parsing-with-derivatives/ | estebank wrote: | > A more grammar-specific variation of this idea is to skip input | until a pre-determined synchronisation token (e.g. ';' in Java) | is reached [8 , p. 3], or to try inserting a single | synchronisation token. Such strategies are often unsuccessful, | leading to a cascade of spurious syntax errors (see Figure 1 for | an example). Programmers quickly learn that only the location of | the first error in a file - not the reported repair, nor the | location of subsequent errors - can be relied upon to be | accurate. C.java:2: error: ';' expected | int x y; ^ C.java:2: error: <identifier> | expected int x y; ^ | | This is a common problem, but the solution here for the example | shown is for the parser _not_ to attempt to recover that | statement and _ignore_ everything until it reaches the | "synchronization point" (I call them "landmarks"), which sounds | like what their were arguing for at first. Doing that would make | the AST be marked with a node along the lines of "some int named | x with errors", and would completely _skip_ everything between | the `x` and the `;`. Because an error has been reported, there | will be no output binary, but the second error will _not_ be | emitted. | | Beyond the parser, the AST needs to keep around a lot of extra | information for error recovery that would otherwise not be | needed. An example from rustc[1] (what I'm familiar with) is | structs and struct variants that have had some parse error. We | recover the parse and keep a node for the variant in the AST, | which means that later passes of the compiler can type and borrow | check, but if it finds a _use_ that is either missing or _adding_ | fields we _do not_ emit an error for it under the assumption that | the parse error caused the field being used to not be captured in | the AST node for the variant. This can have quite dramatic impact | on the quantity of errors being emitted. | | [1]: https://github.com/rust- | lang/rust/blob/7e11379f3b4c376fbb9a6... | emmanueloga_ wrote: | Most grammars define a very specific starting point for | languages. This is fine of course, but what if we want to parse, | say, only an expression, only a function body, etc? | | I feel most parsing APIs tend to expose a single method, | something like: parser.parse(source) | | ... which is kind of a limited API to do the things I ask about | above. | | With a more flexible API, an structure aware editor [1] could | keep different areas of the buffer separated, and know if the | user is typing a function body, a variable definition, etc, and | only parse that part w/o breaking the rest of the parser progress | (is this a good idea? I think so but I don't think most editors | work like this, so maybe not :-) | | -- | | An area of parser APIs that I've recently seen trouble with is | lack of metadata on the productions. I'd like to be able to do | something like: production.meta() => { :line | 23, :column 10 } | | ... or something like that. This is very useful while error | reporting, for instance. I was trying to find a parser for RDF | Turtle that would do this, but couldn't find any! | | 1: https://en.wikipedia.org/wiki/Structure_editor | tzs wrote: | In the first figure, it gives "int x y;" and says that the new | algorithm gives the complete set of minimal cost repair | sequences, which is in this case "Delete y", "Insert ,", and | "Insert =". | | I haven't read the article, so maybe this is covered later, but | why isn't "Delete x" included? | laszlokorte wrote: | Just a guess: "int x" could already be comsumed so there is no | need to change anything about it. It's only the next token that | can not be processed. | [deleted] | ltratt wrote: | Co-author here. If you want to quickly play with this on the | command-line with your favourite Yacc grammar, a simple way is to | use nimbleparse https://crates.io/crates/nimbleparse ('cargo | install nimbleparse' should do the trick; though note that you | will probably need to munge your lexer a bit). You can see this | in use in the paper itself: the examples at the end are direct | output from nimbleparse | https://github.com/softdevteam/error_recovery_paper/tree/mas.... | Some example grammars are at | https://github.com/softdevteam/error_recovery_paper/tree/mas... | if you want to get going quickly. | | If instead you want to use the full set of Rust libraries, a good | place to start is | https://softdevteam.github.io/grmtools/master/book/quickstar.... | choeger wrote: | That is a very well-written paper and an interesting topic! Kudos | to the authors! I'd really like this to turn into a thesis and | further research. | | Edit: it also follows the de-facto standard for the naming of cs- | papers, which demands a pun followed by a quick explanation. | Brilliant! | | I also just saw that it might be too late for the thesis part. A | pity. | MaxBarraclough wrote: | > the de-facto standard for the naming of cs-papers, which | demands a pun followed by a quick explanation | | I'm reminded of the following wordplay. (I'm ashamed to say I | was rather slow to catch the pun when I first heard it.) | | _The aim of the semantic web is to save the world._ ___________________________________________________________________ (page generated 2020-07-15 23:00 UTC)