[HN Gopher] Just Write the Parser ___________________________________________________________________ Just Write the Parser Author : matt_d Score : 106 points Date : 2020-10-20 14:44 UTC (1 days ago) (HTM) web link (tiarkrompf.github.io) (TXT) w3m dump (tiarkrompf.github.io) | [deleted] | wrs wrote: | One reason to use a parser generator for a _new_ language is that | it will tell you that your language is inherently ambiguous or | otherwise broken in a way you didn't realize. A hand-coded parser | tends to just codify your bad assumptions. | | It can actually be a good idea to maintain a YACC (or other) | grammar for your language just to run the tool as a kind of | "grammar linter", even if you're going to write the parser by | hand. | epistasis wrote: | There's been a ton of code that I've written that in retrospect, | I should never have written. | | But I've never regretted when I wrote a parser. Maybe because it | takes such a large activation energy to get over the hump and | actually do it, that I only do it when absolutely necessary. But | it always seems easier and more useful than I thought before | doing it. | | Now that this post has prompted that realization, I wonder if it | will stay true... | jtolmar wrote: | I'd highly recommend looking at a few toy parser projects, like | this and Crafting Interpreters (which are both recursive decent, | with very different styles), writing however much of a typical | parser you think makes sense "for fun," and then saving that | somewhere you can get at it. | | Writing a parser from scratch is kind of an obnoxious hurdle, but | if you have some code you're already familiar with that you can | copy-paste in and modify to your purposes, that's a lot more | accessible. | | (Also hats off to this author for making a parser so damn terse.) | Const-me wrote: | Did something similar a few times when I needed that. | | Here's an open source example: https://github.com/Const- | me/vis_avs_dx/tree/master/avs_dx/Dx... It's mostly in C++, just a | few offline pieces in C# in T4 templates (they produce C++ code). | tiarkrompf wrote: | Author here - happy to answer questions, as always. | | Thanks also for feedback on the format of the article. It's a bit | of an experiment on how to present dense information effectively. | Some more rationale here: | https://tiarkrompf.github.io/notes/?/octopus-notes/ | musicale wrote: | What's the best way to handle unary "-" and other unary | operators? | | I'd like to be able to write expressions like "-2^-4" or "a cos | b + a sin b". | LeonB wrote: | Octopus notes is an interesting idea. Thanks for the link on | it. | | And I do like to write a lot of little parsers. I worked with | someone once who would somehow transform every business problem | into "we need to write a compiler". Sounds crazy but he | achieved brilliant results and has gone on to great things. | tofflos wrote: | > Enter an expression on the left and see the parse tree change! | | I don't get it. The left input field is uneditable in Firefox, | Edge and Chrome. Navigation is broken across the board. Even | copying and pasting the quote above somehow included most of the | page even though I only selected that one sentence. | | This looks very cool and I would love to try it out! :( | blcArmadillo wrote: | Yeah I think something is broken because at least in chrome | when you try to type in the left box the following exception | keeps being thrown in the console VM463:82 | Uncaught TypeError: Cannot read property 'startContainer' of | undefined at HTMLPreElement.eval (eval at run (eval | at runScriptElement (octopus-2.js:525)), <anonymous>:82:40) | tiarkrompf wrote: | I just pushed a fix for Chrome (apparently it doesn't report | affected selection ranges for input events). Hope it works | now! | tofflos wrote: | Thanks! It works for me now in Edge (Chromium). | LeonB wrote: | it works for me now on Chrome. | prezjordan wrote: | Aside (sorry!) but https://tiarkrompf.github.io/notes is really | cool and it's fun to see this format | jraph wrote: | Wow, thank you very much for this link! | | This is very inspiring. | | First, there is an impressive overlap with what I like and what | I have been doing. The notes on the piano, the parser, on | GraphViz and on React all hit very close if not exactly things | I've thought about or programmed. | | Second, the means of presenting things is awesome too. I get | lost very easily so there are probably ways to improve but this | is very clever and interesting. The creativity and pedagogy | here is incredible at all levels. | | I am very impressed. | | I need to keep an eye on this. I wish there was an RSS feed. | I'll contact him. | tiarkrompf wrote: | Thanks - very glad to hear. RSS should be doable | mamcx wrote: | I like the bit about CPS, but wish it show how do the last part | without having the async machinery in the host language. | | What I have a hard time to get is how implement fully delimited | continuations. All the examples I know reuse the machinery of | the host so is hard to translate (in my case, to Rust) | brundolf wrote: | The only annoying thing is that the scroll bar is in the middle | of my screen. I wish they'd max-widthed the text while leaving | the scrollable element full-width | ofrzeta wrote: | The other annoying thing is the manipulation of the browser | history :) In my opinition it should keet its hands off it. | awinter-py wrote: | those scroll-matched curly braces are something | eyelidlessness wrote: | Considering that something as simple and limited as JSON has been | a source of security vulnerabilities from ambiguities in the | spec, thanks but nope. Declarative definitions of a grammar help | identify those ambiguities... and if the grammar is defined | separately from its interpretation, it opens the door to | invisible ambiguities that can become another vector for | vulnerabilities. | | Unless you have provable implementations (great if you do!), | declarative grammars meaningfully limit points of failure. | ben509 wrote: | > Why simpler is better and why you don't need a parser | generator. | | As far as I can see, this isn't fully answered, unless the claim | is strictly limited to the question of need. | | In my case, I certainly _want_ a parser generator. I 'm working | on a language, and I did a very early version using a hand-rolled | recursive descent parser. | | Then I realized I wanted a syntax that was human friendly, so I | graduated to megaparsec. That worked, it's an elegant way to | describe a parser, but as a language gets complex you have to | work in implicit orderings (magic, frankly) to handle the limits | of recursive descent. | | Finally, I realized I needed a spec and an implementation in sync | with that spec, so I went with BNFC[2], and it even generates | some pretty documentation[1] for me. | | This is the classic debate over using a domain specific language | vs. a general purpose language, so there's no hard answer one way | or the other. For me, the deciding factor was I didn't want to | write, test, etc. my own parser, and I think there being a | canonical way for others to parse my language (in other languages | as well) is helpful. | | [1]: https://tenet-lang.org/spec-syn.html | | [2]: https://bnfc.digitalgrammars.com/ | hardwaregeek wrote: | How do you handle good error messages or error recovery? I've | played around with parser generators but I never figured out | how to do either in a satisfactory fashion. Granted, my hand | written parser doesn't do error recovery very well either. | dfabulich wrote: | Right below the line you quote, the author makes three | arguments: | | > * It's highly instructive, in a way that using a parser | generator is not. To quote Feynman: "What I cannot create, I do | not understand" | | > * It's an important skill: most real-world compilers use | hand-written parsers because they provide more control over | error handling, significant whitespace, etc. | | > * It's not actually difficult! | | The exercises themselves are meant to justify these arguments. | | I think the exercises conclusively demonstrate that #1 and #3 | are true. If you can easily follow these exercises, and if, by | the end, you feel like you've learned something useful, then it | follows that hand-rolling a parser is instructive and not too | difficult. | | I strongly agree with the author that if you're developing a | language for the first time, you shouldn't use a parser | generator at first. You should hand-roll a parser first, and | then, when you see the limitations of your hand-rolled parser, | adopt a parser generator, now with full understanding of what | the generator is doing for you (and not doing for you). | | As for #2, the article demonstrates how to do error handling in | a manner that would be a head-scratcher in Yacc. | | You claim here that it was easier to develop a "human friendly" | syntax with megaparsec than it was to do that in a hand-rolled | recursive-descent parser. That could be true, but that's not my | experience. To be convincing, you'd need to do what the author | did, (even though you say the author didn't): you'd need to | justify your argument with specific examples. | avmich wrote: | > > * It's highly instructive, in a way that using a parser | generator is not. | | Top-down parsers are nice, but even better would be not to | have to write them each time you need to parse something. | Also top-down parsers don't handle (well) some of useful | grammar constructs, and it's tedious to remember always | formulate grammars in a certain way. Next, grammars are for | many people more convenient to work with than actual parser | code. So parser generators have their place. | | > To quote Feynman: "What I cannot create, I do not | understand" | | And this can be true for parser generators. After you've | figured how to do that you could be tempted to never use | external tools, and have a nice jump from, say, arbitrary CFG | to a generalized LR, while controlling all the parts in | between. | | > * It's not actually difficult! | mannykannot wrote: | Once, I wanted a parser, and being the sort of person who | thinks there is value in seeing what other people have done | before, I took a look at parser generators. | | The learning curve seemed long and steep, much more than seemed | necessary for my modest requirements, so I rolled my own. | | I got a parser out of the exercise, but also an appreciation | for why the tools exist, and a new-found motivation for | learning how to use them. | neurotrace wrote: | I don't want to be that guy but I originally thought the CSS was | broken. Harsh black on harsh white, everything squeezed in to the | left third of my screen, and an unused scrollbar on the bottom. | | That being said, I love the information. I don't care for the | workflow that comes with the most popular parser generators and I | inevitably want more control over my parser as I start adding | error handling, etc. It's easier to just write it by hand in the | first place | millstone wrote: | I think the design is great. Simple and easy to read. | neurotrace wrote: | Different strokes for different folks, I suppose. I found it | to be confusing and painful | Supermancho wrote: | I'm not quite sure why they went with the borked "half the | screen is empty" look. Ok, it's easier to read, but that | negative space is bad design, regardless. | contravariant wrote: | There are unused scrollbars everywhere. It's quite an | interesting example of scrollbar blindness [1]. | | [1]:https://news.ycombinator.com/item?id=24293421 | cwp wrote: | If you're writing a compiler, sure, a hand-written parser is | probably better than a generated parser. | | However, I'd also make the related claim that we should be much | more ready to write formal parsers for other use cases. Far too | often developers implement a "pile of regular expressions" where | a parser--even a generated parser-would be a better choice. | samatman wrote: | An alternate title: "Just Shotgun Parse Your Inputs". | | This isn't bad advice _across the board_ , but it is bad advice | for many applications. | | This is how you end up with context-sensitive formats, or ones | where the semantics of a particular particle require running the | parser, or worse. It exposes a rich surface area for exploitation | with adversarial inputs. | | That said, the advantages cited at the front of the article are | all real: error recovery and decent error messages are unsolved | problems with existing parser generators, although ANTLR is | pretty good at it. | | But, especially if you're developing the data format rather than | just implementing it, I strongly suggest using something like | Instaparse which can generate a parser directly from a grammar. | Once this is solidified, it's reasonable to hand-roll something, | but even there, I suggest using a nice parser combinator library, | like hammer or Nom. | | This provides more discipline: as long as you're coloring inside | the lines, you'll end up with a parser which actually implements | your spec. When you start adding functions to provide context, | recover from errors, and provide meaningful feedback in the form | of error messages for unexpected inputs, you'll know what you're | doing: a classic recursive-descent parser can't provide the same | conceptual separation between the parsing logic and the logic to | support ergonomics or one-pass compiling or whatever add-ons | you're putting in there. | specialist wrote: | _" where the semantics of a particular particle require running | the parser, or worse"_ | | Yup. | | Consider the much maligned (apache) httpd.conf and makefile. | Their code was the source of truth for correctness. No matter | horrible terrible their syntax, they would have been sufferable | if they had grammars. | | Bespoke parsing of expressions benefits from having an implicit | grammar, like this OC. But threshold for needing explicit | grammars is pretty low. | | Prime examples are all the data exchange formats. Descriptions | of JSON and CSV are short and naively reasonable, but | insufficient. As we've seen, any ambiguity that can happen will | happen. ___________________________________________________________________ (page generated 2020-10-21 23:00 UTC)