[HN Gopher] Parsing Algorithms ___________________________________________________________________ Parsing Algorithms Author : DmitrySoshnikov Score : 265 points Date : 2020-10-26 16:50 UTC (6 hours ago) (HTM) web link (dmitrysoshnikov.com) (TXT) w3m dump (dmitrysoshnikov.com) | bgorman wrote: | Thanks for building this, I completed the "make a lisp" project, | but the parsing stage was greatly simplified so it is nice to | have a resource to learn more about the parsing/lexing stage. | DmitrySoshnikov wrote: | Absolutely! S-expression (used in Scheme, Lisp, etc) is a great | AST-based syntax to start building an interpreter right away. | But for fully ergonomic language you would need a parser for a | more complex syntax. | MaxBarraclough wrote: | Seems the place to ask: what do folks make of the _IELR_ | algorithm? [0] Apparently GNU Bison supports it these days. [1] | | [0] https://cs.stackexchange.com/a/99463/ | | [1] https://en.wikipedia.org/wiki/GNU_Bison | redwoolf wrote: | This is great! I'm working on designing a language right now and | I'm just getting to the point where I have to parse the AST. | Looking forward to taking this course, I just purchased it on | Udemy. | DmitrySoshnikov wrote: | Congrats, and hope this makes building of your parser easy and | fun! | | This course covers a lot of parsing theory as well, and if | you're interested in pure practical (manual) parser, there will | be also "Building a Recursive descent parser from scratch", | which is mainly coding class and is an extension for the | "Parsing Algorithms". | waynesonfire wrote: | super interesting subject; can anyone recommend resources for | self learners? | egonschiele wrote: | Isn't this for self-learners? | | > The course starts now and never ends! It is a completely | self-paced online course - you decide when you start and when | you finish. | andreygrehov wrote: | Very well done! I like video animations. What software do you use | to make them? | DmitrySoshnikov wrote: | Thanks for the feedback, glad you like it. For production I use | combination of software: Keynote, Notability, Good notes, | Camtasia, and live editing on iPad. | rschachte wrote: | Is the site down? | DmitrySoshnikov wrote: | Should be up by now; seems auto-DDOS'ed, lol | ChuckMcM wrote: | FWIW, parsing and lexical analysis was the CS class I have used | most thoroughly in my career. Sure data structures is probably | the most _often_ used, but other than hash tables and b-trees, | not much of that class was useful. But lexing and parsing? Seems | like every other project benefited by it either in handling | configuration files, or log /sensor data, or some other need to | convert what was human readable into machine manipulable. | | That said, I really recommend the crafting interpreters work. It | covers all the bases pretty solidly. If you want more depth and | theory then get the dragon book (Ullman on compiler design) and | read it afterwards :-) | DmitrySoshnikov wrote: | Yes, in the "Essentials of Interpretation" class (aka "Building | an Interpreter from scratch" we focus exactly on runtime | semantics, and evaluating the language. The S-expression allows | greatly simplifying, focus on runtime specifics themselves, | skipping parsing stage altogether. | | In "Essentials of Parsing" class (aka "Parsing Algorithms") we | shift exactly to the syntax, and understanding the parsing | process from within -- this in general may have nothing to do | with runtime -- for the same exact syntax you may have | different interpreters or VMs (even with different semantics). | cjauvin wrote: | For an alternative take on a related topic, this is really a | fantastically well-written and practical (free) book: | http://craftinginterpreters.com | mssundaram wrote: | Not free but also very good is https://interpreterbook.com/ | bensonalec wrote: | Both of Thorsten Ball's books in that series are phenomenal, | could not recommend more. He's also got a great discussion of | it on the Go Time podcast, https://changelog.com/gotime/28 | and https://changelog.com/gotime/107. | mssundaram wrote: | Thanks for sharing the podcast links, I hadn't known about | that | DmitrySoshnikov wrote: | Yeah, this class is specifically on parsing pipeline and | syntactic analysis. | | For runtime semantics (Interpreters and Virtual Machines) you | can address "Essentials of Interpretation" aka "Building an | Interpreter from scratch". | davidkellis wrote: | I'd like to see something like this for practitioners. I kind of | have a feel for what's out there, but I _don 't_ know of anything | that is: 1. pleasant to use 2. simple 3. scannerless 4. supports | left recursion that produces a left-associative parse | DmitrySoshnikov wrote: | Yes, we use LALR(1) parsing mode to build the actual parser, | and it exactly supports Left recursive grammars (which are much | more elegant than LL). We also don't focus much on scanner | (tokenizer) since this is a topic of Regular expressions and | Finite automata which we discuss in detail in the separate | class "Building a RegExp machine". | bmn__ wrote: | Marpa meets 2.a) 2.b) 3. | | 1. is subjective. | humanlion87 wrote: | Apologies for the tangential question. I am currently working on | a codebase that involves trying to understand code that parses | and input string (that has various search parameters and values) | and generates an appropriate SQL query to search the associated | database. I am struggling to build a mental model of how this | parsing works. Will going through one of the resources mentioned | in this thread help with my understanding? | DmitrySoshnikov wrote: | Yes, if you need to parse that input string to generates an | appropriate SQL query, you would need to have a small DSL | (domain-specific language) for that "string", whatever it | contains. If the string contains SQL-like syntax, e.g. "SELECT | name from users", then yes, it would be easy to build a grammar | for this. | default-kramer wrote: | Does anyone know if there are any good resources on "tolerant | parsing," if that is the correct terminology? For example, when I | write C# in Visual Studio, the IDE remains amazingly helpful even | when the code is incomplete and would be rejected by a | traditional parser. I'd guess that Microsoft simply has the | budget to have the VS/C# dev teams grind out hundreds or | thousands of special cases that are specific to C#... but I would | love to learn that there are some fundamental techniques that | would fit into a few credits worth of CS education. | DmitrySoshnikov wrote: | Yes, this is called "parse error recovery" and there are | multiple techniques for this. In fact, most of the production | parsers support this mode. E.g. when you try executing a C++ or | Java file, it shows you all the errors at once instead of | failing on first parse error. The way it's achieved is by | constructing a "dummy" AST node (caught up to some delimiter, | e.g. semicolon in statements) and continue the parsing process | as there would be no any error. | mrits wrote: | https://github.com/rust-analyzer/rust-analyzer is a good | resource. I went there because I was surprised how good the | project worked. I found some really great docs and learned | quite a bit from the code. | derriz wrote: | It's fairly straightforward to offer some functionality in this | area with hand-written recursive descent parsers. | | For example, a language with a C-like syntax, you'll often be | parsing a sequence of statements separated by semi-colons (a | block). If a statement fails to parse, you can just consume | tokens until you hit the next semi-colon and then try to | continue to parse statements from there. | | Fairly crude approaches like this are easy to implement (at | least with recursive-descent) but can be surprisingly | effective. It's easy to construct counter examples where an | approach like this will get it wrong but in practice it's | hugely more useful to the poor user than just abandoning the | parse completely. | wtetzner wrote: | Any chance of also including GLL (generalized LL)? | | I found the paper (http://dotat.at/tmp/gll.pdf) quite hard to | follow, and haven't been able to find a good explanation anywhere | else. | DonaldPShimoda wrote: | Just generalized parsing algorithms in general would be good to | include, I think. It looks like the course only plans to cover | basic LL/LR, which are admittedly the most commonly used | parsers but more would be interesting. A fun one to include | might be Might's "Parsing with Derivatives", which is | algorithmically novel (though not very performant). I think | there was a recent innovation on this: "Parsing with Zippers" | at ICFP this year. | | Hard agree that GLL is hard to follow. I've read the paper a | number of times and struggle with it every time haha. | thesz wrote: | Parsing with derivatives is not all that slow: | https://www.microsoft.com/en- | us/research/uploads/prod/2019/0... | | The paper shows that it can compete with re2 (by google). | DonaldPShimoda wrote: | This is not what I'm talking about. | | "Parsing with Derivatives" is the title of a 2011 paper by | Matt Might, David Darais, and Daniel Spiewak that | generalizes the Brzozowski derivative from regular | expressions to context-free grammars. In the present | discussion, I assumed the context to be about CFGs instead | of REs because that is most often what people are referring | to when we talk about parsers in programming languages, | though I admit that I could have been more explicit about | this. | | What you linked is an improvement of the Brzozowski | derivative, but it does not constitute a parser for CFGs. | thesz wrote: | According to "Parsing Techniques: A Practical Guide" [1], | it is quite common for computer languages to have most of | the grammar in the regular grammars class and some parts | to be, actually, context-free. | | [1] https://dickgrune.com/Books/PTAPG_1st_Edition/BookBod | y.pdf | | For example, consider addition and subtraction in most | grammars. They can be expressed as "summation ::= factor | ((PLUS | MINUS) factor) * " and factor can be similarly | defined as "factor ::= multiplicand ((MUL | DIV) | multiplicand) * ". | | Authors of PTAPG note that these regularities can be | exploited for speed. And I think "parsing with | derivatives" techniques can be used for speedy parsing | too. | giraj wrote: | Thanks for mentioning "Parsing with Zippers"! I read "Parsing | with Derivatives" last week and wondered if that could be | taken further. The paper can be found here: | https://dl.acm.org/doi/10.1145/3408990 | DonaldPShimoda wrote: | It's always fun to spread new research around the | community! I remember reading PwD and thinking it was just | _so cool_ as an algorithm, though the performance concerns | were a bit of a turn-off. Still, I always like to talk to | people about it because I just think it 's such an elegant | approach to the problem. | | As for PwZ, I know one of the authors so maybe I'm a little | biased, but I also thought his ICFP talk was quite good. | It's here: https://youtu.be/fakSKvP9yaM?t=6180 | samatman wrote: | I found the code for Instaparse (relatively) easy to follow. | | I had considered leaving a comment here like "hey could you | cover combinators and PEGs?", but after thinking it over, it's | important to limit the scope for a class like this. | | It would be pretty great to offer a "201" edition, covering | ALL*, GLR, GLL, combinators/PEGs, Earley, parsing-with- | derivatives, Marpa, and anything else I might have forgotten: | basically a survey of modern parsing algorithms, which frankly, | LR and LL are not. | | But for your own learning, I bet you could take this course, | and then spend some time with Instaparse and the GLL paper, and | walk away with a solid understanding of GLL in practice. | DmitrySoshnikov wrote: | Great point on combinators, PEG, and GLL -- this potentially | would be covered in 201 as suggested, since it's good having | a foundation of the LL/LR, and then gradually moving to | combinators if needed. LALR(1) covers a pretty wide range of | the most practical languages. | samatman wrote: | To a significant degree, the arrow of causality runs | LALR(1) -> practical languages, not the other direction! | | The languages and formats we use have been heavily shaped | by the practical parsing algorithms of the 20th century. An | example: you can't have a struct field called "while" in C, | because once the lexer declares a token to be a keyword, | that's that. | jcranmer wrote: | Contextual keywords are a growing thing in modern | languages. Modern compilers don't tend to have strictly | separated lexers and parsers, but instead use a combined | lexer/parser model that feeds information back and forth | between them. If your language is designed such that you | aren't often in multiple potential parse states, then | it's easy to feed into the lexer "get me the next token, | and by the way, expect function attributes to be keywords | right now." | | Note that the requirement to not be in multiple potential | parse states also tends to boil down to "build a language | that's usually LL(1) or LALR(1)." | DmitrySoshnikov wrote: | Yes, to some degree -- Syntax tool normally support lexer | states, and the same "while" token may mean a keyword or | the property/field name of a struct. You can find more | details of the lexer states in the docs. | tgv wrote: | TBH, that's not a problem of LALR(1), or any of the | other, more old fashioned methods. I've written an LL(1) | parser generator that (generates a function that) parses | modern awk, without semicolons, but with operator-less | concatenation, and that can deal with tokens that can be | keywords and identifiers (which is also needed for e.g. | FORTRAN). As long as your language is deterministic, it | can be expressed as an LR grammar, although legibility | might suffer. | mehrdadn wrote: | What is GRR? Did you mean GLR? | samatman wrote: | Yep, that was a typo, fixed it | mehrdadn wrote: | Ah okay. Note that GLR isn't exactly modern (1974). | However, I would also remove "modern" as a requirement | here... what really matters is how good the algorithm is, | not how old it is. "Modern" algorithms easily end up | being less powerful than the old ones... they just end up | becoming popular due to other factors, e.g. simplicity. | samatman wrote: | "Advanced" would have been more expressive, agreed. | | In the sense that LALR and LL with limited lookahead are | the introductory parsing algorithms which everyone knows, | and there's a reason for that. | DmitrySoshnikov wrote: | Yes, GLL is a good algorithm and I potentially going to publish | it separately as a single public video. | wtetzner wrote: | That would be fantastic! ___________________________________________________________________ (page generated 2020-10-26 23:00 UTC)