hngopher.com

       [HN Gopher] Parsing Algorithms
       ___________________________________________________________________
        
       Parsing Algorithms
        
       Author : DmitrySoshnikov
       Score  : 265 points
       Date   : 2020-10-26 16:50 UTC (6 hours ago)
        
 (HTM) web link (dmitrysoshnikov.com)
 (TXT) w3m dump (dmitrysoshnikov.com)
        
       | bgorman wrote:
       | Thanks for building this, I completed the "make a lisp" project,
       | but the parsing stage was greatly simplified so it is nice to
       | have a resource to learn more about the parsing/lexing stage.
        
         | DmitrySoshnikov wrote:
         | Absolutely! S-expression (used in Scheme, Lisp, etc) is a great
         | AST-based syntax to start building an interpreter right away.
         | But for fully ergonomic language you would need a parser for a
         | more complex syntax.
        
       | MaxBarraclough wrote:
       | Seems the place to ask: what do folks make of the _IELR_
       | algorithm? [0] Apparently GNU Bison supports it these days. [1]
       | 
       | [0] https://cs.stackexchange.com/a/99463/
       | 
       | [1] https://en.wikipedia.org/wiki/GNU_Bison
        
       | redwoolf wrote:
       | This is great! I'm working on designing a language right now and
       | I'm just getting to the point where I have to parse the AST.
       | Looking forward to taking this course, I just purchased it on
       | Udemy.
        
         | DmitrySoshnikov wrote:
         | Congrats, and hope this makes building of your parser easy and
         | fun!
         | 
         | This course covers a lot of parsing theory as well, and if
         | you're interested in pure practical (manual) parser, there will
         | be also "Building a Recursive descent parser from scratch",
         | which is mainly coding class and is an extension for the
         | "Parsing Algorithms".
        
       | waynesonfire wrote:
       | super interesting subject; can anyone recommend resources for
       | self learners?
        
         | egonschiele wrote:
         | Isn't this for self-learners?
         | 
         | > The course starts now and never ends! It is a completely
         | self-paced online course - you decide when you start and when
         | you finish.
        
       | andreygrehov wrote:
       | Very well done! I like video animations. What software do you use
       | to make them?
        
         | DmitrySoshnikov wrote:
         | Thanks for the feedback, glad you like it. For production I use
         | combination of software: Keynote, Notability, Good notes,
         | Camtasia, and live editing on iPad.
        
       | rschachte wrote:
       | Is the site down?
        
         | DmitrySoshnikov wrote:
         | Should be up by now; seems auto-DDOS'ed, lol
        
       | ChuckMcM wrote:
       | FWIW, parsing and lexical analysis was the CS class I have used
       | most thoroughly in my career. Sure data structures is probably
       | the most _often_ used, but other than hash tables and b-trees,
       | not much of that class was useful. But lexing and parsing? Seems
       | like every other project benefited by it either in handling
       | configuration files, or log /sensor data, or some other need to
       | convert what was human readable into machine manipulable.
       | 
       | That said, I really recommend the crafting interpreters work. It
       | covers all the bases pretty solidly. If you want more depth and
       | theory then get the dragon book (Ullman on compiler design) and
       | read it afterwards :-)
        
         | DmitrySoshnikov wrote:
         | Yes, in the "Essentials of Interpretation" class (aka "Building
         | an Interpreter from scratch" we focus exactly on runtime
         | semantics, and evaluating the language. The S-expression allows
         | greatly simplifying, focus on runtime specifics themselves,
         | skipping parsing stage altogether.
         | 
         | In "Essentials of Parsing" class (aka "Parsing Algorithms") we
         | shift exactly to the syntax, and understanding the parsing
         | process from within -- this in general may have nothing to do
         | with runtime -- for the same exact syntax you may have
         | different interpreters or VMs (even with different semantics).
        
       | cjauvin wrote:
       | For an alternative take on a related topic, this is really a
       | fantastically well-written and practical (free) book:
       | http://craftinginterpreters.com
        
         | mssundaram wrote:
         | Not free but also very good is https://interpreterbook.com/
        
           | bensonalec wrote:
           | Both of Thorsten Ball's books in that series are phenomenal,
           | could not recommend more. He's also got a great discussion of
           | it on the Go Time podcast, https://changelog.com/gotime/28
           | and https://changelog.com/gotime/107.
        
             | mssundaram wrote:
             | Thanks for sharing the podcast links, I hadn't known about
             | that
        
         | DmitrySoshnikov wrote:
         | Yeah, this class is specifically on parsing pipeline and
         | syntactic analysis.
         | 
         | For runtime semantics (Interpreters and Virtual Machines) you
         | can address "Essentials of Interpretation" aka "Building an
         | Interpreter from scratch".
        
       | davidkellis wrote:
       | I'd like to see something like this for practitioners. I kind of
       | have a feel for what's out there, but I _don 't_ know of anything
       | that is: 1. pleasant to use 2. simple 3. scannerless 4. supports
       | left recursion that produces a left-associative parse
        
         | DmitrySoshnikov wrote:
         | Yes, we use LALR(1) parsing mode to build the actual parser,
         | and it exactly supports Left recursive grammars (which are much
         | more elegant than LL). We also don't focus much on scanner
         | (tokenizer) since this is a topic of Regular expressions and
         | Finite automata which we discuss in detail in the separate
         | class "Building a RegExp machine".
        
         | bmn__ wrote:
         | Marpa meets 2.a) 2.b) 3.
         | 
         | 1. is subjective.
        
       | humanlion87 wrote:
       | Apologies for the tangential question. I am currently working on
       | a codebase that involves trying to understand code that parses
       | and input string (that has various search parameters and values)
       | and generates an appropriate SQL query to search the associated
       | database. I am struggling to build a mental model of how this
       | parsing works. Will going through one of the resources mentioned
       | in this thread help with my understanding?
        
         | DmitrySoshnikov wrote:
         | Yes, if you need to parse that input string to generates an
         | appropriate SQL query, you would need to have a small DSL
         | (domain-specific language) for that "string", whatever it
         | contains. If the string contains SQL-like syntax, e.g. "SELECT
         | name from users", then yes, it would be easy to build a grammar
         | for this.
        
       | default-kramer wrote:
       | Does anyone know if there are any good resources on "tolerant
       | parsing," if that is the correct terminology? For example, when I
       | write C# in Visual Studio, the IDE remains amazingly helpful even
       | when the code is incomplete and would be rejected by a
       | traditional parser. I'd guess that Microsoft simply has the
       | budget to have the VS/C# dev teams grind out hundreds or
       | thousands of special cases that are specific to C#... but I would
       | love to learn that there are some fundamental techniques that
       | would fit into a few credits worth of CS education.
        
         | DmitrySoshnikov wrote:
         | Yes, this is called "parse error recovery" and there are
         | multiple techniques for this. In fact, most of the production
         | parsers support this mode. E.g. when you try executing a C++ or
         | Java file, it shows you all the errors at once instead of
         | failing on first parse error. The way it's achieved is by
         | constructing a "dummy" AST node (caught up to some delimiter,
         | e.g. semicolon in statements) and continue the parsing process
         | as there would be no any error.
        
         | mrits wrote:
         | https://github.com/rust-analyzer/rust-analyzer is a good
         | resource. I went there because I was surprised how good the
         | project worked. I found some really great docs and learned
         | quite a bit from the code.
        
         | derriz wrote:
         | It's fairly straightforward to offer some functionality in this
         | area with hand-written recursive descent parsers.
         | 
         | For example, a language with a C-like syntax, you'll often be
         | parsing a sequence of statements separated by semi-colons (a
         | block). If a statement fails to parse, you can just consume
         | tokens until you hit the next semi-colon and then try to
         | continue to parse statements from there.
         | 
         | Fairly crude approaches like this are easy to implement (at
         | least with recursive-descent) but can be surprisingly
         | effective. It's easy to construct counter examples where an
         | approach like this will get it wrong but in practice it's
         | hugely more useful to the poor user than just abandoning the
         | parse completely.
        
       | wtetzner wrote:
       | Any chance of also including GLL (generalized LL)?
       | 
       | I found the paper (http://dotat.at/tmp/gll.pdf) quite hard to
       | follow, and haven't been able to find a good explanation anywhere
       | else.
        
         | DonaldPShimoda wrote:
         | Just generalized parsing algorithms in general would be good to
         | include, I think. It looks like the course only plans to cover
         | basic LL/LR, which are admittedly the most commonly used
         | parsers but more would be interesting. A fun one to include
         | might be Might's "Parsing with Derivatives", which is
         | algorithmically novel (though not very performant). I think
         | there was a recent innovation on this: "Parsing with Zippers"
         | at ICFP this year.
         | 
         | Hard agree that GLL is hard to follow. I've read the paper a
         | number of times and struggle with it every time haha.
        
           | thesz wrote:
           | Parsing with derivatives is not all that slow:
           | https://www.microsoft.com/en-
           | us/research/uploads/prod/2019/0...
           | 
           | The paper shows that it can compete with re2 (by google).
        
             | DonaldPShimoda wrote:
             | This is not what I'm talking about.
             | 
             | "Parsing with Derivatives" is the title of a 2011 paper by
             | Matt Might, David Darais, and Daniel Spiewak that
             | generalizes the Brzozowski derivative from regular
             | expressions to context-free grammars. In the present
             | discussion, I assumed the context to be about CFGs instead
             | of REs because that is most often what people are referring
             | to when we talk about parsers in programming languages,
             | though I admit that I could have been more explicit about
             | this.
             | 
             | What you linked is an improvement of the Brzozowski
             | derivative, but it does not constitute a parser for CFGs.
        
               | thesz wrote:
               | According to "Parsing Techniques: A Practical Guide" [1],
               | it is quite common for computer languages to have most of
               | the grammar in the regular grammars class and some parts
               | to be, actually, context-free.
               | 
               | [1] https://dickgrune.com/Books/PTAPG_1st_Edition/BookBod
               | y.pdf
               | 
               | For example, consider addition and subtraction in most
               | grammars. They can be expressed as "summation ::= factor
               | ((PLUS | MINUS) factor) * " and factor can be similarly
               | defined as "factor ::= multiplicand ((MUL | DIV)
               | multiplicand) * ".
               | 
               | Authors of PTAPG note that these regularities can be
               | exploited for speed. And I think "parsing with
               | derivatives" techniques can be used for speedy parsing
               | too.
        
           | giraj wrote:
           | Thanks for mentioning "Parsing with Zippers"! I read "Parsing
           | with Derivatives" last week and wondered if that could be
           | taken further. The paper can be found here:
           | https://dl.acm.org/doi/10.1145/3408990
        
             | DonaldPShimoda wrote:
             | It's always fun to spread new research around the
             | community! I remember reading PwD and thinking it was just
             | _so cool_ as an algorithm, though the performance concerns
             | were a bit of a turn-off. Still, I always like to talk to
             | people about it because I just think it 's such an elegant
             | approach to the problem.
             | 
             | As for PwZ, I know one of the authors so maybe I'm a little
             | biased, but I also thought his ICFP talk was quite good.
             | It's here: https://youtu.be/fakSKvP9yaM?t=6180
        
         | samatman wrote:
         | I found the code for Instaparse (relatively) easy to follow.
         | 
         | I had considered leaving a comment here like "hey could you
         | cover combinators and PEGs?", but after thinking it over, it's
         | important to limit the scope for a class like this.
         | 
         | It would be pretty great to offer a "201" edition, covering
         | ALL*, GLR, GLL, combinators/PEGs, Earley, parsing-with-
         | derivatives, Marpa, and anything else I might have forgotten:
         | basically a survey of modern parsing algorithms, which frankly,
         | LR and LL are not.
         | 
         | But for your own learning, I bet you could take this course,
         | and then spend some time with Instaparse and the GLL paper, and
         | walk away with a solid understanding of GLL in practice.
        
           | DmitrySoshnikov wrote:
           | Great point on combinators, PEG, and GLL -- this potentially
           | would be covered in 201 as suggested, since it's good having
           | a foundation of the LL/LR, and then gradually moving to
           | combinators if needed. LALR(1) covers a pretty wide range of
           | the most practical languages.
        
             | samatman wrote:
             | To a significant degree, the arrow of causality runs
             | LALR(1) -> practical languages, not the other direction!
             | 
             | The languages and formats we use have been heavily shaped
             | by the practical parsing algorithms of the 20th century. An
             | example: you can't have a struct field called "while" in C,
             | because once the lexer declares a token to be a keyword,
             | that's that.
        
               | jcranmer wrote:
               | Contextual keywords are a growing thing in modern
               | languages. Modern compilers don't tend to have strictly
               | separated lexers and parsers, but instead use a combined
               | lexer/parser model that feeds information back and forth
               | between them. If your language is designed such that you
               | aren't often in multiple potential parse states, then
               | it's easy to feed into the lexer "get me the next token,
               | and by the way, expect function attributes to be keywords
               | right now."
               | 
               | Note that the requirement to not be in multiple potential
               | parse states also tends to boil down to "build a language
               | that's usually LL(1) or LALR(1)."
        
               | DmitrySoshnikov wrote:
               | Yes, to some degree -- Syntax tool normally support lexer
               | states, and the same "while" token may mean a keyword or
               | the property/field name of a struct. You can find more
               | details of the lexer states in the docs.
        
               | tgv wrote:
               | TBH, that's not a problem of LALR(1), or any of the
               | other, more old fashioned methods. I've written an LL(1)
               | parser generator that (generates a function that) parses
               | modern awk, without semicolons, but with operator-less
               | concatenation, and that can deal with tokens that can be
               | keywords and identifiers (which is also needed for e.g.
               | FORTRAN). As long as your language is deterministic, it
               | can be expressed as an LR grammar, although legibility
               | might suffer.
        
           | mehrdadn wrote:
           | What is GRR? Did you mean GLR?
        
             | samatman wrote:
             | Yep, that was a typo, fixed it
        
               | mehrdadn wrote:
               | Ah okay. Note that GLR isn't exactly modern (1974).
               | However, I would also remove "modern" as a requirement
               | here... what really matters is how good the algorithm is,
               | not how old it is. "Modern" algorithms easily end up
               | being less powerful than the old ones... they just end up
               | becoming popular due to other factors, e.g. simplicity.
        
               | samatman wrote:
               | "Advanced" would have been more expressive, agreed.
               | 
               | In the sense that LALR and LL with limited lookahead are
               | the introductory parsing algorithms which everyone knows,
               | and there's a reason for that.
        
         | DmitrySoshnikov wrote:
         | Yes, GLL is a good algorithm and I potentially going to publish
         | it separately as a single public video.
        
           | wtetzner wrote:
           | That would be fantastic!
        
       ___________________________________________________________________
       (page generated 2020-10-26 23:00 UTC)