[HN Gopher] Resources for Amateur Compiler Writers
       ___________________________________________________________________
        
       Resources for Amateur Compiler Writers
        
       Author : kilodeca
       Score  : 285 points
       Date   : 2021-04-24 14:58 UTC (8 hours ago)
        
 (HTM) web link (c9x.me)
 (TXT) w3m dump (c9x.me)
        
       | chandlore wrote:
       | Latest versions of the ABI specifications linked in the _Machine
       | Specific_ section
       | 
       | ARM: https://github.com/ARM-software/abi-aa/releases
       | 
       | x86-64: https://gitlab.com/x86-psABIs/x86-64-ABI (go to most
       | recent CI job and download artifacts for a compiled PDF)
        
       | hardwaregeek wrote:
       | This is definitely a backend, machine code focused list. If
       | you're writing a compiler to WebAssembly or even the JVM you're
       | probably not gonna need a lot of these resources.
       | 
       | Indeed I'd argue that these lists are more for an intermediate
       | compiler writer. Beginner compiler writers don't need to know
       | register allocation. They need to understand how to think about
       | language translation, how to write an intermediate
       | representation, how to learn a target language.
       | 
       | I've been toying with the idea of writing a post that goes
        | through the thought process of translation. Specifically, how you
        | build states and invariants. When I write code generation, I
       | spend time thinking about the different states the stack could be
       | in, or what invariant I need to keep across this expression (do I
       | need to save a register or global?).
       | 
       | One simple but important example of this is realizing that given
       | a proper, sound type checker, your code generator should almost
       | never give errors. By the time you reach code generation you
       | should have proven enough invariants that code generation cannot
       | fail. This isn't always true but I've found it to be the case.
        
         | seanmcdirmid wrote:
          | That reminds me of the Dragon Book, which goes heavily into
          | VAX assembly and optimizing for CISC machines. Probably not
          | the best resource for beginners or amateurs these days, but
          | its authors won a Turing Award at least.
          | 
          | The problem is that what it means to write a compiler is
          | pretty broad: some amateur projects focus on backend work,
          | some on the front end, many in between.
        
           | coryrc wrote:
            | Also, tremendous amounts just on parsing C. I found Lisp In
            | Small Pieces far more interesting.
        
         | enos_feedler wrote:
          | You can certainly take any path into compiler writing. I don't
          | know anything about ASTs or compiler front ends, but I ended
          | up writing register allocators and constant folding for an
          | nvidia GPU JIT.
        
         | jstanley wrote:
         | > Indeed I'd argue that these lists are more for an
         | intermediate compiler writer. Beginner compiler writers don't
         | need to know register allocation.
         | 
         | I'm not disagreeing with you, but note that the title does not
         | say it's for beginners, it says it's for amateurs, which is
         | just anyone who does it for fun rather than money.
        
         | orthoxerox wrote:
         | > Beginner compiler writers don't need to know register
         | allocation. They need to understand how to think about language
         | translation, how to write an intermediate representation, how
         | to learn a target language.
         | 
          | I don't think people who want to write a compiler want to end
          | up with helloworld.c instead of helloworld.exe. At least I
          | don't.
        
           | fooker wrote:
            | Ideally, you'd emit LLVM IR or some similar bytecode or
            | assembly. Then you'd use llc/opt or an appropriate assembler
            | to get an executable.
           | 
           | Trying to emit machine code isn't too interesting when you're
           | writing your first compiler.
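            | 
            | For instance, a minimal sketch of that route in Python (the
            | file names are made up; llc and cc invocations are the
            | standard ones):
            | 
            |     # Emit textual LLVM IR for a program that returns 42.
            |     ir = """\
            |     define i32 @main() {
            |     entry:
            |       ret i32 42
            |     }
            |     """
            | 
            |     with open("out.ll", "w") as f:
            |         f.write(ir)
            | 
            |     # Then, outside this script:
            |     #   llc out.ll -o out.s   (IR -> native assembly)
            |     #   cc out.s -o out       (assemble and link)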
        
             | CalChris wrote:
             | Similarly, trying to parse C++ isn't too interesting when
             | you're writing your first backend.
        
           | hardwaregeek wrote:
            | At least for me, compiling to a simpler target made learning
            | the ideas easier. Compiling to raw assembly does give you
            | more nerd credit but it also makes it harder to get started.
        
           | ajkjk wrote:
           | The opposite, really. I started learning about compilers
           | because I wanted to implement my own syntactic ideas. Don't
           | care at all how it runs.
        
           | bootwoot wrote:
            | The creator of C++ would disagree, as C++ was devised as a
            | compile-to-C language.
        
       | jhgb wrote:
        | Perhaps Cooper and Torczon's "Engineering a Compiler (2nd
        | Edition)" should be among the entries in the "Books" category.
        
       | estebank wrote:
        | A lot of ink is spilled on the details of how compilers take
        | text and turn it into fast executables, but I believe that the
        | user experience of compilers is as important a subject as those,
        | and there's very little information about it in the literature,
        | outside of a handful of papers and the design documents of
        | existing compilers that do focus on this.
        
         | dmitriid wrote:
         | No idea why you are being downvoted, but it's true.
         | 
         | Compiler UX is often horrible, but the new breed of languages
         | tries to rectify parts of it:
         | 
          | - non-cryptic error messages. Show _exactly_ what failed, where
          | it failed, and how to solve it.
         | 
         | - compiler-as-a-service. Tool makers will love you.
         | 
         | - compilation speed. If I need to wait 2 minutes for a 10 LOC
         | project to compile, your compiler is doing something wrong.
         | 
         | - the compiler should be the only tool you need. If you rely on
         | 15 different tools just to compile a project, you're doing
          | something wrong.
        
           | estebank wrote:
           | As you allude to, treating a compiler as an isolated batch
           | process is an outdated strategy. Any compiler started today
           | should be designed such that it can be used in an IDE from
           | the start, even if you don't implement it in that way until
           | after you've reached a certain level of maturity.
           | 
            | Error messages in particular are a very expansive topic,
            | because a compiler might have the following components:
            | 
            |     Lexer
            |     Parser
            |     Macro Expansion
            |     Name Resolution
            |     Lowering
            |     Type Checking
            |     Privacy Checking
            |     Lifetime Analysis
            |     Code Generation
            |     Linking
            | 
            | and _all of them_ can produce errors. If you introduce error
            | recovery in your parser, then you better make sure you're
            | carrying that information forward to name resolution and type
            | checking. If you introduce an error type placeholder, you'd
            | better account for it during the rest of your analysis to
            | avoid complaining about non-existent methods on something
            | that wasn't resolved in the first place. Error deduplication.
            | Exploring the grammar space to identify things that a human
            | may attempt, that aren't allowed for good reason, but that
            | will not blow up until some quasi-random stage. Now that I
            | think about it, you could write a _whole book_ about these
            | topics.
        
             | tomcam wrote:
             | What is lowering?
        
               | hctaw wrote:
               | Some languages require higher level internal
               | representations ("IR"s) to be "lowered" to lower level
               | representations before code generation - in other words
               | like high level semantics or syntactic sugar and
               | transform into simpler constructs to compile.
               | 
               | For example, a language could support special iterator
               | semantics that will be "lowered" into a simple for loop.
               | Or a for loop itself may be lowered into a do-while loop,
               | depending on the design of the lowered IR.
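                | 
                | A toy sketch of that in Python (the node names are
                | hypothetical): lowering a C-style for loop into the
                | while form the rest of the compiler understands.
                | 
                |     from dataclasses import dataclass, field
                | 
                |     @dataclass
                |     class Assign:        # x = <expr>
                |         target: str
                |         expr: str
                | 
                |     @dataclass
                |     class While:         # while <cond>: <body>
                |         cond: str
                |         body: list = field(default_factory=list)
                | 
                |     @dataclass
                |     class For:           # for (<init>; <cond>; <step>)
                |         init: Assign
                |         cond: str
                |         step: Assign
                |         body: list = field(default_factory=list)
                | 
                |     def lower_for(node):
                |         # for (init; cond; step) body
                |         #   ==>  init; while (cond) { body; step }
                |         return [node.init,
                |                 While(node.cond,
                |                       node.body + [node.step])]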
        
           | hctaw wrote:
           | I downvoted it because I think it's a meme that ignores where
           | contemporary compiler engineering is happening. We don't even
           | realize they exist anymore - V8, Graal, Roslyn, etc are all
           | optimizing JIT compilers running in the background that can
           | compile multiple languages on the fly. And they do a great
           | job.
           | 
            | Not everything is a C++ compiler; C++ is inherently
            | difficult to compile correctly, let alone quickly.
           | 
           | I think the future of compilers is that everything will have
           | an incremental parser and JIT compiler with optional AOT
           | compilation for release builds. Hopefully LSP and DAP will
           | take over the world of tooling integration.
           | 
           | There are other fundamental mistakes to look at than error
           | messages and performance, like pretending that build systems
           | don't exist and that linking multiple compilation units isn't
           | something the compiler can do. Compiling code is getting much
           | easier, but building it is much harder. Compilers have the
           | most information and internal structure to handle all our
           | build problems, but don't do it.
        
           | turminal wrote:
           | > - non-crpytic error messages. Show exactly what failed,
           | where it failed, how to solve it.
           | 
           | I'm working on a compiler in my free time and I agree with
           | you, but in defense of compilers that suck at this: error
           | reporting is really, really hard to do right.
        
       | Rochus wrote:
       | There is a more recent edition of "Compilers: Principles,
       | Techniques, and Tools" worth considering:
       | https://www.amazon.com/Compilers-Principles-Techniques-Tools....
        
       | ChicagoDave wrote:
       | I've toyed with compiler stuff for years. My current idea is a
       | fluent/piping language. Is there any reference that would help
       | with that?
        
       | [deleted]
        
       | alanafarias22 wrote:
        | Loved this post
        
       | alanafarias22 wrote:
        | These comments help a lot
        
       | andromeduck wrote:
       | Wish there were more resources about or tools for bidirectional
       | parsing.
        
       | hasitseth wrote:
        | Allen Holub's 'Compiler Design in C' is not mentioned among the
        | books. Holub's book had well-written code. I learned a lot
        | from his code.
        
         | elvis70 wrote:
         | And a free copy of this book in PDF has been made available by
         | the author: https://holub.com/compiler/
        
       | failwhaleshark wrote:
       | ProTips after writing numerous kinds of compilers in university:
       | 
       | Recursive descent with backtracking is much easier to construct
       | and maintain for diagnostics than bison/flex.
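        | 
        | A minimal flavor of the backtracking part in Python (a hedged
        | sketch over a token list of (kind, value) pairs, not a full
        | parser):
        | 
        |     class Parser:
        |         def __init__(self, tokens):
        |             self.tokens = tokens
        |             self.pos = 0
        | 
        |         def eat(self, kind):
        |             # Consume one token of the given kind, else None.
        |             if (self.pos < len(self.tokens)
        |                     and self.tokens[self.pos][0] == kind):
        |                 tok = self.tokens[self.pos]
        |                 self.pos += 1
        |                 return tok
        |             return None
        | 
        |         def attempt(self, rule):
        |             # Try a rule; on failure, rewind to where we
        |             # started so another alternative can be tried.
        |             saved = self.pos
        |             result = rule()
        |             if result is None:
        |                 self.pos = saved
        |             return result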
       | 
       | Derivative parsing is very interesting.
        
       | ahelwer wrote:
       | Does anybody know of good introductory resources on error
       | recovery techniques when writing a parser & interpreter? Bonus
       | points if they use combinators instead of a hand-written parser.
        
         | ruste wrote:
         | I've actually been meaning to write something up about this.
         | I've found you can get fairly good error messages from a parser
         | combinator style parser just by saving the deepest location you
         | successfully parsed to and the parser you failed in.
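          | 
          | A sketch of that trick in Python (the combinator names are
          | made up): record the furthest position any parser reached,
          | and report that on overall failure.
          | 
          |     class Furthest:
          |         def __init__(self):
          |             self.pos = 0
          |             self.expected = []
          | 
          |         def note(self, pos, what):
          |             # Keep only the deepest failure point; collect
          |             # everything that was expected there.
          |             if pos > self.pos:
          |                 self.pos, self.expected = pos, [what]
          |             elif pos == self.pos:
          |                 self.expected.append(what)
          | 
          |     def literal(word, furthest):
          |         def parse(text, pos):
          |             if text.startswith(word, pos):
          |                 return pos + len(word)   # success
          |             furthest.note(pos, repr(word))
          |             return None                  # failure
          |         return parse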
        
         | jstimpfle wrote:
         | Not a resource, but depending on what kind of parser you are
         | doing, maybe an idea.
         | 
         | The thing I'm going to try out next for one of my batch parsers
         | (written recursive descent style as many or most parsers in the
          | wild) is to have AST nodes of kind "invalid". The idea is that
          | the parser should always complete, with minimal in-line error
          | checking, even when I have, say, "require_token_kind()" or
          | similar calls in my parser code.
         | 
         | Maybe not even an "invalid" kind, but an additional "invalid"
         | flag, so that even those invalid nodes can still have all the
         | regular node-kind-depending structure on them.
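          | 
          | Concretely, a sketch of the flag variant in Python (names
          | hypothetical, following the "require_token_kind()" idea):
          | 
          |     from dataclasses import dataclass, field
          | 
          |     @dataclass
          |     class Node:
          |         kind: str                  # "binop", "call", ...
          |         children: list = field(default_factory=list)
          |         invalid: bool = False      # set during recovery
          | 
          |     def require_token_kind(tokens, pos, kind):
          |         # Return (token, new_pos, ok). On mismatch, don't
          |         # abort: the caller marks the enclosing node
          |         # invalid and parsing continues.
          |         if pos < len(tokens) and tokens[pos][0] == kind:
          |             return tokens[pos], pos + 1, True
          |         return None, pos, False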
         | 
         | The other obvious idea is to use a parser generator. But I'm
         | pretty sure I wouldn't be happy on that route - from the little
         | experiments that I've done, I infer it's too bureaucratic, too
         | much boilerplate for me to suffer.
         | 
          | The more general advice for error handling is that you should
          | try to make APIs that require very little in-line error
          | handling. Instead, keep errors behind the APIs, maybe collect
          | them in a list, or only keep the first or last error, and
          | allow them to be handled at a later point (out-of-line).
         | 
         | And boom, most of the talk about smart syntax and type systems
         | and other clever systems to make error handling more convenient
         | suddenly becomes obsolete.
         | 
          | Yet more abstractly, don't try to build systems that make
          | cumbersome things more convenient, but think about how you can
          | get by with fewer of these cumbersome things. (That works
          | wonders for runtime performance, too.)
        
           | tester756 wrote:
           | >The more general advice for error handling is that you
           | should try to make APIs that require very little in-line
           | error handling. Instead, keep errors behind the APIs, maybe
           | collect them in a list, or only keep the first or last error,
            | and allow them to be handled at a later point (out-of-line).
           | 
            | I guess you mean that instead of just printing what's wrong
            | at some point, you should be smarter and build handier error
            | handling, like attaching the error to an AST node and then
            | maybe doing something with it later.
        
             | jstimpfle wrote:
             | The important point is the temporal decoupling between the
             | time where the error is detected and the time where it is
             | handled.
        
           | [deleted]
        
       | UncleEntity wrote:
       | _The Zephyr Abstract Syntax Description Language_ , Wang et al.
       | 
       | Has been invaluable for creating the AST (and soon the IR
       | nodes)-- though it has been a source of much yak shaving as I've
       | rewritten my asdl generator at least 3 times.
       | 
       |  _Language Implementation Patterns: Create Your Own Domain-
       | Specific and General Programming Languages Book_ , Terence Parr
       | 
       | Pretty java/antlr focused but wasn't too hard to take the
       | concepts and apply them to a C++ based transpiler I've been
       | slowly poking at.
       | 
       |  _Combining Analyses, Combining Optimizations_ , Cliff Click
       | 
       | His thesis and an extension of the linked paper _Global Code
       | Motion, Global Value Numbering_ -- my next step down the rabbit
       | hole...
       | 
       |  _Tree Automata Techniques and Applications_ , Comon et al.
       | 
        | Still on the fence on this one vs the iburg/BURS algorithm,
        | which would probably be more useful and match better with the
        | way Parr's book goes about things.
        
       | phonebucket wrote:
       | One thing I find curious: lots of compiler writing seems to be
       | done in Haskell, but I can hardly find any Haskell-based
       | resources for writing compilers.
       | 
       | Can anyone here please recommend any such resources?
        
         | toolslive wrote:
         | https://www.microsoft.com/en-us/research/publication/impleme...
        
         | atsuzaki wrote:
         | This is a pretty good tutorial that I know of:
         | https://www.stephendiehl.com/llvm/. The author went on to write
         | a book too, though it's still in progress
         | http://dev.stephendiehl.com/fun/
        
         | tester756 wrote:
         | >lots of compiler writing seems to be done in Haskell
         | 
          | those for fun or real-world ones?
        
           | mechanicalDigit wrote:
           | Most compilers are written for fun. That's the point of the
           | thread.
           | 
            | Also, a lot of modern language research is done through the
            | "lens" (heh) of Haskell, or at least the Glasgow school.
        
       | piinbinary wrote:
       | Are there any good resources for building a runtime and codegen
       | for languages with garbage collection?
        
         | moonchild wrote:
         | For a basic GC, code generation should be unaffected. The only
         | place you'd need to care about code generation is if you wanted
         | to make a realtime GC, inserting read or write barriers.
        
           | piinbinary wrote:
           | My understanding from doing a bit of reading is that there
           | are at least two things codegen-adjacent that an accurate
           | collector will want (even if you stop the world while
           | collecting):
           | 
           | 1. The locations of pointers inside of allocations.
           | 
           | 2. The locations of pointers in the stack (into which you may
           | need to spill registers).
           | 
           | I can think of ways to do these (e.g. perhaps each stack
           | frame contains a bitmask at a certain offset), but I'd love
           | to learn what the best practices are.
        
             | moonchild wrote:
             | Usually your objects are boxed, and the box contains a type
             | identifier which the gc can use to identify pointer
             | offsets. It is apparently possible to do precise garbage
             | collection without boxes--I'll see if I can find the paper
             | and link it later--but this is not widely done.
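              | 
              | A sketch of the boxed layout in Python (the field names
              | are made up): the type descriptor records which slots
              | hold pointers, so the collector can trace precisely.
              | 
              |     from dataclasses import dataclass
              | 
              |     @dataclass
              |     class TypeInfo:
              |         name: str
              |         pointer_offsets: list  # slots holding references
              | 
              |     @dataclass
              |     class Box:
              |         type_info: TypeInfo
              |         slots: list            # the object's fields
              | 
              |     def trace(obj, mark):
              |         # Visit every heap reference in the object.
              |         for off in obj.type_info.pointer_offsets:
              |             child = obj.slots[off]
              |             if isinstance(child, Box):
              |                 mark(child)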
        
         | aardvark179 wrote:
         | The Garbage Collection Handbook is excellent, but might go into
         | more depth than you need. For a simple garbage collector there
          | needn't be any effect on codegen; all you need to worry about
         | is the runtime side of things. Search for how to build a
         | conservative garbage collector and you should find plenty of
         | tutorials and introductory CS courses on the subject.
        
       | nethunters wrote:
       | Any similar resources on writing interpreters?
       | 
        | Also, is there a technical difference between an interpreter
        | and a VM?
        
         | mtlynch wrote:
         | I haven't read it, but Bob Nystrom's _Crafting Interpreters_
         | has been on my list for a while:
         | 
         | https://craftinginterpreters.com/
        
           | smueller1234 wrote:
           | Don't wait, read it. It's fun and informative, worth every
           | minute! I haven't had as much fun reading a technical book in
           | years!
        
           | orthoxerox wrote:
           | It's a great book. Bob's not a CS teacher, and it's great:
           | the book is grounded in practice and has been structured in a
           | way that lets you get a tangible result at the end of every
           | chapter.
        
           | LordN00b wrote:
           | I second this. It's a great resource, very well written.
           | Opinionated enough to have character and humour, but not so
           | dry as to make your eyes glaze over. I guess I'm saying that
            | for a technical tome it's very, very readable (as was his
            | Game Programming Patterns book, incidentally)
        
         | theodpHN wrote:
         | Check out Peter Norvig's neat BASIC Interpreter
         | https://github.com/norvig/pytudes/blob/master/ipynb/BASIC.ip...
        
         | MH15 wrote:
         | Interpreters usually execute by walking the abstract syntax
         | tree at runtime, bytecode VMs usually execute actual bytecode
         | similar to machine code. The bytecode for such VMs is often way
         | simpler than the instruction set for an actual computer, so
         | it's far more portable between platforms.
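          | 
          | A toy illustration of the two styles in Python (the node and
          | opcode names are invented for the example):
          | 
          |     # AST walking: execution follows the tree recursively.
          |     def eval_ast(node):
          |         op = node[0]
          |         if op == "num":
          |             return node[1]
          |         if op == "add":
          |             return eval_ast(node[1]) + eval_ast(node[2])
          | 
          |     # Bytecode VM: a flat loop over simple instructions.
          |     def run(code):
          |         stack = []
          |         for op, arg in code:
          |             if op == "PUSH":
          |                 stack.append(arg)
          |             elif op == "ADD":
          |                 b, a = stack.pop(), stack.pop()
          |                 stack.append(a + b)
          |         return stack.pop()
          | 
          |     # (1 + 2) both ways:
          |     assert eval_ast(("add", ("num", 1), ("num", 2))) == 3
          |     assert run([("PUSH", 1), ("PUSH", 2), ("ADD", None)]) == 3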
        
           | tom_mellior wrote:
           | > Interpreters usually execute by walking the abstract syntax
           | tree at runtime, bytecode VMs usually execute actual bytecode
           | similar to machine code.
           | 
           | This isn't the right way to separate these concepts. A VM
           | that executes bytecodes by interpretation is also an
           | interpreter (Python, Ruby, PHP are well-known examples). Many
           | VMs don't interpret, or they interpret mostly during startup,
           | but actually try to compile hot parts of the code and execute
           | that machine code rather than interpreting (Java and
           | JavaScript VMs are well-known examples).
           | 
           | The VM part more properly refers to things like data
           | representation and memory management. Interpreters of an
           | abstract syntax tree, interpreters of bytecodes, and just in
           | time compilers all need to run inside a system that takes
           | care of loading code, allocating memory when requested, and
           | often doing garbage collection. These services are what make
            | a VM. The exact execution strategy is not what determines
            | VM-ness.
        
           | nethunters wrote:
           | Thanks for clearing that up.
           | 
           | Which languages or implementations of languages directly
           | interpret the AST without the intermediary bytecode
           | compilation?
           | 
            | I know Python, Java and JavaScript (V8 and SpiderMonkey) all
            | compile to bytecode first, probably to speed up subsequent
            | runs and to run some optimisations on the bytecode.
           | 
           | What other benefits are there to compiling to bytecode first
           | vs directly interpreting the AST?
        
             | smueller1234 wrote:
              | Perl is a tree-based interpreter. It's not exactly an AST
              | anymore by the time it's being executed, but close enough.
        
             | herohamp wrote:
             | Portability. Say I wanted to make language X run on all
             | platforms, but I didn't actually care about compiling it on
             | all platforms. I can just write a relatively simple VM for
             | each platform. This is one of the reasons Java was and
             | still kinda is so ubiquitous
        
               | nethunters wrote:
               | Wouldn't writing an interpreter for each platform be less
               | work and achieve the same goal as writing a VM for each
               | platform?
               | 
               | Edit: ^Aside from being able to execute the bytecode on
               | any platform
        
               | kaba0 wrote:
                | As others mentioned, you'd have to distribute source
                | code that way, and I think creating a simple VM is
                | easier than a simple language parser. But of course, an
                | optimizing one can be really quite complex in both
                | cases.
        
               | klyrs wrote:
               | Why would it be less work? The interpreter will need to
               | implement whatever operations a VM can perform, so a
                | priori it's _at least as much_ work. Bonus: if you can
                | bootstrap the source->bytecode process, then you only
                | need to write (and compile) that once to get a full-
                | fledged interpreter on every host with a VM.
        
             | CalChris wrote:
             | If you compile to AST and walk that then your portability
             | is at the source level; you have to send the source over
             | the wire; each target then needs a parser+walker and each
             | target needs to parse+walk. If you compile to bytecode you
             | can send bytecode over the wire and then simply interpret
             | that bytecode.
        
             | jhgb wrote:
             | Gambit Scheme, CLISP, CMUCL are capable of interpreting
             | ASTs, and I believe (although I'm not 100% sure) that this
             | is the case for Lispworks and Allegro CL as well. Also,
             | Ruby until version 1.8.
        
             | IainIreland wrote:
             | One major benefit of compiling to bytecode first is that
             | bytecode is a more convenient shared source of truth.
             | 
             | For example, SpiderMonkey has two interpreter and two
             | compiler tiers. The output of the parser is bytecode. We
             | can interpret it immediately, and it's a convenient input
             | to the compilers. It also simplifies transitions between
             | tiers: for example, when we bail out of Warp, we resume at
             | a specific bytecode offset in the baseline interpreter.
             | 
             | I'm not sure how you would resume at a particular point in
             | the execution of an AST-based interpreter without agreeing
             | on a linearized traversal of the AST, which is 90% of the
             | way to bytecode.
        
           | quelltext wrote:
           | Keep in mind that bytecode interpreted languages (Ruby,
           | Python) are typically called interpreted languages. Java is
           | usually called "compiled" because of the explicit step vs.
            | Ruby and Python, but it's essentially the same. And in
            | discussions about the JVM, "interpretation" in Java
            | typically refers to bytecode interpretation, while
            | "compiling" means JIT compiling.
           | 
           | Ultimately limiting interpreters to AST interpreters is not
           | quite correct. The AST is just another IR that source code
           | needs to be converted to just like bytecode. And the AST is
           | essentially also executed by a virtual machine.
           | Interpretation of the IR (the AST, or bytecode, etc.) is one
           | part of a VM. Of course in some instances the VMness of a
           | runtime is more pronounced (Java) than in others (Ruby).
           | 
            | The difference between interpretation and compilation is
            | that the latter produces output meant to run directly on
            | real hardware, while the former implies something that
            | executes the IR of a program by dynamically choosing which
            | machine instructions to run.
           | 
           | Of course a compiler is also something that takes in some
           | code and outputs some other, typically more low level
           | representation.
           | 
           | My point being I don't think there is a strict or consistent
           | definition these days for what an interpreter is.
           | 
           | Case in point: I've also heard people say interpreters read
           | the code "line by line" (or rather, not that but more
           | accurately, as far as they need to read before they know what
           | to execute next) and execute it piece by piece. Which might
           | be how some interpreters worked in some languages in the
           | past. An AST interpreter OTOH already implies some whole
            | source file pre-processing. Is an AST interpreter then less
            | of an interpreter than one that streams the source code line
            | by line? Is a program that does that more an interpreter than
           | another which goes a step further and, e.g. executes a
           | simplified AST, or one that executes a straightforward
           | trivial bytecode representation of the simplified AST?
        
         | CodeArtisan wrote:
         | A virtual machine executes instructions while an interpreter
         | takes source code as input. An interpreter could be built on
          | top of a virtual machine, obviously, but it doesn't have to be.
          | For example, SICP/PLAI/OOPLAI[1] explain how to implement an
          | interpreter on top of Scheme where you directly interpret the
          | s-expressions. These may be worth a read if you want to learn
         | about the usual concepts and techniques used in programming
         | language implementations from a more general point of view.
         | Like, for example, how to implement a static type system or an
         | object model based on classes. Interpreters based on a virtual
         | machine are actually compilers on the fly; the source code is
         | not interpreted but translated into instructions beforehand.
         | 
         | [1] http://sarabander.github.io/sicp/
         | 
         | http://cs.brown.edu/~sk/Publications/Books/ProgLangs/2007-04...
         | 
         | https://users.dcc.uchile.cl/~etanter/ooplai/
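          | 
          | In that spirit, a tiny s-expression evaluator sketch in
          | Python (plain lists stand in for s-expressions; the supported
          | forms are just enough for one example):
          | 
          |     def evaluate(expr, env):
          |         if isinstance(expr, str):      # variable reference
          |             return env[expr]
          |         if not isinstance(expr, list): # literal
          |             return expr
          |         op = expr[0]
          |         if op == "if":                 # ["if", t, then, else]
          |             _, test, then, alt = expr
          |             branch = then if evaluate(test, env) else alt
          |             return evaluate(branch, env)
          |         f = evaluate(op, env)          # application
          |         return f(*[evaluate(a, env) for a in expr[1:]])
          | 
          |     import operator
          |     env = {"+": operator.add, "<": operator.lt}
          |     print(evaluate(["if", ["<", 1, 2], ["+", 40, 2], 0], env))
          |     # -> 42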
        
       | iTokio wrote:
       | By the author of the excellent https://c9x.me/compile/
       | 
        | A lightweight alternative to LLVM:
       | https://c9x.me/compile/doc/llvm.html
        
       | dmitriid wrote:
       | I'm not an expert by far, but I feel like many people who write
       | compilers or develop languages no longer recommend The Dragon
       | Book as it's very outdated.
       | 
        | Additionally, the vast majority of compiler books and resources
        | spend too much time on the easiest part of compiling: parsing.
        | It's not uncommon for a book or a course to dedicate 60-70% of
        | its time to parsing, leaving almost nothing for everything else.
        
         | MaxBarraclough wrote:
         | Yep, here's a thread from two years ago on exactly that:
         | https://news.ycombinator.com/item?id=20436399
        
         | rualca wrote:
         | > Additionally, the vast majority of compiler books and
         | resources spend too much time on the easiest part of compiling:
         | parsing.
         | 
         | I disagree. Parsing is the single most important feature of
         | building a programming language. It influences how the language
         | can be structured and also how it should be structured for
         | performance reasons.
         | 
         | Also, you're far more likely to write a parser for a language
         | than implement anything remotely related to compilers and
         | interpreters.
         | 
         | It's one thing to state that some topics might justify more
         | content, but it's absurd to downplay the fundamental importance
         | of parsers in putting together a compiler.
        
           | tom_mellior wrote:
           | The problem with the old old Dragon Book's treatment of
           | parsing is that it explains a particular impractical
           | algorithm (LR(1)? LR(k) maybe? it's been a while) in way too
           | much detail. You would need that detail if you were to
           | implement a parser generator (for that specific algorithm),
           | or a parser using that specific algorithm, which isn't
           | terribly well suited for generating useful error messages for
           | your users.
           | 
           | Today's computers are ridiculously powerful, and the
           | perceived performance advantages of LR parsing are simply
           | irrelevant. Further, even if you were to implement an LR
           | parser, you would do it with the aid of a parser generator,
           | which hides the boring details and saves you from needing to
           | understand the LR algorithm itself. But more likely when you
           | write a parser you use a parser generator using a different
           | approach, or you write a recursive descent parser by hand. In
           | either case, the dry parts of the old old Dragon Book are
           | completely irrelevant.
           | 
           | The people criticizing the 1986 Dragon Book's treatment of
           | parsing aren't saying that parsing is irrelevant. They are
           | saying that parsing _that specific way_ is either irrelevant
           | or done for you by a tool that is not the interesting part of
           | a compiler.
        
           | jhgb wrote:
           | > I disagree. Parsing is the single most important feature of
           | building a programming language. It influences how the
           | language can be structured and also how it should be
           | structured for performance reasons.
           | 
           | Wirth already settled that question (of both structuring the
           | language and performance) with recursive descent over thirty
           | years ago. The necessary treatment of all things parsing is
           | then covered on twenty pages of his Compiler Construction
           | text.
        
             | rualca wrote:
             | > Wirth already settled that question (of both structuring
             | the language and performance) with recursive descent over
             | thirty years ago.
             | 
             | "Recursive descent" is a whole class of parsing algorithms.
             | Claiming that recursive descent settled the question is a
             | kin to claim that compiled programming languages settled
             | the question of how to develop programming languages.
             | 
             | Moreover, even if you go as far as believing that a
             | specific class of parsing algorithms settled anything, it's
             | absurd to argue that people who are learning how to write a
             | parser should not learn about them, specially as your
             | personal choice has fundamental limitations such as left-
             | recursice operations and very specific techniques to
             | mitigate them.
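              | 
              | For example (a sketch; the parse_term helper is assumed
              | to exist and to return a (node, remaining_tokens) pair):
              | a left-recursive rule like expr -> expr '-' term sends a
              | naive recursive-descent parser into infinite recursion,
              | so it is typically rewritten as iteration:
              | 
              |     def parse_expr(tokens):
              |         # expr -> term ('-' term)*
              |         # Left recursion turned into a loop, rebuilding
              |         # left-associative nodes as we go.
              |         node, rest = parse_term(tokens)
              |         while rest and rest[0] == "-":
              |             right, rest = parse_term(rest[1:])
              |             node = ("sub", node, right)
              |         return node, rest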
             | 
             | And finally, we still see articles being accepted into
             | academic journals on how to parse concrete languages such
             | as JSON. I'm sure we can agree that slapping together
             | something with a random parser generator is not something
             | that servers the students' best interests.
        
               | jhgb wrote:
               | It turns out that whatever limitations there are to that
               | approach hardly matter in practice, since Wirth developed
               | a wide variety of languages this way, and they work just
               | fine.
        
               | rualca wrote:
               | > It turns out that whatever limitations there are to
               | that approach hardly matter in practice (...)
               | 
               | How do you know if you do not learn about parsers?
               | 
               | Do you see where I am going with this?
               | 
                | It's absurd to claim that parsers are not fundamental and
                | central to developing compilers. That statement is
                | patently false, especially considering that some
                | programming languages are notoriously challenging to
                | parse.
        
               | jhgb wrote:
               | > That statement is patently false, specially considering
               | that some programming languages are notoriously
               | challenging to parse.
               | 
               | Sure, if you create an artificial problem in the first
               | place, you need an artificial solution for it. That's why
               | careful language design is even more important than
               | parsers.
        
           | HelloNurse wrote:
           | If parsing "influences how the language can be structured and
           | also how it should be structured for performance reasons"
           | it's probably going to be a mediocre language, in which self-
           | imposed constraints interfere with useful special features
           | resulting in a bland design lacking competitive advantages.
           | 
           | Pascal has already been cited in other comments as having
           | been designed for ease of parsing, but the Turbo Pascal
           | compiler is an isolated masterpiece, and high-quality
           | compilation at decent speed is normally more useful than
           | cheap compilation at ludicrous speed.
           | 
           | Maybe parsing seems a large part of constructing a compiler
           | because it's naturally the first task and because it's likely
           | to be entangled with designing the syntax of the language,
           | which can pose difficulties (e.g. ambiguities, context
           | sensitivity, messy operator precedence).
        
           | dmitriid wrote:
           | > Parsing is the single most important feature of building a
           | programming language
           | 
            | It really isn't. Also, it's the most trivial part of
            | compiling a programming language.
            | 
            | > It influences how the language can be structured and also
            | how it should be structured for performance reasons.
            | 
            | Once again, no, not really. However complex your language is,
            | it can be parsed at 100 GB/s on a Raspberry Pi. Even if you
            | do it with regexps. Everything that comes _after_ parsing is
            | very complex and is rarely explained in any detail. And
            | everything that comes _after_ actually _directly_ influences
            | the structure, the performance, etc.
           | 
            | It really makes zero difference if you decide that your
            | typing info should be written like this:
            | 
            |     T x = new T();
            | 
            | or like this:
            | 
            |     let x: T = T()
            | 
            | or like this:
            | 
            |     var x := T()
            | 
            | All this is trivial to parse. However, deciding on how you
            | implement typing, what needs to be in the language to
            | support your choice, etc., now _that_ definitely influences
            | the structure, performance, trade-offs etc.
           | 
           | How much of that is in the Dragon Book? Next to zero? But
           | surely there are 4 chapters on how to parse this.
        
             | rualca wrote:
             | > It realy isn't. Also, it's the most trivial part of
             | parsing a programming language (in a compiler).
             | 
             | No,not really. I mean, think about it for a second: while
             | you cannot get a compiler out of the door withot a parser,
             | you can indeed leave all the fancy optimization features
             | out, don't you? And if you want to get a MVP out of the
             | door, do you focus your effort in developing optional
             | features or the basic, critical part that ensures that your
             | language exists?
             | 
             | And then there's developer experience. Syntax/grammar
             | errors? You need the parser for that, don't you agree?
             | 
             | > Once again, no, not really. However complex your language
             | is, it can be parsed at 100 GB/s on a Raspberry Pi.
             | 
              | Actually, you're quite wrong. RapidJSON only handles a very
              | basic language, doesn't need to output meaningful errors
              | for weird input, and it boasts at most 1GB/s.
             | 
             | And this is a parser optimized for speed.
             | 
              | Perhaps if more people had a clue about compilers, these
              | sorts of errors and misconceptions wouldn't be so common.
        
               | estebank wrote:
               | > while you cannot get a compiler out of the door withot
               | a parser, you can indeed leave all the fancy optimization
               | features out, don't you? And if you want to get a MVP out
               | of the door, do you focus your effort in developing
               | optional features or the basic, critical part that
               | ensures that your language exists?
               | 
               | > And then there's developer experience. Syntax/grammar
               | errors? You need the parser for that, don't you agree?
               | 
                | These two statements are at odds with each other, as
                | having great developer experience is indeed a "fancy
                | optimization" that you shouldn't spend too much time on
                | when starting a project (and I say that as someone who
                | spends every day thinking about making compiler errors
                | better). Part of the reason the literature around
                | parsers is problematic is that graceful error recovery
                | isn't explored in detail.
               | 
               | > And this is a parser optimized for speed.
               | 
               | Parse time is never the bottleneck. It is a waste of time
               | to focus on optimizing parse speed at the beginning. It
               | is worth it to eventually measure and tune it, and to
               | think about the implications when designing your grammar,
               | but beyond that the slowness of compilers is never driven
               | by their parsers.
               | 
               | > you cannot get a compiler out of the door withot a
               | parser
               | 
               | Technically, you can: you make your compiler something
               | that consumes an already existing bytecode (like the
               | JVM's or .Net's CLI) or only implement an AST parser that
               | isn't intended for humans to actually write while you
               | nail down the specifics.
               | 
               | > Syntax/grammar errors? You need the parser for that,
               | don't you agree?
               | 
               | Error recovery in a parser is a tricky thing: it is
               | intimately tied to the choices you've made in your
               | grammar. The more redundancy and uniqueness between
               | scopes you can have, the easier it can be for your parser
                | to identify where things have gone badly. For example, in
                | Python if you write
                | 
                |     if foo = bar:
                |         print(foo)
                | 
                | you have little information but can still infer that you
                | likely wanted to compare equality using == instead. But
                | because Python has a very flexible grammar and semantics,
                | the following is valid but likely not what was intended:
                | 
                |     class Foo:
                |         def bar(self):
                |             print("hi!")
                | 
                |     foo = Foo()
                |     foo.bar = "hi!"
               | 
                | If you make different, more restrictive decisions on both
                | grammar and semantics, you can still provide tools that
                | allow people to do this by being more verbose, while
                | people who do it by accident can be given excellent
                | diagnostics.
        
               | dmitriid wrote:
               | > I mean, think about it for a second: while you cannot
               | get a compiler out of the door withot a parser, you can
               | indeed leave all the fancy optimization features out,
               | don't you? And if you want to get a MVP out of the door,
               | do you focus your effort in developing optional features
               | or the basic, critical part that ensures that your
               | language exists?
               | 
                | Exactly: _an MVP_. For an _MVP_ the parser may indeed be
                | "the single most important feature of building a
                | programming language". But even then it almost
                | definitely isn't.
                | 
                | The moment you "leave out fancy optimisation features"
                | your language becomes at most mediocre.
               | 
               | Because between parsing (which is trivial) and "fancy
               | optimisation features" (which come very late in a
               | compiler life) there are a million things that are
               | significantly more important than parsing:
               | 
               | - how do you do typing?
               | 
                | - how do you do code generation (do you output machine
                | code yourself or use a backend)?
               | 
               | - do you use an intermediate representation?
               | 
               | - what are the actual optimisations on performance (none
               | of which have anything to do with parsing)? As an
               | example, C++'s horrible header system has nothing to do
                | with parsing, but slows compiling by orders of
                | magnitude.
               | 
               | - how do you do graceful error recovery and reporting?
               | 
               | - how are lambdas handled? how are scopes handled? memory
               | layouts? GC/ARC/manual memory management? standard lib?
               | 
               | - a million other things
               | 
               | > Actually, you're quite wrong.
               | 
                | It was hyperbole. Do you even understand how fast 1GB/s
                | is? It's roughly 10 to 20 million lines of code [1].
                | Windows 10 is roughly 50 million lines [2]. So, at 1GB/s
                | you can parse the _entire Windows 10 codebase_ in under
                | 5 seconds.
                | 
                | Even if your parsing speed is 1000 times slower (3 orders
                | of magnitude slower), you can still parse ~10k lines a
                | second. And I honestly can't even begin to think how you
                | could parse something that slowly. And if you do, it's
                | still not a concern, because the rest comes from outside
                | parsing, for example:
               | 
                | - can you run parsers in parallel?
               | 
               | - do you need to parse the entirety of your project to
               | begin compilation, or your files/modules are isolated and
               | compiler can work on any of those?
               | 
               | Parsing is _trivial_.
               | 
                | [1] For example, an SO answer to an unrelated question
                | creates a file with 150 million lines, and the result is
                | 6GB. 150/6 = 25 million lines per GB; I went for even
                | lower numbers.
               | 
               | [2] https://answers.microsoft.com/en-
               | us/windows/forum/all/window...
        
         | tom_mellior wrote:
         | > I'm not an expert by far, but I feel like many people who
         | write compilers or develop languages no longer recommend The
         | Dragon Book as it's very outdated.
         | 
         | Yes. There are newer editions that I can't judge, but the 1986
         | edition linked from the page should nowadays only be read by
         | historians. If you want to implement a compiler, look
         | elsewhere, for something newer.
         | 
         | This isn't to say that this was a bad book 35 years ago. It
         | wasn't. But it is simply outdated. It does not contain things
         | that are important today. It has very little treatment of type
         | systems and type inference, for example. Its treatment of data
         | flow analysis is based on intervals, which have limitations
         | (though also some interesting theoretical advantages) and are
         | used by zero compilers written in this millennium (and not even
         | described very well in the Dragon book IIRC). And then of
         | course there is the parsing problem, see my rant elsethread.
         | 
         | As for the article, I think A Catalogue of Optimizing
         | Transformations (https://www.clear.rice.edu/comp512/Lectures/Pa
         | pers/1971-alle...) would make an interesting addition.
         | 
         | Edit: Forgot to add that the old old Dragon book doesn't cover
         | two topics that the article itself considers very relevant: SSA
         | form (which I believe was discovered just around that time) and
         | linear scan register allocation (which is much newer).
        
           | aarchi wrote:
           | What would you recommend that covers data flow analysis
           | better than the dragon book? I'm currently starting data flow
           | analysis for my own compiler using intervals and/or modular
           | congruences (operations can be reduced to congruences, then
           | combined with the Chinese Remainder Theorem). What are the
           | theoretical advantages of intervals and what do you recommend
           | over that?
        
             | tom_mellior wrote:
             | Ah, sorry, I should probably clarify that there are two
             | completely distinct uses of the term "interval" in data
             | flow analysis.
             | 
             | One (introduced in https://amturing.acm.org/p137-allen.pdf
             | and discussed in the old Dragon Book) groups _program
             | points_ into nested  "intervals", i.e., intervals are
             | pieces of code. Analysis information is then propagated
             | among these intervals. This is what I was referring to
             | above. As far as I know this approach is completely
             | obsolete by now, with everyone using methods based more or
              | less indirectly on Kildall's worklist iteration approach
              | (http://static.aminer.org/pdf/PDF/000/546/451/a_unified_appro
              | ...). The theoretical advantage I was referring to above
             | was that the interval approach is AFAIR sometimes more
             | efficient because it can take program structure (loop
             | nesting) into account more directly.
             | 
             | The _other_ meaning of  "intervals" is used for an
             | _analysis domain_ , i.e., the kind of data you want your
             | analysis to compute. For example, you might determine that
             | the possible values of some variable x are in the interval
             | [10, 20], the values of y are in the interval [5, 15], and
             | then you can compute that the values of z = x + y must be
             | in [15, 35]. There is nothing wrong with this approach, and
             | any optimizing compiler is pretty much forced to use
             | something like this. Based on your reference to modular
             | congruences I think you probably want to do analysis of
             | numerical values and meant this sense, which is anyway the
             | only one in common use nowadays. Go ahead and do this.
             | Extensions are with congruences, like "x is in [10, 20] and
             | also congruent to 1 modulo 4" and/or "known bits", like "x
             | is in [10, 20], and I know that (counting from the LSB) its
             | bit 2 is 0 and bit 3 is 1".
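              | 
              | A minimal sketch of that interval domain in Python (just
              | the addition transfer function from the example above):
              | 
              |     from dataclasses import dataclass
              | 
              |     @dataclass
              |     class Interval:
              |         lo: int
              |         hi: int
              | 
              |         def __add__(self, other):
              |             # Abstract addition: add bounds pointwise.
              |             return Interval(self.lo + other.lo,
              |                             self.hi + other.hi)
              | 
              |     x = Interval(10, 20)
              |     y = Interval(5, 15)
              |     print(x + y)   # Interval(lo=15, hi=35), as above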
             | 
             | Personally I learned about data flow analysis in a
             | university course based on Principles of Program Analysis:
             | https://www.springer.com/gp/book/9783540654100. It's a book
             | that goes into a lot of mathematical depth but not much
             | breadth of analyses, and it doesn't handle data flow
             | analysis on SSA form. Then I learned additional details
             | from various well-known papers like the ones linked above,
             | and the classical early papers around analysis on SSA form,
             | like https://www.cse.wustl.edu/~cytron/531Pages/f11/Resourc
             | es/Pap.... I'm sure there must be a more straightforward
             | text to get you there, but I don't think I can give a
             | definite recommendation. I have vague memories of the
             | coverage in Muchnick's "Advanced Compiler Design and
             | Implementation" and/or Cooper and Torczon's "Engineering a
             | Compiler" being reasonable, but I can't find my copies
             | right now.
        
         | Rochus wrote:
         | > no longer recommend The Dragon Book as it's very outdated
         | 
         | Only if they didn't realize there is a second edition from
         | 2006.
        
       | testrvb wrote:
       | test
        
       ___________________________________________________________________
       (page generated 2021-04-24 23:00 UTC)