[HN Gopher] Resources for Amateur Compiler Writers ___________________________________________________________________ Resources for Amateur Compiler Writers Author : kilodeca Score : 285 points Date : 2021-04-24 14:58 UTC (8 hours ago) (HTM) web link (c9x.me) (TXT) w3m dump (c9x.me) | chandlore wrote: | Latest versions of the ABI specifications linked in the _Machine | Specific_ section | | ARM: https://github.com/ARM-software/abi-aa/releases | | x86-64: https://gitlab.com/x86-psABIs/x86-64-ABI (go to the most | recent CI job and download the artifacts for a compiled PDF) | hardwaregeek wrote: | This is definitely a backend, machine-code-focused list. If | you're writing a compiler to WebAssembly or even the JVM you're | probably not gonna need a lot of these resources. | | Indeed, I'd argue that these lists are more for an intermediate | compiler writer. Beginner compiler writers don't need to know | register allocation. They need to understand how to think about | language translation, how to write an intermediate | representation, how to learn a target language. | | I've been toying with the idea of writing a post that goes | through the thought process of translation. Specifically, how you | build up states and invariants. When I write code generation I | spend time thinking about the different states the stack could be | in, or what invariant I need to keep across this expression (do I | need to save a register or global?). | | One simple but important example of this is realizing that, given | a proper, sound type checker, your code generator should almost | never give errors. By the time you reach code generation you | should have proven enough invariants that code generation cannot | fail. This isn't always true but I've found it to be the case. | seanmcdirmid wrote: | That reminds me of the Dragon Book, which goes heavily into VAX | assembly and optimizing for CISC machines. Probably not the | best resource for beginners or amateurs these days, but its | authors won a Turing Award, at least. | | The problem is that what it means to write a compiler is pretty | broad: some amateur projects focus on backend work, some on the | front end, many in between. | coryrc wrote: | Also tremendous amounts of time just on parsing C. I found Lisp In | Small Pieces far more interesting. | enos_feedler wrote: | You can certainly take any path into compiler writing. I don't | know anything about ASTs or compiler front ends but ended up | writing register allocators and constant folding for an Nvidia GPU | JIT. | jstanley wrote: | > Indeed I'd argue that these lists are more for an | intermediate compiler writer. Beginner compiler writers don't | need to know register allocation. | | I'm not disagreeing with you, but note that the title does not | say it's for beginners, it says it's for amateurs, which is | just anyone who does it for fun rather than money. | orthoxerox wrote: | > Beginner compiler writers don't need to know register | allocation. They need to understand how to think about language | translation, how to write an intermediate representation, how | to learn a target language. | | I don't think people that want to write a compiler want to end | up with helloworld.c instead of helloworld.exe. At least I | don't. | fooker wrote: | Ideally, you'd emit LLVM IR or some similar bytecode or | assembly. Then you'd use llc/opt or an appropriate assembler | to get an executable. | | Trying to emit machine code isn't too interesting when you're | writing your first compiler.
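   A minimal sketch of this approach in Python: the "compiler" below
   handles only a toy source language (a program is a single integer
   literal) and emits textual LLVM IR, leaving the machine-specific
   work to existing tools. The file names and function name here are
   invented for illustration.

       # emit_ir.py -- toy compiler that emits textual LLVM IR
       def compile_to_llvm_ir(value: int) -> str:
           # @main returns the literal as the process exit code.
           return "\n".join([
               "define i32 @main() {",
               "entry:",
               f"  ret i32 {value}",
               "}",
           ])

       if __name__ == "__main__":
           with open("out.ll", "w") as f:
               f.write(compile_to_llvm_ir(42))

       # Existing tools then produce the executable:
       #   llc out.ll -o out.s    # LLVM IR -> native assembly
       #   clang out.s -o out     # assemble and link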
| CalChris wrote: | Similarly, trying to parse C++ isn't too interesting when | you're writing your first backend. | hardwaregeek wrote: | At least for me, compiling to a simpler target made learning | the ideas easier. Compiling to raw assembly does give you | more nerd credit, but it also makes it harder to get started. | ajkjk wrote: | The opposite, really. I started learning about compilers | because I wanted to implement my own syntactic ideas. Don't | care at all how it runs. | bootwoot wrote: | The creator of C++ would disagree, as C++ was devised as a | compile-to-C language. | jhgb wrote: | Perhaps Cooper's and Torczon's "Engineering a Compiler (2nd | Edition)" should be among the entries in the "Books" category. | estebank wrote: | A lot of ink is spilled on the details of how compilers | take text and turn it into fast executables, but I | believe the user experience of compilers is just as important a | subject, and there's very little information about it in | the literature, outside of a handful of papers and the design | documents of existing compilers that do focus on this. | dmitriid wrote: | No idea why you are being downvoted, but it's true. | | Compiler UX is often horrible, but the new breed of languages | tries to rectify parts of it: | | - non-cryptic error messages. Show _exactly_ what failed, where | it failed, and how to solve it. | | - compiler-as-a-service. Tool makers will love you. | | - compilation speed. If I need to wait 2 minutes for a 10 LOC | project to compile, your compiler is doing something wrong. | | - the compiler should be the only tool you need. If you rely on | 15 different tools just to compile a project, you're doing | something wrong. | estebank wrote: | As you allude to, treating a compiler as an isolated batch | process is an outdated strategy. Any compiler started today | should be designed such that it can be used in an IDE from | the start, even if you don't implement it that way until | after you've reached a certain level of maturity. | | Error messages in particular are a very expansive topic, | because a compiler might have all of the following components: | lexer, parser, macro expansion, name resolution, lowering, | type checking, privacy checking, lifetime analysis, code | generation, linking; | and _all of them_ can produce errors. If you introduce error | recovery in your parser, then you'd better make sure you're | carrying that information forward to name resolution and type | checking. If you introduce an error type placeholder, you'd better | account for it during the rest of your analysis to avoid | complaining about non-existent methods on something that | wasn't resolved in the first place. Error deduplication. | Exploring the grammar space to identify things that a human | may attempt but that aren't allowed for good reason, and that will | not blow up until some quasi-random stage. Now that I think | about it, you could write a _whole book_ about this topic. | tomcam wrote: | What is lowering? | hctaw wrote: | Some languages require higher-level internal | representations ("IRs") to be "lowered" to lower-level | representations before code generation - in other words, | high-level semantics or syntactic sugar are transformed | into simpler constructs that are easier to compile. | | For example, a language could support special iterator | semantics that will be "lowered" into a simple for loop. | Or a for loop itself may be lowered into a do-while loop, | depending on the design of the lowered IR.
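   A minimal sketch of such a lowering pass in Python, over a made-up
   tuple-shaped AST. The node kinds and shapes here are invented for
   illustration, not taken from any real compiler.

       # Hypothetical node shapes:
       #   ("for", init, cond, step, body)  -- C-style for loop (sugar)
       #   ("while", cond, body)            -- core construct
       #   ("block", [stmt, ...])           -- statement sequence
       def lower(node):
           """Rewrite sugared nodes into a smaller core language."""
           kind = node[0]
           if kind == "for":
               _, init, cond, step, body = node
               # for (init; cond; step) body
               #   =>  { init; while (cond) { body; step } }
               return ("block", [lower(init),
                                 ("while", lower(cond),
                                  ("block", [lower(body), lower(step)]))])
           if kind == "block":
               return ("block", [lower(s) for s in node[1]])
           if kind == "while":
               return ("while", lower(node[1]), lower(node[2]))
           return node  # leaves pass through unchanged

       # The code generator then only ever sees "while", never "for":
       loop = ("for", ("assign", "i", 0), ("lt", "i", 10),
               ("assign", "i", ("add", "i", 1)), ("print", "i"))
       print(lower(loop))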
| hctaw wrote: | I downvoted it because I think it's a meme that ignores where | contemporary compiler engineering is happening. We don't even | realize they exist anymore - V8, Graal, Roslyn, etc. are all | optimizing JIT compilers running in the background that can | compile multiple languages on the fly. And they do a great | job. | | Not everything is a C++ compiler; C++ is inherently | difficult to compile correctly, let alone quickly. | | I think the future of compilers is that everything will have | an incremental parser and a JIT compiler, with optional AOT | compilation for release builds. Hopefully LSP and DAP will | take over the world of tooling integration. | | There are more fundamental mistakes to look at than error | messages and performance, like pretending that build systems | don't exist and that linking multiple compilation units isn't | something the compiler can do. Compiling code is getting much | easier, but building it is much harder. Compilers have the | most information and internal structure to handle all our | build problems, but don't do it. | turminal wrote: | > - non-cryptic error messages. Show exactly what failed, | where it failed, how to solve it. | | I'm working on a compiler in my free time and I agree with | you, but in defense of compilers that suck at this: error | reporting is really, really hard to do right. | Rochus wrote: | There is a more recent edition of "Compilers: Principles, | Techniques, and Tools" worth considering: | https://www.amazon.com/Compilers-Principles-Techniques-Tools.... | ChicagoDave wrote: | I've toyed with compiler stuff for years. My current idea is a | fluent/piping language. Is there any reference that would help | with that? | [deleted] | alanafarias22 wrote: | Loved this post | alanafarias22 wrote: | These comments help a lot | andromeduck wrote: | Wish there were more resources about or tools for bidirectional | parsing. | hasitseth wrote: | Alan Holub's 'Compiler Design in C' is not mentioned among the | books. Holub's book had well-written code; I learned a lot | from it. | elvis70 wrote: | And a free copy of this book in PDF has been made available by | the author: https://holub.com/compiler/ | failwhaleshark wrote: | ProTips after writing numerous kinds of compilers in university: | | Recursive descent with backtracking is much easier to construct | and maintain for diagnostics than bison/flex. | | Derivative parsing is very interesting. | ahelwer wrote: | Does anybody know of good introductory resources on error | recovery techniques when writing a parser & interpreter? Bonus | points if they use combinators instead of a hand-written parser. | ruste wrote: | I've actually been meaning to write something up about this. | I've found you can get fairly good error messages from a parser | combinator style parser just by saving the deepest location you | successfully parsed to and the parser you failed in. | jstimpfle wrote: | Not a resource, but depending on what kind of parser you are | writing, maybe an idea. | | The thing I'm going to try out next for one of my batch parsers | (written in recursive descent style, like many or most parsers in | the wild) is to have AST nodes of kind "invalid". The idea is that | the parser should always complete, with minimal in-line error | checking, even when I have, say, "require_token_kind()" or | similar calls in my parser code.
| | Maybe not even an "invalid" kind, but an additional "invalid" | flag, so that even those invalid nodes can still have all the | regular node-kind-dependent structure on them. | | The other obvious idea is to use a parser generator. But I'm | pretty sure I wouldn't be happy on that route - from the little | experiments that I've done, I infer it's too bureaucratic, too | much boilerplate for me to suffer. | | The more general advice for error handling is that you should | try to make APIs that require very little in-line error | handling. Instead, keep errors behind the APIs, maybe collect | them in a list, or only keep the first or last error, and allow | them to be handled at a later point (out-of-line). | | And boom, most of the talk about smart syntax and type systems | and other clever systems to make error handling more convenient | suddenly becomes obsolete. | | Yet more abstractly, don't try to build systems that make | cumbersome things more convenient, but think about how you can do | with fewer of these cumbersome things. (That works wonders for | runtime performance, too.) | tester756 wrote: | >The more general advice for error handling is that you | should try to make APIs that require very little in-line | error handling. Instead, keep errors behind the APIs, maybe | collect them in a list, or only keep the first or last error, | and allow them to be handled at a later point (out-of-line). | | I guess you wanted to say that instead of printing just | what's wrong at some point, be smarter and build some | handier error handling, like attaching errors to AST nodes and | then maybe doing something with them. | jstimpfle wrote: | The important point is the temporal decoupling between the | time when the error is detected and the time when it is | handled. | [deleted] | UncleEntity wrote: | _The Zephyr Abstract Syntax Description Language_, Wang et al. | | Has been invaluable for creating the AST (and soon the IR | nodes) -- though it has been a source of much yak shaving, as I've | rewritten my asdl generator at least 3 times. | | _Language Implementation Patterns: Create Your Own Domain- | Specific and General Programming Languages_, Terence Parr | | Pretty Java/ANTLR focused, but it wasn't too hard to take the | concepts and apply them to a C++-based transpiler I've been | slowly poking at. | | _Combining Analyses, Combining Optimizations_, Cliff Click | | His thesis, and an extension of the linked paper _Global Code | Motion, Global Value Numbering_ -- my next step down the rabbit | hole... | | _Tree Automata Techniques and Applications_, Comon et al. | | Still on the fence on this one vs. the iburg/BURS algorithm; it'll | probably be more useful and match better with the way Parr's book | goes about things. | phonebucket wrote: | One thing I find curious: lots of compiler writing seems to be | done in Haskell, but I can hardly find any Haskell-based | resources for writing compilers. | | Can anyone here please recommend any such resources? | toolslive wrote: | https://www.microsoft.com/en-us/research/publication/impleme... | atsuzaki wrote: | This is a pretty good tutorial that I know of: | https://www.stephendiehl.com/llvm/. The author went on to write | a book too, though it's still in progress: | http://dev.stephendiehl.com/fun/ | tester756 wrote: | >lots of compiler writing seems to be done in Haskell | | For fun, or in the real world? | mechanicalDigit wrote: | Most compilers are written for fun. That's the point of the | thread.
| | Also, a lot of modern language research is done from | the "lens" (heh) of Haskell, or at least the Glasgow school. | piinbinary wrote: | Are there any good resources for building a runtime and codegen | for languages with garbage collection? | moonchild wrote: | For a basic GC, code generation should be unaffected. The only | place you'd need to care about code generation is if you wanted | to make a realtime GC, inserting read or write barriers. | piinbinary wrote: | My understanding from doing a bit of reading is that there | are at least two codegen-adjacent things that an accurate | collector will want (even if you stop the world while | collecting): | | 1. The locations of pointers inside of allocations. | | 2. The locations of pointers in the stack (into which you may | need to spill registers). | | I can think of ways to do these (e.g. perhaps each stack | frame contains a bitmask at a certain offset), but I'd love | to learn what the best practices are. | moonchild wrote: | Usually your objects are boxed, and the box contains a type | identifier which the GC can use to identify pointer | offsets. It is apparently possible to do precise garbage | collection without boxes--I'll see if I can find the paper | and link it later--but this is not widely done. | aardvark179 wrote: | The Garbage Collection Handbook is excellent, but might go into | more depth than you need. For a simple garbage collector there | needn't be any effect on codegen; all you need to worry about | is the runtime side of things. Search for how to build a | conservative garbage collector and you should find plenty of | tutorials and introductory CS courses on the subject. | nethunters wrote: | Any similar resources on writing interpreters? | | Also, is there a technical difference between an interpreter | and a VM? | mtlynch wrote: | I haven't read it, but Bob Nystrom's _Crafting Interpreters_ | has been on my list for a while: | | https://craftinginterpreters.com/ | smueller1234 wrote: | Don't wait, read it. It's fun and informative, worth every | minute! I haven't had as much fun reading a technical book in | years! | orthoxerox wrote: | It's a great book. Bob's not a CS teacher, and it's great: | the book is grounded in practice and has been structured in a | way that lets you get a tangible result at the end of every | chapter. | LordN00b wrote: | I second this. It's a great resource, very well written. | Opinionated enough to have character and humour, but not so | dry as to make your eyes glaze over. I guess I'm saying that | for a technical tome it's very, very readable (as was his | Game Programming Patterns book, incidentally). | theodpHN wrote: | Check out Peter Norvig's neat BASIC interpreter: | https://github.com/norvig/pytudes/blob/master/ipynb/BASIC.ip... | MH15 wrote: | Interpreters usually execute by walking the abstract syntax | tree at runtime; bytecode VMs usually execute actual bytecode | similar to machine code. The bytecode for such VMs is often way | simpler than the instruction set for an actual computer, so | it's far more portable between platforms. | tom_mellior wrote: | > Interpreters usually execute by walking the abstract syntax | tree at runtime, bytecode VMs usually execute actual bytecode | similar to machine code. | | This isn't the right way to separate these concepts. A VM | that executes bytecodes by interpretation is also an | interpreter (Python, Ruby, PHP are well-known examples).
Many | VMs don't interpret, or they interpret mostly during startup, | but actually try to compile hot parts of the code and execute | that machine code rather than interpreting (Java and | JavaScript VMs are well-known examples). | | The VM part more properly refers to things like data | representation and memory management. Interpreters of an | abstract syntax tree, interpreters of bytecodes, and just-in-time | compilers all need to run inside a system that takes | care of loading code, allocating memory when requested, and | often doing garbage collection. These services are what make | a VM. The exact execution strategy is not what determines | VM-ness. | nethunters wrote: | Thanks for clearing that up. | | Which languages or implementations of languages directly | interpret the AST without the intermediary bytecode | compilation? | | I know Python, Java and JavaScript (V8 and SpiderMonkey) all | compile to bytecode first, probably to speed up subsequent | runs and run some optimisations on the bytecode. | | What other benefits are there to compiling to bytecode first | vs. directly interpreting the AST? | smueller1234 wrote: | Perl is a tree-based interpreter. It's not exactly an AST | anymore by the time it's being executed, but close enough. | herohamp wrote: | Portability. Say I wanted to make language X run on all | platforms, but I didn't actually care about compiling it on | all platforms. I can just write a relatively simple VM for | each platform. This is one of the reasons Java was, and | still kinda is, so ubiquitous. | nethunters wrote: | Wouldn't writing an interpreter for each platform be less | work and achieve the same goal as writing a VM for each | platform? | | Edit: ^Aside from being able to execute the bytecode on | any platform | kaba0 wrote: | As others mentioned, source code would have to be distributed | that way, and I think creating a simple VM is easier than | a simple language parser. But of course, an optimizing | one can be really quite complex in both cases. | klyrs wrote: | Why would it be less work? The interpreter will need to | implement whatever operations a VM can perform, so a | priori it's _at least as much_ work. Bonus: if you can | bootstrap the source->bytecode process, then you only | need to write (and compile) that once to get a full-fledged | interpreter on every host with a VM. | CalChris wrote: | If you compile to an AST and walk that, then your portability | is at the source level; you have to send the source over | the wire; each target then needs a parser+walker and each | target needs to parse+walk. If you compile to bytecode you | can send bytecode over the wire and then simply interpret | that bytecode. | jhgb wrote: | Gambit Scheme, CLISP, CMUCL are capable of interpreting | ASTs, and I believe (although I'm not 100% sure) that this | is the case for LispWorks and Allegro CL as well. Also, | Ruby until version 1.8. | IainIreland wrote: | One major benefit of compiling to bytecode first is that | bytecode is a more convenient shared source of truth. | | For example, SpiderMonkey has two interpreter and two | compiler tiers. The output of the parser is bytecode. We | can interpret it immediately, and it's a convenient input | to the compilers. It also simplifies transitions between | tiers: for example, when we bail out of Warp, we resume at | a specific bytecode offset in the baseline interpreter.
| | I'm not sure how you would resume at a particular point in | the execution of an AST-based interpreter without agreeing | on a linearized traversal of the AST, which is 90% of the | way to bytecode. | quelltext wrote: | Keep in mind that bytecode-interpreted languages (Ruby, | Python) are typically called interpreted languages. Java is | usually called "compiled" because of the explicit step vs. | Ruby and Python, but it's essentially the same. And typically, | discussions about interpretation on the JVM refer to bytecode | interpretation, while "compiling" means JIT compiling. | | Ultimately, limiting interpreters to AST interpreters is not | quite correct. The AST is just another IR that source code | needs to be converted to, just like bytecode. And the AST is | essentially also executed by a virtual machine. | Interpretation of the IR (the AST, or bytecode, etc.) is one | part of a VM. Of course, in some instances the VM-ness of a | runtime is more pronounced (Java) than in others (Ruby). | | The difference between interpretation and compilation is that | the latter produces output meant to run on real hardware, while | the former implies something that executes the IR of a program by | dynamically choosing which machine instructions to run. | | Of course, a compiler is also something that takes in some | code and outputs some other, typically lower-level, | representation. | | My point being, I don't think there is a strict or consistent | definition these days for what an interpreter is. | | Case in point: I've also heard people say interpreters read | the code "line by line" (or rather, more accurately, as far as | they need to read before they know what to execute next) and | execute it piece by piece. Which might be how some interpreters | worked in some languages in the past. An AST interpreter OTOH | already implies some whole-source-file pre-processing. Is an AST | interpreter then less of an interpreter than one that streams | the source code line by line? Is a program that does that more | of an interpreter than another which goes a step further and, | e.g., executes a simplified AST, or one that executes a | straightforward, trivial bytecode representation of the | simplified AST? | CodeArtisan wrote: | A virtual machine executes instructions while an interpreter | takes source code as input. An interpreter could be built on | top of a virtual machine, obviously, but not necessarily. For | example, SICP/PLAI/OOPLAI[1] explain how to implement an | interpreter on top of Scheme where you directly interpret the | s-expressions. These may be worth a read if you want to learn | about the usual concepts and techniques used in programming | language implementations from a more general point of view. | Like, for example, how to implement a static type system or an | object model based on classes. Interpreters based on a virtual | machine are actually compilers on the fly; the source code is | not interpreted but translated into instructions beforehand. | | [1] http://sarabander.github.io/sicp/ | | http://cs.brown.edu/~sk/Publications/Books/ProgLangs/2007-04... | | https://users.dcc.uchile.cl/~etanter/ooplai/ | iTokio wrote: | By the author of the excellent https://c9x.me/compile/ | | A lightweight alternative to LLVM: | https://c9x.me/compile/doc/llvm.html | dmitriid wrote: | I'm not an expert by far, but I feel like many people who write | compilers or develop languages no longer recommend The Dragon | Book, as it's very outdated.
| | Additionally, the vast majority of compiler books and | resources spend too much time on the easiest part of compiling: | parsing. It's not uncommon for a book or a course to dedicate | 60-70% of its time to parsing, leaving almost nothing for | everything else. | MaxBarraclough wrote: | Yep, here's a thread from two years ago on exactly that: | https://news.ycombinator.com/item?id=20436399 | rualca wrote: | > Additionally, the vast majority of compiler books and | resources spend too much time on the easiest part of compiling: | parsing. | | I disagree. Parsing is the single most important feature of | building a programming language. It influences how the language | can be structured and also how it should be structured for | performance reasons. | | Also, you're far more likely to write a parser for a language | than to implement anything remotely related to compilers and | interpreters. | | It's one thing to state that some topics might justify more | content, but it's absurd to downplay the fundamental importance | of parsers in putting together a compiler. | tom_mellior wrote: | The problem with the old old Dragon Book's treatment of | parsing is that it explains a particular impractical | algorithm (LR(1)? LR(k) maybe? it's been a while) in way too | much detail. You would need that detail if you were to | implement a parser generator (for that specific algorithm), | or a parser using that specific algorithm, which isn't | terribly well suited for generating useful error messages for | your users. | | Today's computers are ridiculously powerful, and the | perceived performance advantages of LR parsing are simply | irrelevant. Further, even if you were to implement an LR | parser, you would do it with the aid of a parser generator, | which hides the boring details and saves you from needing to | understand the LR algorithm itself. But more likely, when you | write a parser you use a parser generator with a different | approach, or you write a recursive descent parser by hand. In | either case, the dry parts of the old old Dragon Book are | completely irrelevant. | | The people criticizing the 1986 Dragon Book's treatment of | parsing aren't saying that parsing is irrelevant. They are | saying that parsing _that specific way_ is either irrelevant | or done for you by a tool that is not the interesting part of | a compiler. | jhgb wrote: | > I disagree. Parsing is the single most important feature of | building a programming language. It influences how the | language can be structured and also how it should be | structured for performance reasons. | | Wirth already settled that question (of both structuring the | language and performance) with recursive descent over thirty | years ago. The necessary treatment of all things parsing is | then covered in twenty pages of his Compiler Construction | text. | rualca wrote: | > Wirth already settled that question (of both structuring | the language and performance) with recursive descent over | thirty years ago. | | "Recursive descent" is a whole class of parsing algorithms. | Claiming that recursive descent settled the question is | akin to claiming that compiled programming languages settled | the question of how to develop programming languages.
| | Moreover, even if you go as far as believing that a | specific class of parsing algorithms settled anything, it's | absurd to argue that people who are learning how to write a | parser should not learn about them, especially as your | personal choice has fundamental limitations, such as left- | recursive rules, and very specific techniques are needed to | mitigate them. | | And finally, we still see articles being accepted into | academic journals on how to parse concrete languages such | as JSON. I'm sure we can agree that slapping together | something with a random parser generator is not something | that serves the students' best interests. | jhgb wrote: | It turns out that whatever limitations there are to that | approach hardly matter in practice, since Wirth developed | a wide variety of languages this way, and they work just | fine. | rualca wrote: | > It turns out that whatever limitations there are to | that approach hardly matter in practice (...) | | How do you know, if you do not learn about parsers? | | Do you see where I am going with this? | | It's absurd to claim that parsers are not fundamental and | central to developing compilers. That statement is patently | false, especially considering that some programming | languages are notoriously challenging to parse. | jhgb wrote: | > That statement is patently false, especially considering | that some programming languages are notoriously | challenging to parse. | | Sure, if you create an artificial problem in the first | place, you need an artificial solution for it. That's why | careful language design is even more important than | parsers. | HelloNurse wrote: | If parsing "influences how the language can be structured and | also how it should be structured for performance reasons", | it's probably going to be a mediocre language, in which self- | imposed constraints interfere with useful special features, | resulting in a bland design lacking competitive advantages. | | Pascal has already been cited in other comments as having | been designed for ease of parsing, but the Turbo Pascal | compiler is an isolated masterpiece, and high-quality | compilation at decent speed is normally more useful than | cheap compilation at ludicrous speed. | | Maybe parsing seems a large part of constructing a compiler | because it's naturally the first task and because it's likely | to be entangled with designing the syntax of the language, | which can pose difficulties (e.g. ambiguities, context | sensitivity, messy operator precedence). | dmitriid wrote: | > Parsing is the single most important feature of building a | programming language | | It really isn't. It's also the most trivial part of compiling | a programming language. | | > It influences how the language can be structured and also | how it should be structured for performance reasons. | | Once again, no, not really. However complex your language is, | it can be parsed at 100 GB/s on a Raspberry Pi. Even if you | do it with regexps. Everything that comes _after_ parsing is | very complex and is rarely explained in any detail. And | everything that comes _after_ actually _directly_ influences | the structure, the performance etc. | | It really makes zero difference if you decide that your | typing info should be written like this:

            T x = new T();

| or like this:

            let x: T = T()

| or like this:

            var x := T()

| All this is trivial to parse.
However, deciding on how you | implement typing, what needs to be in the language to | support your choice, etc., now _that_ definitely influences | the structure, performance, trade-offs etc. | | How much of that is in the Dragon Book? Next to zero? But | surely there are 4 chapters on how to parse this. | rualca wrote: | > It really isn't. It's also the most trivial part of | compiling a programming language. | | No, not really. I mean, think about it for a second: while | you cannot get a compiler out of the door without a parser, | you can indeed leave all the fancy optimization features | out, can't you? And if you want to get an MVP out of the | door, do you focus your effort on developing optional | features or on the basic, critical part that ensures that your | language exists? | | And then there's developer experience. Syntax/grammar | errors? You need the parser for that, don't you agree? | | > Once again, no, not really. However complex your language | is, it can be parsed at 100 GB/s on a Raspberry Pi. | | Actually, you're quite wrong. RapidJSON only handles a very | basic language, doesn't need to handle weird errors or output | meaningful error messages, and it boasts at most 1GB/s. | | And this is a parser optimized for speed. | | Perhaps if more people had a clue about compilers, these | sorts of errors and misconceptions wouldn't be so common. | estebank wrote: | > while you cannot get a compiler out of the door without | a parser, you can indeed leave all the fancy optimization | features out, can't you? And if you want to get an MVP out | of the door, do you focus your effort on developing | optional features or on the basic, critical part that | ensures that your language exists? | | > And then there's developer experience. Syntax/grammar | errors? You need the parser for that, don't you agree? | | These two statements are at odds with each other, as | having great developer experience is indeed a "fancy | optimization" that you shouldn't spend too much time on | when starting a project (and I say that as someone | spending their time every day thinking about making | compiler errors better), and part of the reason the | literature around parsers is problematic is that | graceful error recovery isn't explored in detail. | | > And this is a parser optimized for speed. | | Parse time is never the bottleneck. It is a waste of time | to focus on optimizing parse speed at the beginning. It | is worth it to eventually measure and tune it, and to | think about the implications when designing your grammar, | but beyond that the slowness of compilers is never driven | by their parsers. | | > you cannot get a compiler out of the door without a parser | | Technically, you can: you make your compiler something | that consumes an already existing bytecode (like the | JVM's or .NET's CLI), or only implement an AST parser that | isn't intended for humans to actually write, while you | nail down the specifics. | | > Syntax/grammar errors? You need the parser for that, | don't you agree? | | Error recovery in a parser is a tricky thing: it is | intimately tied to the choices you've made in your | grammar. The more redundancy and uniqueness between | scopes you can have, the easier it can be for your parser | to identify where things have gone badly. For example, in | Python if you write

            if foo = bar:
                print(foo)

| you have little information but can still infer that you | likely wanted to compare for equality using == instead.
But | because Python has a very flexible grammar and semantics, | the following is valid but likely not what was intended:

            class Foo:
                def bar(self):
                    print("hi!")
            foo = Foo()
            foo.bar = "hi!"

| If you make different, more restrictive decisions on both | grammar and semantics, you can still provide tools to | allow people to do this by being more verbose, but | people who do this by accident can be given excellent | diagnostics. | dmitriid wrote: | > I mean, think about it for a second: while you cannot | get a compiler out of the door without a parser, you can | indeed leave all the fancy optimization features out, | can't you? And if you want to get an MVP out of the door, | do you focus your effort on developing optional features | or on the basic, critical part that ensures that your | language exists? | | Exactly: _an MVP_. For an _MVP_ the parser may indeed be | "the single most important feature of building a | programming language". But even then it almost | definitely isn't. | | The moment you "leave out fancy optimisation features" | your language becomes at most mediocre. | | Because between parsing (which is trivial) and "fancy | optimisation features" (which come very late in a | compiler's life) there are a million things that are | significantly more important than parsing: | | - how do you do typing? | | - how do you do code generation (do you output machine code | yourself or use a backend)? | | - do you use an intermediate representation? | | - what are the actual optimisations on performance (none | of which have anything to do with parsing)? As an | example, C++'s horrible header system has nothing to do | with parsing, but slows compilation down by orders of | magnitude. | | - how do you do graceful error recovery and reporting? | | - how are lambdas handled? how are scopes handled? memory | layouts? GC/ARC/manual memory management? standard lib? | | - a million other things | | > Actually, you're quite wrong. | | It was hyperbole. Do you even understand how fast 1GB/s | is? It's roughly 10 to 20 million lines of code [1]. Windows | 10 is roughly 50 million lines [2]. So, at 1GB/s you can | parse the _entire Windows 10 codebase_ in under 5 | seconds. | | Even if your parsing speed is 1000 times slower (3 orders | of magnitude slower), you can still parse ~10k lines a | second. And I honestly can't even begin to think how you | could parse something this slowly. And even if you do, it's | still not a concern, because the rest comes from outside | parsing, for example: | | - can you run parsers in parallel? | | - do you need to parse the entirety of your project to | begin compilation, or are your files/modules isolated so the | compiler can work on any of them? | | Parsing is _trivial_. | | [1] For example, this SO answer for an unrelated question | creates a file with 150 million lines, and the result is | 6GB. 150/6 = 25, so I went for even lower numbers. | | [2] https://answers.microsoft.com/en-us/windows/forum/all/window... | tom_mellior wrote: | > I'm not an expert by far, but I feel like many people who | write compilers or develop languages no longer recommend The | Dragon Book as it's very outdated. | | Yes. There are newer editions that I can't judge, but the 1986 | edition linked from the page should nowadays only be read by | historians. If you want to implement a compiler, look | elsewhere, for something newer. | | This isn't to say that this was a bad book 35 years ago. It | wasn't. But it is simply outdated.
It does not contain things | that are important today. It has very little treatment of type | systems and type inference, for example. Its treatment of data | flow analysis is based on intervals, which have limitations | (though also some interesting theoretical advantages) and are | used by zero compilers written in this millennium (and are not even | described very well in the Dragon Book, IIRC). And then of | course there is the parsing problem, see my rant elsethread. | | As for the article, I think A Catalogue of Optimizing | Transformations (https://www.clear.rice.edu/comp512/Lectures/Papers/1971-alle...) | would make an interesting addition. | | Edit: Forgot to add that the old old Dragon Book doesn't cover | two topics that the article itself considers very relevant: SSA | form (which I believe was discovered just around that time) and | linear scan register allocation (which is much newer). | aarchi wrote: | What would you recommend that covers data flow analysis | better than the Dragon Book? I'm currently starting data flow | analysis for my own compiler using intervals and/or modular | congruences (operations can be reduced to congruences, then | combined with the Chinese Remainder Theorem). What are the | theoretical advantages of intervals and what do you recommend | over that? | tom_mellior wrote: | Ah, sorry, I should probably clarify that there are two | completely distinct uses of the term "interval" in data | flow analysis. | | One (introduced in https://amturing.acm.org/p137-allen.pdf | and discussed in the old Dragon Book) groups _program | points_ into nested "intervals", i.e., intervals are | pieces of code. Analysis information is then propagated | among these intervals. This is what I was referring to | above. As far as I know this approach is completely | obsolete by now, with everyone using methods based more or | less indirectly on Kildall's worklist iteration approach | (http://static.aminer.org/pdf/PDF/000/546/451/a_unified_appro...). | The theoretical advantage I was referring to above | was that the interval approach is AFAIR sometimes more | efficient because it can take program structure (loop | nesting) into account more directly. | | The _other_ meaning of "intervals" is used for an | _analysis domain_, i.e., the kind of data you want your | analysis to compute. For example, you might determine that | the possible values of some variable x are in the interval | [10, 20], the values of y are in the interval [5, 15], and | then you can compute that the values of z = x + y must be | in [15, 35]. There is nothing wrong with this approach, and | any optimizing compiler is pretty much forced to use | something like this. Based on your reference to modular | congruences I think you probably want to do analysis of | numerical values and meant this sense, which is anyway the | only one in common use nowadays. Go ahead and do this. | Extensions combine this with congruences, like "x is in [10, 20] and | also congruent to 1 modulo 4", and/or "known bits", like "x | is in [10, 20], and I know that (counting from the LSB) its | bit 2 is 0 and bit 3 is 1". | | Personally I learned about data flow analysis in a | university course based on Principles of Program Analysis: | https://www.springer.com/gp/book/9783540654100. It's a book | that goes into a lot of mathematical depth but not much | breadth of analyses, and it doesn't handle data flow | analysis on SSA form.
Then I learned additional details | from various well-known papers like the ones linked above, | and the classical early papers around analysis on SSA form, | like https://www.cse.wustl.edu/~cytron/531Pages/f11/Resources/Pap.... | I'm sure there must be a more straightforward | text to get you there, but I don't think I can give a | definite recommendation. I have vague memories of the | coverage in Muchnick's "Advanced Compiler Design and | Implementation" and/or Cooper and Torczon's "Engineering a | Compiler" being reasonable, but I can't find my copies | right now. | Rochus wrote: | > no longer recommend The Dragon Book as it's very outdated | | Only if they didn't realize there is a second edition from | 2006. | testrvb wrote: | test ___________________________________________________________________ (page generated 2021-04-24 23:00 UTC)