[HN Gopher] Writing a C compiler in 500 lines of Python ___________________________________________________________________ Writing a C compiler in 500 lines of Python Author : vgel Score : 170 points Date : 2023-09-04 19:17 UTC (3 hours ago) (HTM) web link (vgel.me) (TXT) w3m dump (vgel.me) | brundolf wrote: | > Instead, we'll be single-pass: code generation happens during | parsing | | IIRC, C was specifically designed to allow single-pass | compilation, right? I.e. in many languages you don't know what | needs to be output without parsing the full AST, but in C, syntax | directly implies semantics. I think I remember hearing this was | because early computers couldn't necessarily fit the AST for an | entire code file in memory at once | WalterBright wrote: | You're exactly right. This makes for a small, memory-efficient | compiler. But this entails a lot of compromises that we're not | willing to put up with anymore, because there's no longer a | reason to. | frutiger wrote: | I once read that this is why the MSVC compiler didn't support | two-pass template instantiation until very recently: the | original compiler implemented templates almost like a macro | that re-emitted a stream of tokens with the template parameters | replaced. | speps wrote: | Linked from another thread: http://cm.bell- | labs.co/who/dmr/chist.html | | It explains the memory limits and what happened :) | | > After the TMG version of B was working, Thompson rewrote B in | itself (a bootstrapping step). During development, he | continually struggled against memory limitations: each language | addition inflated the compiler so it could barely fit, but each | rewrite taking advantage of the feature reduced its size. For | example, B introduced generalized assignment operators, using | x=+y to add y to x. The notation came from Algol 68 | [Wijngaarden 75] via McIlroy, who had incorporated it into his | version of TMG. (In B and early C, the operator was spelled =+ | instead of += ; this mistake, repaired in 1976, was induced by | a seductively easy way of handling the first form in B's | lexical analyzer.) | bee_rider wrote: | I wonder why =+ is so obviously a mistake. It does look | vaguely wrong for some reason, but I'm prejudiced by current | languages. | vgel wrote: | I think because it's ambiguous with unary plus (a = +b), | since C isn't supposed to have significant whitespace in | most circumstances. | tom_ wrote: | You also run into problems with a=*p and a=-b, which are | perhaps more likely. | zeusk wrote: | But they could've fixed that by going a=(*p) and a=(-b); | | Kind of how we use (*p)->next instead of *p->next where p | is node_t** | vgel wrote: | I'm not sure, haven't looked at the codebases of old compilers | in a long time. Definitely a lot of the language is pretty | amenable to it, especially if you have unstructured jumps for | e.g. the for advancement statement. I had a distinct feeling | while writing the compiler every time I added a new feature | that "wow, the semantics work exactly how I'd like them to for | ease of implementation." | | Compare that to, say, Rust, which would be pretty painful to | single-pass compile with all the non-local behavior around | traits. | [deleted] | zabzonk wrote: | using recursive descent, you don't need to build an ast | jjtheblunt wrote: | the call stack during recursive descent is an ephemeral ast, | for the recursive descent parsers I've written. | WalterBright wrote: | This looks a lot like the Tiny Pascal compiler that BYTE | published a listing of back in 1978. | | http://www.trs-80.org/tiny-pascal/ | | I figured out the basics of how a compiler works by going through | it line by line. | vgel wrote: | Oh, that's neat (funny that they skipped out on similar things | to me, like GOTO and structs :-) | | I didn't see a link to the source in the article, but this | seems to be it: https://sourceforge.net/p/tiny- | pascal/code/HEAD/tree/NorthSt... | dugmartin wrote: | I think Borland's Turbo Pascal was also a single pass compiler | that emitted machine code as COM files. | kwhitefoot wrote: | Surely it is a feature of all Pascal compilers that they are | single pass. I thought that it was part of the specification | of the language that it be possible to compile in a single | pass. | [deleted] | marcodiego wrote: | It is interesting to think that 500 lines of code is something | one can write in one or two days. But, writing a C compiler in | 500 of comprehensible code (even in python) is challenge in | itself that may take months after a few years of solid learning. | | I wonder if is this a good path to becoming an extremely | productive developer. If some one spends time developing projects | like this, but for different areas... A kernel, a compressor, | renderer, multimedia/network stack, IA/ML... Will that turn a | good dev into a 0.1 Bellard? | rewmie wrote: | > But, writing a C compiler in 500 of comprehensible code (even | in python) is challenge in itself that may take months after a | few years of solid learning. | | The people behind this project avoided that caveat by simply | not implementing C. Apparently they kept a bit of the syntax | but then proceeded to cherry-pick features that suited them and | not make.an effort to even try to comply with any version of | the standard. | fanf2 wrote: | It's a little bit bigger than Small C | https://en.m.wikipedia.org/wiki/Small-C | pitherpather wrote: | > 0.1 Bellard | | Off topic, but a log scale might be useful: 0.1 Bellard --> -10 | deciBellards. That allows for: 0.001 Bellard --> -30 | deciBellards. | | Problem: Programmers with negative productivity cannot be | represented on the same log scale. | analog31 wrote: | >>> Problem: Programmers with negative productivity cannot be | represented on the same log scale. | | This is similar to the problem of price-to-earnings ratio. | The ratio goes asymptotic as earnings goes through zero. It | would be better to quote earnings-to-price ratio. Another | screwy reciprocal unit is miles per gallon for cars. | hedora wrote: | Now that I have an EV with 130 miles of range, I'm finding | miles per kwh to be much more useful than efficiency. | | I'd bet range anxiety was also a thing for early gasoline | powered cars, so the early adopters of those probably | preferred mpg over gpm. | araes wrote: | Sure they can. Abs(). Or if you prefer to not have quite so | much micro-nano, you could also use Cumulative Distribution | Functions [1] which are basically just sums of Probability | Density [2]. | | Are they a 4s programmer, 1s programmer, -0.5s programmer, | -2s programmer? | | Plus, most people are "average" not negative productivity, | and CDFs let you use really fun stuff like Beta Distributions | (variable, shapable distributions) and Gamma Distributions | (exponential distributions). They're super sweet as far as | probability statistics. | | [1] https://en.wikipedia.org/wiki/Cumulative_distribution_fun | cti... | | [2] | https://en.wikipedia.org/wiki/Probability_density_function | | [3] https://en.wikipedia.org/wiki/Beta_distribution | | [4] https://en.wikipedia.org/wiki/Gamma_distribution | hedora wrote: | The number of people that are average is infinitesimal, | even if you define average as "mode" instead of median or | mean. | | About half are better than median though. | [deleted] | sciolist wrote: | It does remind me of a project [1] Andrej Karpathy did, writing | a neural network and training code in ~600 lines (although | networks have easier logic to code than a compiler). | | [1] https://github.com/karpathy/nanoGPT | pama wrote: | This is an implementation of GPT using the pytorch library. | It is not meant to be the shortest implementation of a | trainable GPT, however it is very clean code. Pytorch does a | lot of the heavy lifting, especially when it comes to | training on multiple GPU. This implementation only works with | data distributed parallel training, so one could not train | models of the size of GPT-4 with it out of the box. | tavianator wrote: | Perhaps they were thinking of | https://github.com/karpathy/micrograd | Barrin92 wrote: | at the very least it'll remove a lot of 'magic' from | programming. Today a lot of people seem to be not so fond of | university education but I'm personally very glad it made me go | through implementing a shell, a compiler, a little toy kernel | and so on. | | The feeling that you write code somewhere in the skies and have | no idea how something works underneath has always really bugged | me when I've used something. | chaxor wrote: | You don't need a university education to do those things, | just some curiosity. | | The function of the university in the near future will | probably just be to have like-minded curious people to | discuss ideas with, and to get a better grasp of what | problems need to be solved (specifically scientific ideas, | rather than just applying engineering). | | The prestige element (specifically of certain universities | over others, perhaps not university over high school) is | dwindling, and hopefully will be abolished with this new | generation. | jabits wrote: | A university degree is much more than this, and I think | most people who view its value as "dwindling" have not had | the experience... | solumunus wrote: | It massively depends on what degree and where I suppose. | Most people I know with degrees view it as a complete | waste of money. | convolvatron wrote: | I'm largely taught outside the academic world. so I | sympathize with your position. | | however, the engineering culture which took the time to | tell me about all these cool things and let me grow into | being an expert in them seems to be largely gone. | mati365 wrote: | I made similar project in TypeScript[1]. Basically multipass | compiler that generates x86 assembly, compiles it to binary and | runs it. The worst thing were register allocator, designing IR | code and assembler. | | [1] https://github.com/Mati365/ts-c-compiler | amedvednikov wrote: | Nice project! | vgel wrote: | Ooh, this is cool! Using WASM let me avoid writing a register | allocator (though I probably would have just used the stack if | I had targeted x86/ARM since I wasn't going for speed). | kaycebasques wrote: | Is there a C compiler written in Python that aims for maximum | readability rather than trying to get as much done under X lines | of code? | muth02446 wrote: | Not quite a C compiler but arguably better: | | http://cwerg.org | vgel wrote: | I think the code is fairly readable! It's formatted with Black | (and therefore limited to reasonable line lengths) and well- | commented. | | IMO, being under X lines of code is _part of_ the readability-- | 10,000 lines of code is hard to approach no matter how readable | it otherwise is. | rhabarba wrote: | Finally, one can have inefficient C. | wiseowise wrote: | Why would language choice of compiler make any difference for | efficiency of final output? | vgel wrote: | Maybe not the language choice, but the codegen of this | compiler is _terrible_ because of the single-pass shortcuts | (for example, it unconditionally loads the result of all | assignment operations back to the stack _just in case_ you | want to write `a = b = 1`, even though 99% of the time that | load is immediately thrown away.) | [deleted] | NeuroCoder wrote: | They didn't say the language was the issue. It doesn't | support the full C spec. But if you want a reason why | language might be an issue for a compiler, it could make | compilation time slower. But I think the point of this | project is not real world use but fun demonstration of skill | folmar wrote: | Always remember _bashcc_. | [deleted] | MaxBarraclough wrote: | There's always the _CINT_ interpreter for C and C++. | | https://root.cern.ch/root/html534/guides/users-guide/CINT.ht... | brnt wrote: | A PTSD trigger for me. Only half joking. Funny thing is, I | never checked out Cling to see if it was at long last the | real deal. | nn3 wrote: | Just for comparison the LOCs for some other small C or C like | compilers. It's not that far away from Ritchie's | | C4x86 | 0.6K (very close) | | small C (x86) | 3.1K | | Ritchie's earliest struct compiler | 2.3K | | v7 Unix C compiler | 10.2K | | chibicc | 8.4K | | Biederman's romcc | 25.0K | userbinator wrote: | This one is certainly stretching the definition of "C like", | but it's just under 512 _bytes_ : | https://news.ycombinator.com/item?id=36064971 | vgel wrote: | Oh, C4 is neat--technically it has me beat since it also | implements the VM to _run_ the code--though their formatting | definitely takes advantage of long lines :-) | [deleted] | tptacek wrote: | A time-honored approach! | | https://www.blackhat.com/presentations/win-usa-04/bh-win-04-... | | (minus directly emitting opcodes, and fitting into 500 lines, of | course.) | fan_of_yoinked wrote: | I love the graphic - would go see the worlds largest chomsky | rcarmo wrote: | I have to wonder if there's a Scheme to WASM compiler out there | someplace right now I haven't found yet. | vgel wrote: | Looks like Schism (https://github.com/schism-lang/schism) got | part of the way there, but it unfortunately seems to be dead. | teddyh wrote: | For some value of "C": | | > _Notably, it doesn 't support:_ | | > _structs :-( would be possible with more code, the fundamentals | were there, I just couldn 't squeeze it in_ | | > _enums / unions_ | | > _preprocessor directives (this would probably be 500 lines by | itself...)_ | | > _floating point. would also be possible, the wasm_type stuff is | in, again just couldn 't squeeze it in_ | | > _8 byte types (long /long long or double)_ | | > _some other small things like pre /post cremements, in-place | initialization, etc., which just didn't quite fit any sort of | standard library or i/o that isn't returning an integer from | main()_ | | > _casting expressions_ | [deleted] | vgel wrote: | Well, I set the 500 line budget up front, and that was really | as much as I could fit with reasonable formatting. I'll be | excited to see your 500 line C compiler supporting all those | features once it's done ;-) | spease wrote: | C--23 | | (Respect to the author for doing this, I just couldn't resist | the obvious joke) | vgel wrote: | I actually almost made it a C-- | (https://www.cs.tufts.edu/~nr/c--/download/ppdp.pdf) | compiler, but IIRC the `goto`s made me go with the regular C | subset instead. ___________________________________________________________________ (page generated 2023-09-04 23:00 UTC)