[HN Gopher] Writing a C compiler in 500 lines of Python
       ___________________________________________________________________
        
       Writing a C compiler in 500 lines of Python
        
       Author : vgel
       Score  : 170 points
       Date   : 2023-09-04 19:17 UTC (3 hours ago)
        
 (HTM) web link (vgel.me)
 (TXT) w3m dump (vgel.me)
        
       | brundolf wrote:
       | > Instead, we'll be single-pass: code generation happens during
       | parsing
       | 
       | IIRC, C was specifically designed to allow single-pass
       | compilation, right? I.e. in many languages you don't know what
       | needs to be output without parsing the full AST, but in C, syntax
       | directly implies semantics. I think I remember hearing this was
       | because early computers couldn't necessarily fit the AST for an
       | entire code file in memory at once
        
         | WalterBright wrote:
         | You're exactly right. This makes for a small, memory-efficient
         | compiler. But this entails a lot of compromises that we're not
         | willing to put up with anymore, because there's no longer a
         | reason to.
        
         | frutiger wrote:
         | I once read that this is why the MSVC compiler didn't support
         | two-pass template instantiation until very recently: the
         | original compiler implemented templates almost like a macro
         | that re-emitted a stream of tokens with the template parameters
         | replaced.
        
         | speps wrote:
         | Linked from another thread: http://cm.bell-
         | labs.co/who/dmr/chist.html
         | 
         | It explains the memory limits and what happened :)
         | 
         | > After the TMG version of B was working, Thompson rewrote B in
         | itself (a bootstrapping step). During development, he
         | continually struggled against memory limitations: each language
         | addition inflated the compiler so it could barely fit, but each
         | rewrite taking advantage of the feature reduced its size. For
         | example, B introduced generalized assignment operators, using
         | x=+y to add y to x. The notation came from Algol 68
         | [Wijngaarden 75] via McIlroy, who had incorporated it into his
         | version of TMG. (In B and early C, the operator was spelled =+
         | instead of += ; this mistake, repaired in 1976, was induced by
         | a seductively easy way of handling the first form in B's
         | lexical analyzer.)
        
           | bee_rider wrote:
           | I wonder why =+ is so obviously a mistake. It does look
           | vaguely wrong for some reason, but I'm prejudiced by current
           | languages.
        
             | vgel wrote:
             | I think because it's ambiguous with unary plus (a = +b),
             | since C isn't supposed to have significant whitespace in
             | most circumstances.
        
               | tom_ wrote:
               | You also run into problems with a=*p and a=-b, which are
               | perhaps more likely.
        
               | zeusk wrote:
               | But they could've fixed that by going a=(*p) and a=(-b);
               | 
               | Kind of how we use (*p)->next instead of *p->next where p
               | is node_t**
        
         | vgel wrote:
         | I'm not sure, haven't looked at the codebases of old compilers
         | in a long time. Definitely a lot of the language is pretty
         | amenable to it, especially if you have unstructured jumps for
         | e.g. the for advancement statement. I had a distinct feeling
         | while writing the compiler every time I added a new feature
         | that "wow, the semantics work exactly how I'd like them to for
         | ease of implementation."
         | 
         | Compare that to, say, Rust, which would be pretty painful to
         | single-pass compile with all the non-local behavior around
         | traits.
        
           | [deleted]
        
         | zabzonk wrote:
         | using recursive descent, you don't need to build an ast
        
           | jjtheblunt wrote:
           | the call stack during recursive descent is an ephemeral ast,
           | for the recursive descent parsers I've written.
        
       | WalterBright wrote:
       | This looks a lot like the Tiny Pascal compiler that BYTE
       | published a listing of back in 1978.
       | 
       | http://www.trs-80.org/tiny-pascal/
       | 
       | I figured out the basics of how a compiler works by going through
       | it line by line.
        
         | vgel wrote:
         | Oh, that's neat (funny that they skipped out on similar things
         | to me, like GOTO and structs :-)
         | 
         | I didn't see a link to the source in the article, but this
         | seems to be it: https://sourceforge.net/p/tiny-
         | pascal/code/HEAD/tree/NorthSt...
        
         | dugmartin wrote:
         | I think Borland's Turbo Pascal was also a single pass compiler
         | that emitted machine code as COM files.
        
           | kwhitefoot wrote:
           | Surely it is a feature of all Pascal compilers that they are
           | single pass. I thought that it was part of the specification
           | of the language that it be possible to compile in a single
           | pass.
        
             | [deleted]
        
       | marcodiego wrote:
       | It is interesting to think that 500 lines of code is something
       | one can write in one or two days. But, writing a C compiler in
       | 500 of comprehensible code (even in python) is challenge in
       | itself that may take months after a few years of solid learning.
       | 
       | I wonder if is this a good path to becoming an extremely
       | productive developer. If some one spends time developing projects
       | like this, but for different areas... A kernel, a compressor,
       | renderer, multimedia/network stack, IA/ML... Will that turn a
       | good dev into a 0.1 Bellard?
        
         | rewmie wrote:
         | > But, writing a C compiler in 500 of comprehensible code (even
         | in python) is challenge in itself that may take months after a
         | few years of solid learning.
         | 
         | The people behind this project avoided that caveat by simply
         | not implementing C. Apparently they kept a bit of the syntax
         | but then proceeded to cherry-pick features that suited them and
         | not make.an effort to even try to comply with any version of
         | the standard.
        
           | fanf2 wrote:
           | It's a little bit bigger than Small C
           | https://en.m.wikipedia.org/wiki/Small-C
        
         | pitherpather wrote:
         | > 0.1 Bellard
         | 
         | Off topic, but a log scale might be useful: 0.1 Bellard --> -10
         | deciBellards. That allows for: 0.001 Bellard --> -30
         | deciBellards.
         | 
         | Problem: Programmers with negative productivity cannot be
         | represented on the same log scale.
        
           | analog31 wrote:
           | >>> Problem: Programmers with negative productivity cannot be
           | represented on the same log scale.
           | 
           | This is similar to the problem of price-to-earnings ratio.
           | The ratio goes asymptotic as earnings goes through zero. It
           | would be better to quote earnings-to-price ratio. Another
           | screwy reciprocal unit is miles per gallon for cars.
        
             | hedora wrote:
             | Now that I have an EV with 130 miles of range, I'm finding
             | miles per kwh to be much more useful than efficiency.
             | 
             | I'd bet range anxiety was also a thing for early gasoline
             | powered cars, so the early adopters of those probably
             | preferred mpg over gpm.
        
           | araes wrote:
           | Sure they can. Abs(). Or if you prefer to not have quite so
           | much micro-nano, you could also use Cumulative Distribution
           | Functions [1] which are basically just sums of Probability
           | Density [2].
           | 
           | Are they a 4s programmer, 1s programmer, -0.5s programmer,
           | -2s programmer?
           | 
           | Plus, most people are "average" not negative productivity,
           | and CDFs let you use really fun stuff like Beta Distributions
           | (variable, shapable distributions) and Gamma Distributions
           | (exponential distributions). They're super sweet as far as
           | probability statistics.
           | 
           | [1] https://en.wikipedia.org/wiki/Cumulative_distribution_fun
           | cti...
           | 
           | [2]
           | https://en.wikipedia.org/wiki/Probability_density_function
           | 
           | [3] https://en.wikipedia.org/wiki/Beta_distribution
           | 
           | [4] https://en.wikipedia.org/wiki/Gamma_distribution
        
             | hedora wrote:
             | The number of people that are average is infinitesimal,
             | even if you define average as "mode" instead of median or
             | mean.
             | 
             | About half are better than median though.
        
         | [deleted]
        
         | sciolist wrote:
         | It does remind me of a project [1] Andrej Karpathy did, writing
         | a neural network and training code in ~600 lines (although
         | networks have easier logic to code than a compiler).
         | 
         | [1] https://github.com/karpathy/nanoGPT
        
           | pama wrote:
           | This is an implementation of GPT using the pytorch library.
           | It is not meant to be the shortest implementation of a
           | trainable GPT, however it is very clean code. Pytorch does a
           | lot of the heavy lifting, especially when it comes to
           | training on multiple GPU. This implementation only works with
           | data distributed parallel training, so one could not train
           | models of the size of GPT-4 with it out of the box.
        
             | tavianator wrote:
             | Perhaps they were thinking of
             | https://github.com/karpathy/micrograd
        
         | Barrin92 wrote:
         | at the very least it'll remove a lot of 'magic' from
         | programming. Today a lot of people seem to be not so fond of
         | university education but I'm personally very glad it made me go
         | through implementing a shell, a compiler, a little toy kernel
         | and so on.
         | 
         | The feeling that you write code somewhere in the skies and have
         | no idea how something works underneath has always really bugged
         | me when I've used something.
        
           | chaxor wrote:
           | You don't need a university education to do those things,
           | just some curiosity.
           | 
           | The function of the university in the near future will
           | probably just be to have like-minded curious people to
           | discuss ideas with, and to get a better grasp of what
           | problems need to be solved (specifically scientific ideas,
           | rather than just applying engineering).
           | 
           | The prestige element (specifically of certain universities
           | over others, perhaps not university over high school) is
           | dwindling, and hopefully will be abolished with this new
           | generation.
        
             | jabits wrote:
             | A university degree is much more than this, and I think
             | most people who view its value as "dwindling" have not had
             | the experience...
        
               | solumunus wrote:
               | It massively depends on what degree and where I suppose.
               | Most people I know with degrees view it as a complete
               | waste of money.
        
             | convolvatron wrote:
             | I'm largely taught outside the academic world. so I
             | sympathize with your position.
             | 
             | however, the engineering culture which took the time to
             | tell me about all these cool things and let me grow into
             | being an expert in them seems to be largely gone.
        
       | mati365 wrote:
       | I made similar project in TypeScript[1]. Basically multipass
       | compiler that generates x86 assembly, compiles it to binary and
       | runs it. The worst thing were register allocator, designing IR
       | code and assembler.
       | 
       | [1] https://github.com/Mati365/ts-c-compiler
        
         | amedvednikov wrote:
         | Nice project!
        
         | vgel wrote:
         | Ooh, this is cool! Using WASM let me avoid writing a register
         | allocator (though I probably would have just used the stack if
         | I had targeted x86/ARM since I wasn't going for speed).
        
       | kaycebasques wrote:
       | Is there a C compiler written in Python that aims for maximum
       | readability rather than trying to get as much done under X lines
       | of code?
        
         | muth02446 wrote:
         | Not quite a C compiler but arguably better:
         | 
         | http://cwerg.org
        
         | vgel wrote:
         | I think the code is fairly readable! It's formatted with Black
         | (and therefore limited to reasonable line lengths) and well-
         | commented.
         | 
         | IMO, being under X lines of code is _part of_ the readability--
         | 10,000 lines of code is hard to approach no matter how readable
         | it otherwise is.
        
       | rhabarba wrote:
       | Finally, one can have inefficient C.
        
         | wiseowise wrote:
         | Why would language choice of compiler make any difference for
         | efficiency of final output?
        
           | vgel wrote:
           | Maybe not the language choice, but the codegen of this
           | compiler is _terrible_ because of the single-pass shortcuts
           | (for example, it unconditionally loads the result of all
           | assignment operations back to the stack _just in case_ you
           | want to write `a = b = 1`, even though 99% of the time that
           | load is immediately thrown away.)
        
           | [deleted]
        
           | NeuroCoder wrote:
           | They didn't say the language was the issue. It doesn't
           | support the full C spec. But if you want a reason why
           | language might be an issue for a compiler, it could make
           | compilation time slower. But I think the point of this
           | project is not real world use but fun demonstration of skill
        
         | folmar wrote:
         | Always remember _bashcc_.
        
         | [deleted]
        
         | MaxBarraclough wrote:
         | There's always the _CINT_ interpreter for C and C++.
         | 
         | https://root.cern.ch/root/html534/guides/users-guide/CINT.ht...
        
           | brnt wrote:
           | A PTSD trigger for me. Only half joking. Funny thing is, I
           | never checked out Cling to see if it was at long last the
           | real deal.
        
       | nn3 wrote:
       | Just for comparison the LOCs for some other small C or C like
       | compilers. It's not that far away from Ritchie's
       | 
       | C4x86 | 0.6K (very close)
       | 
       | small C (x86) | 3.1K
       | 
       | Ritchie's earliest struct compiler | 2.3K
       | 
       | v7 Unix C compiler | 10.2K
       | 
       | chibicc | 8.4K
       | 
       | Biederman's romcc | 25.0K
        
         | userbinator wrote:
         | This one is certainly stretching the definition of "C like",
         | but it's just under 512 _bytes_ :
         | https://news.ycombinator.com/item?id=36064971
        
         | vgel wrote:
         | Oh, C4 is neat--technically it has me beat since it also
         | implements the VM to _run_ the code--though their formatting
         | definitely takes advantage of long lines :-)
        
         | [deleted]
        
       | tptacek wrote:
       | A time-honored approach!
       | 
       | https://www.blackhat.com/presentations/win-usa-04/bh-win-04-...
       | 
       | (minus directly emitting opcodes, and fitting into 500 lines, of
       | course.)
        
       | fan_of_yoinked wrote:
       | I love the graphic - would go see the worlds largest chomsky
        
       | rcarmo wrote:
       | I have to wonder if there's a Scheme to WASM compiler out there
       | someplace right now I haven't found yet.
        
         | vgel wrote:
         | Looks like Schism (https://github.com/schism-lang/schism) got
         | part of the way there, but it unfortunately seems to be dead.
        
       | teddyh wrote:
       | For some value of "C":
       | 
       | > _Notably, it doesn 't support:_
       | 
       | > _structs :-( would be possible with more code, the fundamentals
       | were there, I just couldn 't squeeze it in_
       | 
       | > _enums / unions_
       | 
       | > _preprocessor directives (this would probably be 500 lines by
       | itself...)_
       | 
       | > _floating point. would also be possible, the wasm_type stuff is
       | in, again just couldn 't squeeze it in_
       | 
       | > _8 byte types (long /long long or double)_
       | 
       | > _some other small things like pre /post cremements, in-place
       | initialization, etc., which just didn't quite fit any sort of
       | standard library or i/o that isn't returning an integer from
       | main()_
       | 
       | > _casting expressions_
        
         | [deleted]
        
         | vgel wrote:
         | Well, I set the 500 line budget up front, and that was really
         | as much as I could fit with reasonable formatting. I'll be
         | excited to see your 500 line C compiler supporting all those
         | features once it's done ;-)
        
         | spease wrote:
         | C--23
         | 
         | (Respect to the author for doing this, I just couldn't resist
         | the obvious joke)
        
           | vgel wrote:
           | I actually almost made it a C--
           | (https://www.cs.tufts.edu/~nr/c--/download/ppdp.pdf)
           | compiler, but IIRC the `goto`s made me go with the regular C
           | subset instead.
        
       ___________________________________________________________________
       (page generated 2023-09-04 23:00 UTC)