[HN Gopher] Show HN: A work-in-progress C compiler from scratch
       ___________________________________________________________________
        
       Show HN: A work-in-progress C compiler from scratch
        
       Author : r1chardnl
       Score  : 131 points
       Date   : 2021-08-03 12:50 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | jjice wrote:
       | This is just an anecdote, but I had to write a SQL DDL and DML
       | parser during my last semester. I wrote the DDL parser by hand,
       | and it wasn't as bad as I expected, but it was time consuming. I
       | managed to convince the professor to give us the option of using
       | a parser generator for the next phase (DML) since the point of
       | the class wasn't parsing context free grammars and more focused
       | on executing the SQL.
       | 
       | I used Flex and Bison since the project was in C. Getting up and
       | running and understanding how the tools have to be set up with
       | different options was a bit tricky, but after that, my parser was
       | up and running in about two hours, compared to probably four
       | times that for the hand written DDL. Our DML subset was also much
       | larger and more complex than our DDL, so I was very happy with
       | the development speed increase.
       | 
       | I had this idea that using a parser generator was slow and
       | wasteful since many modern tutorials online write them by hand
       | and speak against parser generators (possibly because there isn't
       | a catch all for all languages). Turns out dev speed is way more
       | important to me up front, because in the case that I notice
       | parsing speed actually being an issue I should be happy that my
       | MVP has gotten enough use.
       | 
       | It's also nice because a lexer and parser can be pretty easily
       | black-boxed and swapped out for your hand written, just keep the
       | AST and API the same and you should be good.
       | 
       | All that said, that's personal preferences and writing the parser
       | by hand is definitely good experience and more extensible,
       | especially for error handling. Nice work!
        
         | brundolf wrote:
         | To add an anecdote, I've gotten into a hobby of writing parsers
         | and parser-related things, and I've developed a pattern and a
         | handful of utility functions (which are small enough to be re-
         | written from memory) that together make hand-writing a pretty
         | breezy task. Just the other week at work I hand-wrote a GraphQL
         | schema parser from scratch in about 2 hours using this
         | technique (GraphQL's schemas have a relatively simple syntax,
         | but I still think it's a decent data point)
        
           | qorrect wrote:
           | Is it hosted somewhere we can see it?
        
             | brundolf wrote:
             | It's not, though I've thought about open-sourcing or doing
             | a write-up on my pattern, but I've just assumed there are
             | tons of libraries out there for this that are probably
             | better than my homegrown version. Could still make an
             | interesting blog post though, I suppose
        
         | WalterBright wrote:
         | > dev speed is way more important to me up front
         | 
         | Having written many parsers, I can attest that the time spent
         | writing and debugging the parser is about 0.0000001% of the
         | time you'll invest in the compiler.
         | 
         | Heck, I spent more time trying to figure out how to integrate
         | the tag name space from ImportC into the host D compiler's
         | symbol table mechanism, than I did writing the C parser.
        
         | dboreham wrote:
         | Your post came through a wormhole from 1979.
        
       | SavantIdiot wrote:
       | Every time I see a compiler, the first thing a zero in on is the
       | parser.
       | 
       | I don't mean this in a bad way, but that tokenizer looks a bit
       | ... simple. ;-)
       | 
       | I say that because ~26 years ago I had a go at writing a C
       | preprocessor with the goal of creating a ToC and Index for my C
       | projects. I sat down with a BNF of C and lex/yacc and got ground
       | up in the details, then found some software that did exactly what
       | I wanted and moved on. I just remember a few parts of the C
       | syntax that don't fit nicely into BNF and hadn't studied parsers
       | enough to dig myself out of the hole.
       | 
       | Good luck doing it manually, that's gonna be tricky as heck.
        
         | fao_ wrote:
         | Parsing it manually is actually the preferred choice for most
         | compilers, both GCC and Clang do it manually, for example.
        
           | SavantIdiot wrote:
           | Really? That's probably why I failed! Ironically (given my
           | post) I've never looked at Clang's parser. I have looked at
           | GCC's and, well, sheepishly: I don't get it, it is a beast of
           | a project and I wasn't able to crack the architecture. I
           | should try again tho.
        
             | rot13xor wrote:
             | C is almost context-free with the exception of typedef. A
             | custom parser helps with error diagnostics, e.g. missed
             | semicolon or a "." swapped with "->".
             | 
             | https://eli.thegreenplace.net/2007/11/24/the-context-
             | sensiti...
        
       | gwbas1c wrote:
       | What are your goals? Is this a learning project? Are you trying
       | to experiment with some kind of compiler architecture?
        
         | r1chardnl wrote:
         | Learning, hobby and fun.
         | 
         | Aspiration to create a language that I would love to use some
         | day.
        
           | d6ba56c039d9 wrote:
           | Here's an assignment for you.
           | 
           | Have your compiler generate high quality assembly language
           | for a DSP.
        
       | bruce343434 wrote:
       | Good luck with your project and language!
        
         | r1chardnl wrote:
         | Thanks, seeing it unfold whilst fixing and adding new features
         | and compiling your program into a executable is really cool.
         | 
         | Motivation is key though.
        
       | thamer wrote:
       | Just a few small suggestions:
       | 
       | 1. Some code seems to be duplicated (although with slightly
       | different formatting) across multiple files, like dd/dw/db in
       | elf.c and x86.c for example. I'd suggest consolidating those.
       | 
       | 2. It helps to define types when you have variables that can take
       | a limited set of possible values; typedef works but enums are
       | better. I'm thinking of variables like the `type` parameter to
       | `primitive_data_type_size`, for example. The compiler can help
       | you detect a missing case statement if you use an enum, but not
       | an int.
       | 
       | 3. It's a matter of preference, but typedef'ing your structs can
       | make your code more concise and not have it littered with struct
       | keywords.
       | 
       | 4. You seem to have a mix of tabs and spaces in some files (e.g.
       | in `elf.c`). I would recommend configuring your editor to use
       | just one of the two or it'll start to be an issue as your code
       | base grows.
       | 
       | 5. On a related point, I'd suggest picking a formatting style and
       | sticking to it. You have functions like `accept` in `ast.c`
       | declared without spaces after the opening `(` and others like
       | `read_file` in the same file that has spaces.
       | 
       | Good luck! This is a great way to learn.
        
       | MisterTea wrote:
       | > _Programming language that compiles into a x86 ELF executable._
       | 
       | > _a compiler somewhat resembling C_
       | 
       | > _Add new or borrow from other language(s) features ontop of C,
       | deviate away from just C_                 function strlen(const
       | char *s)
       | 
       | This does not appear to be a C compiler. The title should reflect
       | this.
        
         | r1chardnl wrote:
         | Can't edit my original reply anymore, but fair point, I've
         | changed the keyword 'function' in the AST to a function return
         | type instead. There's no error handling done yet to check
         | whether the variable receiving the function return data matches
         | the function return type, but it's on my todo list.
         | 
         | https://github.com/riicchhaarrd/ocean/commit/0618e0810c8d437...
         | 
         | Then again in C it would work aswell if you type casted the
         | types.                 const char *f()       {           return
         | (const char*)123;       }            int main()       {
         | int i = (int)f();           return 0;       }
        
         | tills13 wrote:
         | Yeesh. This is probably the most pedantic comment I've read on
         | HN.
         | 
         | Contributes absolutely nothing to the conversation, here.
        
         | r1chardnl wrote:
         | The main goal at the moment is to create a C compiler, the
         | function keyword there is just a placeholder for the time
         | being. At the moment anything returned is just mov'd into the
         | EAX register and then stored in a local variable if called that
         | way.
        
           | MisterTea wrote:
           | If you're looking for C-like languages to study look up the
           | plan 9 Alef language. It was a concurrent C-like that was
           | designed to replace C until they realized its better off as a
           | C library which is what plan 9's thread(2) is. It's mostly
           | extinct but survives:
           | 
           | http://doc.cat-v.org/plan_9/2nd_edition/papers/alef/
           | 
           | https://seh.dev/go-legacy/
        
             | spijdar wrote:
             | Worth pointing out Plan9's C was itself sort of its own
             | language. I think it was a strict subset, but I can't
             | remember all the details. The preprocessor was much
             | simpler, that much I remember.
        
               | mananaysiempre wrote:
               | Not exactly a subset, there were at least the anonymous
               | struct/union members that partially made it into C11 much
               | later.
               | 
               | But then if we look beyond the syntax almost any C
               | implementation is "its own language" in that it defines
               | behaviour the standard doesn't specify, and frequently
               | also changes behaviour the standard does, if in minor
               | ways. Like how the Windows ABI forces a noncompliant
               | wchar_t.
               | 
               | (Even if we look at the preprocessor, the classic
               | algorithm implemented in most places [that I can't be
               | bothered to find a link for] in fact expands more than
               | the standard strictly guarantees, _e.g._ you can
               | sometimes get recursive expansion out of it by creative
               | use of token pasting,--I remember reading the ANSI
               | committee thought the case was too "perverse" and didn't
               | specify anything [no link here as well, sorry... poke me
               | again if you actually care about this]. The standalone
               | preprocessor mcpp has warnings about this specification
               | hole, but other implementations don't as far as I know.
               | And you'd think the preprocessor, an almost purely
               | syntactic thing, wouldn't have implementation differences
               | worth keeping around.)
        
               | MisterTea wrote:
               | Yes, Plan 9 c has its own C library which does not follow
               | ANSI/ISO though it is very similar to ISO C. Its syntax
               | is pretty much c89 and the useful bits of c99. To build
               | ANSI/POSIX you use the APE (ANSI/POSIX Environment)
               | compilers and libraries which are built in. It's only
               | reason for existing is to help with porting Unix baggage.
               | 
               | The compiler is kencc which is why it is different. Plan
               | 9 also still uses a.out for binary images.
        
             | kubb wrote:
             | Oh, is _this_ where the inspiration for Go came from? It
             | would explain a lot.
        
               | kragen wrote:
               | Yes. Even before Alef, Newsqueak (01989) was almost
               | identical to Golang, and of course Alef, Limbo, etc.,
               | descend from Newsqueak.
        
             | r1chardnl wrote:
             | Looks really interesting, thanks for sharing this.
             | 
             | I'm not sure whether I'll incorporate everything from C or
             | just make it compatible with C and behind the scenes do
             | things differently e.g instead of carrying const char
             | */string pointers everywhere always carry the length of
             | said data structure along with it. Or for instance push an
             | additional pointer on the stack for memory zones, where an
             | allocator is always present and knows it's context, sort of
             | like __thiscall with class methods for this->.
        
           | jeffrallen wrote:
           | If you are exploring alternatives to the stdlib, you might
           | commit yourself to no zero terminated strings and see where
           | it takes you. A fundamental data type with a pointer, a
           | length and a capacity does a lot to stop buffer overflows.
        
             | guntars wrote:
             | To replace C-style strings you only need the length! You
             | also get cheap substrings which is such a common operation
             | that it alone IMO makes it worth taking this route.
        
       | justinlloyd wrote:
       | Great job! Writing a compiler from scratch is a massively
       | valuable learning experience. I wrote one in BASIC for the 6502,
       | that became self-hosting on April 24th, 1982. I'm going to watch
       | your project with interest and relive old memories.
        
       | kinga254 wrote:
       | Any specific resources you've used that you could recommend in
       | line with compiler construction?
        
         | optymizer wrote:
         | I wrote a C compiler* using flex [1] and bison [2] that
         | generated MIPS code. The biggest downside was the rather
         | complex glue code.
         | 
         | At some point ANTLR [3] looked promising, but these days I'd
         | probably write a lexer and recursive descent parser by hand,
         | then generate LLVM IR [4].
         | 
         | [1] https://github.com/westes/flex
         | 
         | [2] https://www.gnu.org/software/bison/
         | 
         | [3] https://www.antlr.org/
         | 
         | [4] https://llvm.org/
         | 
         | * for a subset of the full C89 spec
        
         | blcArmadillo wrote:
         | This is specific to interpreters but the beginning part related
         | to parsing and AST is common:
         | http://craftinginterpreters.com/contents.html
        
         | MH15 wrote:
         | I've had good luck with ANTLR4 and ObjWeb ASM for my JVM
         | language Imp https://github.com/mh15/imp
        
         | r1chardnl wrote:
         | You could use https://en.wikipedia.org/wiki/Yacc
         | 
         | I decided to write my own lexer/parser/compiler it's pretty
         | straightforward.
         | 
         | Sources I use(d) to program this, e.g look at resulting
         | assembly from other compilers or look at how the AST is
         | generated for other languages.
         | 
         | https://en.wikipedia.org/wiki/Recursive_descent_parser
         | 
         | https://en.cppreference.com/w/c/language/operator_precedence
         | 
         | https://godbolt.org/
         | 
         | https://astexplorer.net/
         | 
         | I didn't start writing C yesterday, and most of the things are
         | just stuff I've learned over the years and seeing how other
         | languages work and then just use what I know to try to program
         | in a logical manner.
         | 
         | Also I've written a implementation of a scripting language
         | (compiler and VM) before in similar fashion.
         | 
         | https://github.com/riicchhaarrd/gsc
        
       | [deleted]
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-08-03 23:00 UTC)