[HN Gopher] Show HN: A work-in-progress C compiler from scratch ___________________________________________________________________ Show HN: A work-in-progress C compiler from scratch Author : r1chardnl Score : 131 points Date : 2021-08-03 12:50 UTC (10 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | jjice wrote: | This is just an anecdote, but I had to write a SQL DDL and DML | parser during my last semester. I wrote the DDL parser by hand, | and it wasn't as bad as I expected, but it was time consuming. I | managed to convince the professor to give us the option of using | a parser generator for the next phase (DML) since the point of | the class wasn't parsing context free grammars and more focused | on executing the SQL. | | I used Flex and Bison since the project was in C. Getting up and | running and understanding how the tools have to be set up with | different options was a bit tricky, but after that, my parser was | up and running in about two hours, compared to probably four | times that for the hand written DDL. Our DML subset was also much | larger and more complex than our DDL, so I was very happy with | the development speed increase. | | I had this idea that using a parser generator was slow and | wasteful since many modern tutorials online write them by hand | and speak against parser generators (possibly because there isn't | a catch all for all languages). Turns out dev speed is way more | important to me up front, because in the case that I notice | parsing speed actually being an issue I should be happy that my | MVP has gotten enough use. | | It's also nice because a lexer and parser can be pretty easily | black-boxed and swapped out for your hand written, just keep the | AST and API the same and you should be good. | | All that said, that's personal preferences and writing the parser | by hand is definitely good experience and more extensible, | especially for error handling. Nice work! | brundolf wrote: | To add an anecdote, I've gotten into a hobby of writing parsers | and parser-related things, and I've developed a pattern and a | handful of utility functions (which are small enough to be re- | written from memory) that together make hand-writing a pretty | breezy task. Just the other week at work I hand-wrote a GraphQL | schema parser from scratch in about 2 hours using this | technique (GraphQL's schemas have a relatively simple syntax, | but I still think it's a decent data point) | qorrect wrote: | Is it hosted somewhere we can see it? | brundolf wrote: | It's not, though I've thought about open-sourcing or doing | a write-up on my pattern, but I've just assumed there are | tons of libraries out there for this that are probably | better than my homegrown version. Could still make an | interesting blog post though, I suppose | WalterBright wrote: | > dev speed is way more important to me up front | | Having written many parsers, I can attest that the time spent | writing and debugging the parser is about 0.0000001% of the | time you'll invest in the compiler. | | Heck, I spent more time trying to figure out how to integrate | the tag name space from ImportC into the host D compiler's | symbol table mechanism, than I did writing the C parser. | dboreham wrote: | Your post came through a wormhole from 1979. | SavantIdiot wrote: | Every time I see a compiler, the first thing a zero in on is the | parser. | | I don't mean this in a bad way, but that tokenizer looks a bit | ... simple. ;-) | | I say that because ~26 years ago I had a go at writing a C | preprocessor with the goal of creating a ToC and Index for my C | projects. I sat down with a BNF of C and lex/yacc and got ground | up in the details, then found some software that did exactly what | I wanted and moved on. I just remember a few parts of the C | syntax that don't fit nicely into BNF and hadn't studied parsers | enough to dig myself out of the hole. | | Good luck doing it manually, that's gonna be tricky as heck. | fao_ wrote: | Parsing it manually is actually the preferred choice for most | compilers, both GCC and Clang do it manually, for example. | SavantIdiot wrote: | Really? That's probably why I failed! Ironically (given my | post) I've never looked at Clang's parser. I have looked at | GCC's and, well, sheepishly: I don't get it, it is a beast of | a project and I wasn't able to crack the architecture. I | should try again tho. | rot13xor wrote: | C is almost context-free with the exception of typedef. A | custom parser helps with error diagnostics, e.g. missed | semicolon or a "." swapped with "->". | | https://eli.thegreenplace.net/2007/11/24/the-context- | sensiti... | gwbas1c wrote: | What are your goals? Is this a learning project? Are you trying | to experiment with some kind of compiler architecture? | r1chardnl wrote: | Learning, hobby and fun. | | Aspiration to create a language that I would love to use some | day. | d6ba56c039d9 wrote: | Here's an assignment for you. | | Have your compiler generate high quality assembly language | for a DSP. | bruce343434 wrote: | Good luck with your project and language! | r1chardnl wrote: | Thanks, seeing it unfold whilst fixing and adding new features | and compiling your program into a executable is really cool. | | Motivation is key though. | thamer wrote: | Just a few small suggestions: | | 1. Some code seems to be duplicated (although with slightly | different formatting) across multiple files, like dd/dw/db in | elf.c and x86.c for example. I'd suggest consolidating those. | | 2. It helps to define types when you have variables that can take | a limited set of possible values; typedef works but enums are | better. I'm thinking of variables like the `type` parameter to | `primitive_data_type_size`, for example. The compiler can help | you detect a missing case statement if you use an enum, but not | an int. | | 3. It's a matter of preference, but typedef'ing your structs can | make your code more concise and not have it littered with struct | keywords. | | 4. You seem to have a mix of tabs and spaces in some files (e.g. | in `elf.c`). I would recommend configuring your editor to use | just one of the two or it'll start to be an issue as your code | base grows. | | 5. On a related point, I'd suggest picking a formatting style and | sticking to it. You have functions like `accept` in `ast.c` | declared without spaces after the opening `(` and others like | `read_file` in the same file that has spaces. | | Good luck! This is a great way to learn. | MisterTea wrote: | > _Programming language that compiles into a x86 ELF executable._ | | > _a compiler somewhat resembling C_ | | > _Add new or borrow from other language(s) features ontop of C, | deviate away from just C_ function strlen(const | char *s) | | This does not appear to be a C compiler. The title should reflect | this. | r1chardnl wrote: | Can't edit my original reply anymore, but fair point, I've | changed the keyword 'function' in the AST to a function return | type instead. There's no error handling done yet to check | whether the variable receiving the function return data matches | the function return type, but it's on my todo list. | | https://github.com/riicchhaarrd/ocean/commit/0618e0810c8d437... | | Then again in C it would work aswell if you type casted the | types. const char *f() { return | (const char*)123; } int main() { | int i = (int)f(); return 0; } | tills13 wrote: | Yeesh. This is probably the most pedantic comment I've read on | HN. | | Contributes absolutely nothing to the conversation, here. | r1chardnl wrote: | The main goal at the moment is to create a C compiler, the | function keyword there is just a placeholder for the time | being. At the moment anything returned is just mov'd into the | EAX register and then stored in a local variable if called that | way. | MisterTea wrote: | If you're looking for C-like languages to study look up the | plan 9 Alef language. It was a concurrent C-like that was | designed to replace C until they realized its better off as a | C library which is what plan 9's thread(2) is. It's mostly | extinct but survives: | | http://doc.cat-v.org/plan_9/2nd_edition/papers/alef/ | | https://seh.dev/go-legacy/ | spijdar wrote: | Worth pointing out Plan9's C was itself sort of its own | language. I think it was a strict subset, but I can't | remember all the details. The preprocessor was much | simpler, that much I remember. | mananaysiempre wrote: | Not exactly a subset, there were at least the anonymous | struct/union members that partially made it into C11 much | later. | | But then if we look beyond the syntax almost any C | implementation is "its own language" in that it defines | behaviour the standard doesn't specify, and frequently | also changes behaviour the standard does, if in minor | ways. Like how the Windows ABI forces a noncompliant | wchar_t. | | (Even if we look at the preprocessor, the classic | algorithm implemented in most places [that I can't be | bothered to find a link for] in fact expands more than | the standard strictly guarantees, _e.g._ you can | sometimes get recursive expansion out of it by creative | use of token pasting,--I remember reading the ANSI | committee thought the case was too "perverse" and didn't | specify anything [no link here as well, sorry... poke me | again if you actually care about this]. The standalone | preprocessor mcpp has warnings about this specification | hole, but other implementations don't as far as I know. | And you'd think the preprocessor, an almost purely | syntactic thing, wouldn't have implementation differences | worth keeping around.) | MisterTea wrote: | Yes, Plan 9 c has its own C library which does not follow | ANSI/ISO though it is very similar to ISO C. Its syntax | is pretty much c89 and the useful bits of c99. To build | ANSI/POSIX you use the APE (ANSI/POSIX Environment) | compilers and libraries which are built in. It's only | reason for existing is to help with porting Unix baggage. | | The compiler is kencc which is why it is different. Plan | 9 also still uses a.out for binary images. | kubb wrote: | Oh, is _this_ where the inspiration for Go came from? It | would explain a lot. | kragen wrote: | Yes. Even before Alef, Newsqueak (01989) was almost | identical to Golang, and of course Alef, Limbo, etc., | descend from Newsqueak. | r1chardnl wrote: | Looks really interesting, thanks for sharing this. | | I'm not sure whether I'll incorporate everything from C or | just make it compatible with C and behind the scenes do | things differently e.g instead of carrying const char | */string pointers everywhere always carry the length of | said data structure along with it. Or for instance push an | additional pointer on the stack for memory zones, where an | allocator is always present and knows it's context, sort of | like __thiscall with class methods for this->. | jeffrallen wrote: | If you are exploring alternatives to the stdlib, you might | commit yourself to no zero terminated strings and see where | it takes you. A fundamental data type with a pointer, a | length and a capacity does a lot to stop buffer overflows. | guntars wrote: | To replace C-style strings you only need the length! You | also get cheap substrings which is such a common operation | that it alone IMO makes it worth taking this route. | justinlloyd wrote: | Great job! Writing a compiler from scratch is a massively | valuable learning experience. I wrote one in BASIC for the 6502, | that became self-hosting on April 24th, 1982. I'm going to watch | your project with interest and relive old memories. | kinga254 wrote: | Any specific resources you've used that you could recommend in | line with compiler construction? | optymizer wrote: | I wrote a C compiler* using flex [1] and bison [2] that | generated MIPS code. The biggest downside was the rather | complex glue code. | | At some point ANTLR [3] looked promising, but these days I'd | probably write a lexer and recursive descent parser by hand, | then generate LLVM IR [4]. | | [1] https://github.com/westes/flex | | [2] https://www.gnu.org/software/bison/ | | [3] https://www.antlr.org/ | | [4] https://llvm.org/ | | * for a subset of the full C89 spec | blcArmadillo wrote: | This is specific to interpreters but the beginning part related | to parsing and AST is common: | http://craftinginterpreters.com/contents.html | MH15 wrote: | I've had good luck with ANTLR4 and ObjWeb ASM for my JVM | language Imp https://github.com/mh15/imp | r1chardnl wrote: | You could use https://en.wikipedia.org/wiki/Yacc | | I decided to write my own lexer/parser/compiler it's pretty | straightforward. | | Sources I use(d) to program this, e.g look at resulting | assembly from other compilers or look at how the AST is | generated for other languages. | | https://en.wikipedia.org/wiki/Recursive_descent_parser | | https://en.cppreference.com/w/c/language/operator_precedence | | https://godbolt.org/ | | https://astexplorer.net/ | | I didn't start writing C yesterday, and most of the things are | just stuff I've learned over the years and seeing how other | languages work and then just use what I know to try to program | in a logical manner. | | Also I've written a implementation of a scripting language | (compiler and VM) before in similar fashion. | | https://github.com/riicchhaarrd/gsc | [deleted] | [deleted] ___________________________________________________________________ (page generated 2021-08-03 23:00 UTC)