https://github.com/jamii/hytradboi-jam-2022

# hytradboi-jam-2022

Goal: Pass all 4259065 tests in sqllogictest. From scratch. No dependencies. No surrender.

## Day 1

I wrote the test harness in advance, so my starting point on Monday morning was 4.2 million test failures.

```
> zig build test_slt -Drelease-safe=true -- $(rg --files deps/slt)
HashMap(
    error.Unimplemented => 4259065,
)
passes => 0
```

Serious databases write their own parsers, but serious databases aren't finished in a week. We're going to need a shortcut. Let's grab this bnf extracted from the SQL 2016 spec. Only 9139 lines long. Also impossible to parse, because they don't escape tokens. Sometimes a `[` is just a `[`, but sometimes it opens an optional group:

```
<left bracket> ::= [

<less than operator> ::= <

<sort specification> ::= <sort key> [ <ordering specification> ] [ <null ordering> ]
```

Also check this out:

```
<national character large object type> ::=
    NATIONAL CHARACTER LARGE OBJECT [ <left paren> <large object length> <right paren> ]
  | NCHAR LARGE OBJECT [ <left paren> <large object length> <right paren> ]
  | NCLOB [ <left paren> <large object
    length> <right paren> ]
```

Yeah, they line-wrapped to 90 columns in the middle of `<large object length>`, so I have to strip excess whitespace from names after parsing them.

So parsing the bnf is kind of a mess, but I only have to parse this one bnf and not bnfs in general, so I just mashed in a bunch of special cases. I don't really have a way of looking at the result yet, but I at least have the right number of definitions.

So now I just have to use the bnf to parse sql. I don't really know how to do that, because I always just hand-write recursive descent parsers whenever I need to parse something. The sql grammar is left-recursive, so I can't just go top-down or it will potentially loop forever. But there is a neat trick that I heard about when working with xtdb earlier this year. It's tricky to explain, but the intuition is similar to how iterative deepening keeps depth-first search from getting stuck in infinitely deep search trees.

So whenever we hit a rule that might be left-recursive, we make an entry in the cache that says "parse failed". That forces the parser to explore the other branches. If it manages to find some base case, we put that in the cache and try again, to see if we can parse a bigger chunk of the input. If we ever get a result that is no better than the result already in the cache, then there isn't any point exploring deeper trees.
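Here's the core of that loop, as a sketch over a toy left-recursive grammar (the grammar and names below are invented for illustration; the real memo is keyed by rule as well as position):

```zig
const std = @import("std");

// Toy left-recursive rule: expr ::= expr "-" digit | digit
// The memo maps source position -> end of the best parse so far,
// where null means "parse failed".
const Memo = std.AutoHashMap(usize, ?usize);

fn parseExpr(src: []const u8, pos: usize, memo: *Memo) !?usize {
    if (memo.get(pos)) |cached| return cached;

    // Seed the memo with failure so the left-recursive branch bottoms out.
    try memo.put(pos, null);

    var best: ?usize = null;
    while (true) {
        // Reparse; the recursive call now sees the memoized result.
        const end = parseExprOnce(src, pos, memo) orelse break;
        // A result no better than the cached one means we're done growing.
        if (best != null and end <= best.?) break;
        best = end;
        try memo.put(pos, best);
    }
    return best;
}

// One unmemoized attempt at: expr "-" digit | digit.
fn parseExprOnce(src: []const u8, pos: usize, memo: *Memo) ?usize {
    if (parseExpr(src, pos, memo) catch null) |after_expr| {
        if (after_expr < src.len and src[after_expr] == '-') {
            if (parseDigit(src, after_expr + 1)) |end| return end;
        }
    }
    return parseDigit(src, pos);
}

fn parseDigit(src: []const u8, pos: usize) ?usize {
    if (pos < src.len and src[pos] >= '0' and src[pos] <= '9') return pos + 1;
    return null;
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    var memo = Memo.init(gpa.allocator());
    defer memo.deinit();
    // Each pass of the loop grows the parse: 1, then 1-2, then 1-2-3.
    std.debug.print("parsed up to {any}\n", .{try parseExpr("1-2-3", 0, &memo)});
}
```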
It took a while to debug the logic, but I think it makes sense now that I've worked through it.

It took a lot more debugging to start successfully parsing actual sql:

* My bnf parser had the wrong binding power for `|`. I had to turn it into a slightly Pratt-style parser to fix it.
* The bnf from the spec doesn't describe actual sql that well (eg `select 1;` is invalid in the spec, despite being accepted by every database since forever). So I've had to start tweaking the bnf to get the sqlite tests to parse.

Here's where the scoreboard stands at the end of the day:

```
> time zig build test_slt -Drelease-safe=true -- $(rg --files deps/slt)
HashMap(
    error.Unimplemented => 172471,
    error.ParseError => 4086594,
)
passes => 0

________________________________________________________
Executed in   94.84 secs    fish           external
   usr time   94.33 secs    2.63 millis   94.33 secs
   sys time    0.53 secs    0.00 millis    0.53 secs
```

Tomorrow I'll start working through those parse failures. I'll probably need to build a little gui debugger to help, because trying to read through printlns of all the productions of that massive grammar is not time-efficient.

## Day 2

I realized that I unthinkingly made the parser greedy. It will backtrack within an either, but if the first branch succeeds it will never go back to explore the second branch. That means that when it hits something like `avg(c)`, it parses the `avg` as an identifier and then can't backtrack to try to parse it as a function call later.
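To see the problem in miniature (these toy rules are invented for illustration and are not the real parser):

```zig
const std = @import("std");

// Toy rules: expr ::= name | call ; call ::= name "(" name ")"

fn parseName(src: []const u8, pos: usize) ?usize {
    var end = pos;
    while (end < src.len and src[end] >= 'a' and src[end] <= 'z') : (end += 1) {}
    return if (end > pos) end else null;
}

fn parseCall(src: []const u8, pos: usize) ?usize {
    const after_name = parseName(src, pos) orelse return null;
    if (after_name >= src.len or src[after_name] != '(') return null;
    const after_arg = parseName(src, after_name + 1) orelse return null;
    if (after_arg >= src.len or src[after_arg] != ')') return null;
    return after_arg + 1;
}

// Greedy ordered choice: the first branch that succeeds wins,
// even if it dooms the rest of the parse.
fn parseExprGreedy(src: []const u8, pos: usize) ?usize {
    return parseName(src, pos) orelse parseCall(src, pos);
}

pub fn main() void {
    // Greedy stops at 3 (just `avg`), leaving `(c)` unconsumed.
    std.debug.print("greedy: {any}\n", .{parseExprGreedy("avg(c)", 0)});
    // The branch it never retries would have consumed all 6 characters.
    std.debug.print("as call: {any}\n", .{parseCall("avg(c)", 0)});
}
```

The greedy choice commits to the name parse ending at 3, even though only the call parse ending at 6 lets the caller consume the rest of the input.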
SQL can be parsed with this kind of minimal backtracking (that's how the materialize parser works), but the bnf grammar is not designed with that in mind, and it would be way too much work to edit it by hand.

I read a bunch about table-driven parsers for arbitrary context-free grammars (because who knows what class of grammar this bnf is). Eventually I decided it would be safer to just add backtracking to the parser I already have, since that's a method I already understand. I flailed for a long while trying to figure out how to combine this with the memoized loop used for handling left-recursive definitions.

Eventually I came up with what I think is a workable design, where the memoized value is stored on the parse stack and the loop works by adding backtrack points whenever a recursive definition completes successfully. I started on the implementation, but at this point I'm just staring at the screen and not making forward progress, so I'm going to stop early and get some rest.

My original plan looked like this:

* sql parser (3 days)
* name resolution (1 day)
* type inference and lowering (1 day)
* optimization and execution (1 day)
* grind out the long tail of bugs (1 day)

I estimated 3 days to get the sql parser working because I've been to sql-land before and I know what horrors lurk there, but I'm a little worried now that it will take even longer.

## Day 3

Yesterday I was trying to be all fancy and make the backtracking actually backtrack (which is much harder when you don't have generators or lazy lists built into the language). Today I decided to just get on with it and generate every subparse. I got this working really quickly.
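With the same toy rules as before, generating every subparse just means each rule returns every end position it can reach, instead of committing to one (still a sketch, not the real code):

```zig
const std = @import("std");

// expr ::= name | call ; call ::= name "(" name ")"
// Every rule appends all the positions it could end at.

fn parseName(src: []const u8, pos: usize, ends: *std.ArrayList(usize)) !void {
    var end = pos;
    while (end < src.len and src[end] >= 'a' and src[end] <= 'z') : (end += 1) {}
    if (end > pos) try ends.append(end);
}

fn parseCall(src: []const u8, pos: usize, ends: *std.ArrayList(usize)) !void {
    var names = std.ArrayList(usize).init(ends.allocator);
    defer names.deinit();
    try parseName(src, pos, &names);
    for (names.items) |after_name| {
        if (after_name < src.len and src[after_name] == '(') {
            var args = std.ArrayList(usize).init(ends.allocator);
            defer args.deinit();
            try parseName(src, after_name + 1, &args);
            for (args.items) |after_arg| {
                if (after_arg < src.len and src[after_arg] == ')') {
                    try ends.append(after_arg + 1);
                }
            }
        }
    }
}

// Both branches always run, so no subparse is ever lost.
fn parseExpr(src: []const u8, pos: usize, ends: *std.ArrayList(usize)) !void {
    try parseName(src, pos, ends);
    try parseCall(src, pos, ends);
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    var ends = std.ArrayList(usize).init(gpa.allocator());
    defer ends.deinit();
    try parseExpr("avg(c)", 0, &ends);
    // Prints both subparses: `avg` as a bare name (3) and the call (6).
    std.debug.print("{any}\n", .{ends.items});
}
```

The caller then keeps only the parses that consume the whole input, which is where the parse-tree explosion comes from.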
Unfortunately, I then discovered that the bnf from the spec is wildly ambiguous. Check out the 7 different parses for `select 1;`.

I ran the thing against the whole test suite just to see how badly I'm doing:

```
HashMap(
    error.NoParse => 3971842,
    error.Unimplemented => 275660,
    error.AmbiguousParse => 11563,
)
passes => 0

________________________________________________________
Executed in   31.88 mins    fish           external
   usr time   31.77 mins    0.00 micros   31.77 mins
   sys time    0.10 mins  734.00 micros    0.10 mins
```

It's really slow, because every complex query produces a bazillion parse trees. And it's still not parsing most of the tests correctly. Plus, even if I get this whole thing working, I then have to turn those awful parse trees into something sane. So... this is not going to work.

I didn't want to roll my own parser, because we did that at materialize and, while it worked out fine, it's more than 10kloc of rust. If I were just directly transcribing it, I would have to type at 5 characters per second for 24 straight hours. I looked around at some other databases, and even the ones that use parser generators are huge:

```
> scc cockroachdb-parser/pkg/sql/scanner/ cockroachdb-parser/pkg/sql/parser/
-------------------------------------------------------------------------------
Language                 Files     Lines   Blanks  Comments     Code Complexity
-------------------------------------------------------------------------------
Go                           6      2305      202       356     1747        403
AWK                          4       243       27        73      143          0
Bazel                        2       183       11         9      163          0
Shell                        2        49       13         8       28          1
Happy                        1     14658      843         0    13815          0
Markdown                     1       284       65         0      219          0
gitignore                    1         8        1         2        5          0
-------------------------------------------------------------------------------
Total                       17     17730     1162       448    16120        404
-------------------------------------------------------------------------------
Estimated Cost to Develop (organic) $500,413
Estimated Schedule Effort (organic) 10.572043 months
Estimated People Required (organic) 4.205190
-------------------------------------------------------------------------------
Processed 506614 bytes, 0.507 megabytes (SI)
-------------------------------------------------------------------------------
```

Except sqlite, which has a tiny little grammar definition:

```
> scc sqlite/src/parse.y sqlite/tool/lemon.c sqlite/tool/lempar.c
-------------------------------------------------------------------------------
Language                 Files     Lines   Blanks  Comments     Code Complexity
-------------------------------------------------------------------------------
C                            2      6961      356      1195     5410        916
Happy                        1      1928      179         0     1749          0
-------------------------------------------------------------------------------
Total                        3      8889      535      1195     7159        916
-------------------------------------------------------------------------------
Estimated Cost to Develop (organic) $213,397
Estimated Schedule Effort (organic) 7.647269 months
Estimated People Required (organic) 2.479133
-------------------------------------------------------------------------------
Processed 285597 bytes, 0.286 megabytes (SI)
-------------------------------------------------------------------------------
```

I only have to parse the sqlite grammar, and even then probably only a fraction of it. So it looks like writing a parser generator might be plausible, and by now it's really my only option if I want to get any tests passing at all.

So here's my grammar so far. It's parsed by GrammarParser. In theory I could do this at comptime, but until we get a comptime allocator that's going to be a yak shave, and I am way too many yak shaves deep already. So I laboriously but reliably write out all the rules into grammar.zig.

The Tokenizer is written by hand and seems to be basically done - it runs without complaint on the entire test set. I might discover bugs there later, of course.

The Parser is fun. There is a single parse function, but because it reads the rules from grammar.zig at compile time, it gets specialized for each parse rule. Basically I got the same result as hand-generating the parser code, without having to splice a bunch of strings together.
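Roughly like this, with a hardcoded two-rule grammar standing in for grammar.zig (the rules and token names here are invented for illustration):

```zig
const std = @import("std");

const Token = enum { SELECT, number, semicolon };
const Rule = enum { select, number };

// The single parse function. `rule` is comptime, so every rule gets
// its own specialized copy of this function, just as if the parser
// had been generated by hand.
fn parse(comptime rule: Rule, tokens: []const Token, pos: usize) ?usize {
    switch (rule) {
        // select ::= SELECT number semicolon
        .select => {
            if (pos >= tokens.len or tokens[pos] != .SELECT) return null;
            const after = parse(.number, tokens, pos + 1) orelse return null;
            if (after >= tokens.len or tokens[after] != .semicolon) return null;
            return after + 1;
        },
        .number => {
            if (pos >= tokens.len or tokens[pos] != .number) return null;
            return pos + 1;
        },
    }
}

pub fn main() void {
    const tokens = [_]Token{ .SELECT, .number, .semicolon };
    // parse(.select, ...) and parse(.number, ...) compile to two
    // separate, specialized functions.
    std.debug.print("{any}\n", .{parse(.select, &tokens, 0)});
}
```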
After the jam maybe I'll have some yak shave time to cut out grammar.zig entirely and do all the grammar stuff at comptime, and then it'll be a pretty sweet system.

The best part about this is that I get really nice parse trees by generating rich types from the input grammar. Eg:

```zig
pub const anon_18 = ?distinct_or_all;
pub const anon_19 = ?from;
pub const anon_20 = ?where;
pub const anon_21 = ?group_by;
pub const anon_22 = ?having;
pub const anon_23 = ?window;
pub const anon_24 = ?order_by;
pub const anon_25 = ?limit;
pub const select = struct {
    distinct_or_all: *anon_18,
    result_columns: *result_columns,
    from: *anon_19,
    where: *anon_20,
    group_by: *anon_21,
    having: *anon_22,
    window: *anon_23,
    order_by: *anon_24,
    limit: *anon_25,
};
```

I still can't parse many of the tests, because I haven't implemented expressions, operator precedence etc. So I'm still stuck in parsing land. But I can see the light at the end of the tunnel now. And at least it's fast:

```
> time zig build test_slt -Drelease-safe=true -- $(rg --files deps/slt)
HashMap(
    error.ParseError => 4257825,
    error.Unimplemented => 1240,
)
passes => 0

________________________________________________________
Executed in   11.85 secs    fish           external
   usr time   11.49 secs    0.44 millis   11.49 secs
   sys time    0.39 secs    1.79 millis    0.39 secs
```

I'm going to spend tomorrow fleshing out as much of the parsing as possible, and then switch to trying to analyze and execute some of the simpler tests.

## Day 4

I lost a lot of time in the morning to segfaults in the zig compiler. I eventually worked around them by moving more stuff from comptime to codegen. I fleshed out the parser and grammar a bunch more.

I added memoization again to handle left recursion, as well as a ton of debugging output. I was making pretty good progress through parsing more and more queries, when something I added caused some queries to loop seemingly forever in the parser. I've been banging my head against this for hours and getting nowhere. I suspect something in the interaction between memoization and the stack of expr_* types that handle precedence in the grammar. Maybe I'll figure it out tomorrow. It's not looking good for passing any tests at all, let alone all of them. The light at the end of the tunnel was a train.

---

I figured it out at home. I was barking up totally the wrong tree. The expr grammar is actually not left-recursive after the changes I made, which means the memo is not being applied. But the memo was covering up the fact that I wrote the grammar in a silly way that requires exponential time to parse without memoization. Which means I actually don't need memoization (which I spent most of today working on) at all for my grammar. This whole left-recursive memoization thing has been a total dead end. I should have stuck with techniques I'm familiar with from the start. This is not the time to be trying experimental new things.

The silver lining is this:

```
HashMap(
    error.ParseError => 3725149,
    error.Unimplemented => 533916,
)
passes => 0

________________________________________________________
Executed in   50.07 secs    fish           external
   usr time   49.83 secs    1.75 millis   49.82 secs
   sys time    0.23 secs    1.90 millis    0.23 secs
```

That's about 12.5% of the tests parsing!

## Day 5

I made some changes to the parser generator, added a whole bunch of debugging tools to the parser itself, and then sat down and cranked on the grammar till I could parse 100% of the tests. I'm finally out of the damn tunnel.

Then I sketched out a framework for planning and evaluation and filled out just enough to handle scalar-only queries.
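The sketch below shows the shape of that framework on an invented scalar-only Plan type (just constants and addition; the real thing covers much more):

```zig
const std = @import("std");

const Value = union(enum) {
    integer: i64,
    null_,
};

// A scalar-only plan: no tables, just expressions to evaluate.
const Plan = union(enum) {
    constant: Value,
    add: struct { left: *const Plan, right: *const Plan },
};

fn eval(plan: *const Plan) Value {
    switch (plan.*) {
        .constant => |value| return value,
        .add => |args| {
            const left = eval(args.left);
            const right = eval(args.right);
            // sql nulls propagate through arithmetic.
            if (left == .null_ or right == .null_) return .null_;
            return .{ .integer = left.integer + right.integer };
        },
    }
}

pub fn main() void {
    // Plan for `select 1 + 2;`
    const one = Plan{ .constant = .{ .integer = 1 } };
    const two = Plan{ .constant = .{ .integer = 2 } };
    const plan = Plan{ .add = .{ .left = &one, .right = &two } };
    std.debug.print("{d}\n", .{eval(&plan).integer});
}
```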
The scoreboard at the end of the day:

```
HashMap(
    error.NoPlan => 4066003,
)
skips => 3869887
passes => 193062

________________________________________________________
Executed in  121.77 secs    fish           external
   usr time  120.77 secs    0.02 millis  120.77 secs
   sys time    0.96 secs    1.01 millis    0.96 secs
```

4.5% of tests are passing! This is how I expected the week to go. Not death by bnf.

The odds of hitting 100% now are very low - there is a long tail of weird behavior that I expected to have much more time to implement. But I think in the next two days I might be able to get to something respectable.

## Day 6

Full steam ahead:

```
total => 5939714
HashMap(
    error.TypeError => 198542,
    error.NoPlan => 2345772,
    error.StatementShouldError => 2,
    error.ParseError => 2,
    error.TestExpectedEqual => 9379,
    error.BadColumn => 42749,
)
skips => 127880
passes => 3215388 (= 54.13%)

________________________________________________________
Executed in  581.30 secs    fish           external
   usr time  258.12 secs  660.00 micros  258.12 secs
   sys time  342.46 secs  324.00 micros  342.46 secs
```

What are we looking at here?

`total => 5939714`. Some tests are marked with tags like `onlyif sqlite` or `skipif sqlite`. Before, I was only looking at tests with no tags (ie generic sql), but today I found out that tags are often used to split out setup code like:

```
onlyif oracle
statement ok
CREATE TABLE foo(...funky oracle workaround...)

skipif oracle
statement ok
CREATE TABLE foo(...regular sql...)
```

I was skipping both, which left me with no table. So now I also run any statements tagged with `skipif`, which leaves me with a lot more tests to pass.

`passes => 3215388 (= 54.13%)`. That's a lot of progress.

`skips => 127880`. Most of the slt files start with some statements that set up some data, followed by a bunch of queries. If I know that I screwed up one of those statements, I still try to execute the queries, but I assign any errors to the skip category so that I'm not debugging the wrong sql.

`error.NoPlan => 2345772`. Things that my planner does not support yet.

`error.TypeError => 198542`. Sqlite has some hilarious implicit type casts. I didn't implement these, so sometimes my queries fail when sqlite tries to eg add a boolean to a string. Some of these might also be from the planner miscounting columns, though.

```
sqlite> select TRUE + "foo";
1
```

`error.BadColumn => 42749`. Sometimes I generate plans that refer to non-existent columns. I catch these during eval, so that the bounds check doesn't kill my test run.

`error.TestExpectedEqual => 9379`. I produced the wrong output for the test. Surprisingly few of these.

`error.ParseError => 2`. I added a bunch of new tests, so I had to expand the grammar some more. I didn't catch everything.

`error.StatementShouldError => 2`. Some statement was supposed to abort with an error but didn't. This will probably screw up other tests in the same file.

---

The major changes I made today were supporting most statements, almost all scalar expressions, SELECT/FROM/WHERE, and some forms of non-correlated subqueries. I also added memory limits (via std.heap.GeneralPurposeAllocator) and cpu limits (via counting backwards branches) to evaluation, to prevent the odd bad plan from trashing my test run.
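The cpu limit amounts to a fuel counter charged on every backwards branch. A sketch (the Juice struct and the budget are invented, though error.OutOfJuice is the real error name that shows up in the final stats):

```zig
const std = @import("std");

const Juice = struct {
    remaining: u64,

    // Charge one unit; a bad plan runs out instead of hanging the run.
    fn spend(self: *Juice) error{OutOfJuice}!void {
        if (self.remaining == 0) return error.OutOfJuice;
        self.remaining -= 1;
    }
};

// Stand-in for an evaluator loop gone wrong (eg a huge cross product).
fn evalRunawayPlan(juice: *Juice) !u64 {
    var rows: u64 = 0;
    while (rows < (1 << 62)) : (rows += 1) {
        try juice.spend(); // charged on every backwards branch
    }
    return rows;
}

pub fn main() void {
    var juice = Juice{ .remaining = 1_000_000 };
    if (evalRunawayPlan(&juice)) |rows| {
        std.debug.print("finished: {d} rows\n", .{rows});
    } else |err| {
        std.debug.print("killed by {s}\n", .{@errorName(err)});
    }
}
```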
With one day left to go, I have to be strategic about what I support next. A few days ago I dumped a list of grammar rules by how many test cases they occur in. The list is probably a bit out of date with the newly added tests, but hopefully directionally correct. The big ones I'm missing are:

* ORDER BY = 804266
* JOIN = 143153
* GROUP = 108561
* subquery = 97229
* function call = 57144
* VIEW = 57144 (this is probably responsible for most of the skips though)
* UNION/INTERSECT/EXCEPT = 20440
* DELETE = 19877 (again, probably responsible for a lot of skips and/or wrong results)
* CASE = 16811
* HAVING = 13375

So tomorrow I'll probably just try and crank through features in that order and see where I end up.

## Day 7

I completely rewrote the way name resolution works, to handle all the weird edge cases in select_body. I can now handle stuff like `select a, count(a+1) as count, a+2 as c from foo where c > 0 order by b, count`. (Think about what order those scalar expressions have to get executed in.) I'll go into depth about that in the final writeup.

Then I just cranked. I added way more detail to the error reports so I could priority-sort missing features, and then I typed a lot of code.

```
> git diff 3b25276a871f752a700e7a4942d51fd50ce63f5 --stat
 README.md             |    7 +
 lib/sql.zig           |   22 +-
 lib/sql/Evaluator.zig |  286 +++++++++++++++++++---
 lib/sql/Planner.zig   | 1097 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------
 lib/sql/grammar.txt   |    2 +-
 lib/sql/grammar.zig   |   21 +-
 test/slt.zig          |    9 +-
 7 files changed, 1188 insertions(+), 256 deletions(-)
```

Funnily enough, it looks like I've been averaging about 1000 lines per day:

```
> scc ./lib
-------------------------------------------------------------------------------
Language                 Files     Lines   Blanks  Comments     Code Complexity
-------------------------------------------------------------------------------
Zig                          8      6476      185       106     6185        708
Plain Text                   3       390       17         0      373          0
-------------------------------------------------------------------------------
Total                       11      6866      202       106     6558        708
-------------------------------------------------------------------------------
Estimated Cost to Develop (organic) $194,627
Estimated Schedule Effort (organic) 7.384347 months
Estimated People Required (organic) 2.341581
-------------------------------------------------------------------------------
Processed 268876 bytes, 0.269 megabytes (SI)
-------------------------------------------------------------------------------
```

I implemented ORDER BY, GROUP BY (without grouped columns), functions, aggregates, CASE, CAST, correlated IN, INNER/CROSS JOIN, and various edge cases in sqlite's implicit coercions and comparisons. I got pretty damn close.

```
> zig build generate_grammar && time zig build test_slt -Drelease-safe=true -- $(rg --files deps/slt | sort -h)
...
total => 5939714
HashMap(
    error.NoPlanOther => 133170,
    error.TestExpectedEqual => 16820,
    error.NoPlanExprAtom => 2848,
    error.OutOfJuice => 2162,
    error.WrongNumberOfColumnsReturned => 1681,
    error.NoPlanCompound => 1000,
    error.NoPlanConstraint => 550,
    error.NoPlanLeft => 191,
    error.BadEvalAggregate => 59,
    error.NoPlanStatement => 32,
    error.ParseError => 2,
    error.StatementShouldError => 2,
)
skips => 123015
passes => 5658182 (= 95.26%)

________________________________________________________
Executed in   18.64 mins    fish           external
   usr time  462.87 secs    0.00 micros  462.87 secs
   sys time  658.11 secs  891.00 micros  658.11 secs
```

Man, if I hadn't spent more than half the week in that parsing rabbit hole, I really think I might have made it to 100%. But this'll do.

---

Detailed writeup to follow, probably in a week or two.