[HN Gopher] I wrote a SQL engine in Python ___________________________________________________________________ I wrote a SQL engine in Python Author : marsupialtail_2 Score : 94 points Date : 2022-12-30 20:08 UTC (2 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | einpoklum wrote: | > A library to parse and optimize SQL, | | That's like saying "a library to parse and optimize computer | programs", except probably even harder, since a compiler and | runtime library can't make any assumptions about the programs | they need to make run, so they're limited in the potential of | utilizing all that context information. | | Countless person-years have been spent on this and it's still a | very active fields of research and engineering. | | > 2x SparkSQL performance | | Ah, ok, so it can be slow. Never mind then, carry on :-P | captaintobs wrote: | I'm the author of SQLGlot. Optimizing SQL is quite common and | means things like projection / predicate push downs and logical | simplification. SQL can be optimized relatively easily compared | to a programming language because it is declarative. For | example | | SELECT * FROM (SELECT * FROM x) WHERE z = 1 | | can be optimized into | | SELECT * FROM x where z = 1 | marsupialtail_2 wrote: | SQLGlot is really great, check it out! | sakras wrote: | I think the most interesting part of this project is the fault | tolerance. I can't say I've seen any other projects do this, but | it seems reasonable to want checkpointing during a long | computation. | | Another thing I like is that conceptually it seems like it would | be simple to switch the underlying query engine (right now it's | Polars) in the future. Seems like a pretty general distributed | system. | marsupialtail_2 wrote: | Perhaps you can write the underlying query engine :-) in | something that's _not_ Python lol | marsupialtail_2 wrote: | Checkpointing is actually a bit difficult for fault tolerance | for long running batch computations and state tends to grow | linearly as the computation progresses, unlike streaming | applications where you typically use windows. This means | periodic checkpointing leads to quadratic amount of writes to | durable storage which is pretty bad. | | Quokka supports checkpointing but does not enable that by | default to prevent this common problem from killing normal | operation performance. | zitterbewegung wrote: | Have you tested this with Jepsen? | marsupialtail_2 wrote: | The focus is olap and etl not oltp. In fact I assume data | doesn't change :-) | brunobowden wrote: | Python interpreter in SQL - the reverse of what was done - | would've been really impressive. Terrible idea of course but | impressive nonetheless. | hbirler wrote: | A Python to SQL compiler already exists https://db.cs.uni- | tuebingen.de/publications/2022/snakes-on-a... | marsupialtail_2 wrote: | I don't really compile python functions to sql -- I just | execute sql. It's just that the execution engine is written | in python | PontifexMinimus wrote: | Funnily enough I'm currently writing a NoSQL database in Python. | marsupialtail_2 wrote: | Name? | jkeddo wrote: | Why? | KMag wrote: | Most likely an autodidactic exercise. I'd love to see the | source, even if, or especially if, it's half-finished. Seeing | someone work their way through the snarls in such a project | would be useful. | theLiminator wrote: | Can you explain how this might differ from something like | https://github.com/apache/arrow-ballista | | I've seen several variants of "next-gen" spark, but nowhere have | I really seen the different tradeoffs/advantages/disadvantages | between them. | marsupialtail_2 wrote: | Firstly quokka is pure Python, which I believe is good for | interoperability with Python based UDFs. | | Secondly quokka tries to be fault tolerant. I.e. it can handle | worker failures intra query and not have to start over. This is | quite important in real world spark deployments with thousands | of nodes running many hours. AFAIK this is not well supported | by most spark alternatives. | | Finally quokka has a much stronger focus on time series data | analytics. It is meant to excel at workloads like range joins | and asof/PIT joins used for feature engineering. (This part | isn't too stable yet so is not open source) | | This means quokka optimizes on a different point in the | UDF/performance/fault tolerance tradeoff space than something | like arrow ballista or starrocks, which I think are pure | performance plays | theLiminator wrote: | Interesting, I was wondering if you considered building on | top of https://github.com/apache/arrow-datafusion-python | | I really do think a distributed db with compute/storage | separation and optimized for feature engineering/dataloading | (for training NNs) is underserved. | | I'd be very interested in the time series aspects of what | you're building. | marsupialtail_2 wrote: | hmm I wasn't aware of https://github.com/apache/arrow- | datafusion-python... thanks for the pointer. | | time series target release by April this year. main | challenge is supporting them in the SQL API -- execution | engine support is already done ___________________________________________________________________ (page generated 2022-12-30 23:00 UTC)