[HN Gopher] I wrote a SQL engine in Python
       ___________________________________________________________________
        
       I wrote a SQL engine in Python
        
       Author : marsupialtail_2
       Score  : 94 points
       Date   : 2022-12-30 20:08 UTC (2 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | einpoklum wrote:
       | > A library to parse and optimize SQL,
       | 
       | That's like saying "a library to parse and optimize computer
       | programs", except probably even harder, since a compiler and
       | runtime library can't make any assumptions about the programs
       | they need to make run, so they're limited in the potential of
       | utilizing all that context information.
       | 
       | Countless person-years have been spent on this and it's still a
       | very active fields of research and engineering.
       | 
       | > 2x SparkSQL performance
       | 
       | Ah, ok, so it can be slow. Never mind then, carry on :-P
        
         | captaintobs wrote:
         | I'm the author of SQLGlot. Optimizing SQL is quite common and
         | means things like projection / predicate push downs and logical
         | simplification. SQL can be optimized relatively easily compared
         | to a programming language because it is declarative. For
         | example
         | 
         | SELECT * FROM (SELECT * FROM x) WHERE z = 1
         | 
         | can be optimized into
         | 
         | SELECT * FROM x where z = 1
        
         | marsupialtail_2 wrote:
         | SQLGlot is really great, check it out!
        
       | sakras wrote:
       | I think the most interesting part of this project is the fault
       | tolerance. I can't say I've seen any other projects do this, but
       | it seems reasonable to want checkpointing during a long
       | computation.
       | 
       | Another thing I like is that conceptually it seems like it would
       | be simple to switch the underlying query engine (right now it's
       | Polars) in the future. Seems like a pretty general distributed
       | system.
        
         | marsupialtail_2 wrote:
         | Perhaps you can write the underlying query engine :-) in
         | something that's _not_ Python lol
        
         | marsupialtail_2 wrote:
         | Checkpointing is actually a bit difficult for fault tolerance
         | for long running batch computations and state tends to grow
         | linearly as the computation progresses, unlike streaming
         | applications where you typically use windows. This means
         | periodic checkpointing leads to quadratic amount of writes to
         | durable storage which is pretty bad.
         | 
         | Quokka supports checkpointing but does not enable that by
         | default to prevent this common problem from killing normal
         | operation performance.
        
       | zitterbewegung wrote:
       | Have you tested this with Jepsen?
        
         | marsupialtail_2 wrote:
         | The focus is olap and etl not oltp. In fact I assume data
         | doesn't change :-)
        
       | brunobowden wrote:
       | Python interpreter in SQL - the reverse of what was done -
       | would've been really impressive. Terrible idea of course but
       | impressive nonetheless.
        
         | hbirler wrote:
         | A Python to SQL compiler already exists https://db.cs.uni-
         | tuebingen.de/publications/2022/snakes-on-a...
        
           | marsupialtail_2 wrote:
           | I don't really compile python functions to sql -- I just
           | execute sql. It's just that the execution engine is written
           | in python
        
       | PontifexMinimus wrote:
       | Funnily enough I'm currently writing a NoSQL database in Python.
        
         | marsupialtail_2 wrote:
         | Name?
        
         | jkeddo wrote:
         | Why?
        
           | KMag wrote:
           | Most likely an autodidactic exercise. I'd love to see the
           | source, even if, or especially if, it's half-finished. Seeing
           | someone work their way through the snarls in such a project
           | would be useful.
        
       | theLiminator wrote:
       | Can you explain how this might differ from something like
       | https://github.com/apache/arrow-ballista
       | 
       | I've seen several variants of "next-gen" spark, but nowhere have
       | I really seen the different tradeoffs/advantages/disadvantages
       | between them.
        
         | marsupialtail_2 wrote:
         | Firstly quokka is pure Python, which I believe is good for
         | interoperability with Python based UDFs.
         | 
         | Secondly quokka tries to be fault tolerant. I.e. it can handle
         | worker failures intra query and not have to start over. This is
         | quite important in real world spark deployments with thousands
         | of nodes running many hours. AFAIK this is not well supported
         | by most spark alternatives.
         | 
         | Finally quokka has a much stronger focus on time series data
         | analytics. It is meant to excel at workloads like range joins
         | and asof/PIT joins used for feature engineering. (This part
         | isn't too stable yet so is not open source)
         | 
         | This means quokka optimizes on a different point in the
         | UDF/performance/fault tolerance tradeoff space than something
         | like arrow ballista or starrocks, which I think are pure
         | performance plays
        
           | theLiminator wrote:
           | Interesting, I was wondering if you considered building on
           | top of https://github.com/apache/arrow-datafusion-python
           | 
           | I really do think a distributed db with compute/storage
           | separation and optimized for feature engineering/dataloading
           | (for training NNs) is underserved.
           | 
           | I'd be very interested in the time series aspects of what
           | you're building.
        
             | marsupialtail_2 wrote:
             | hmm I wasn't aware of https://github.com/apache/arrow-
             | datafusion-python... thanks for the pointer.
             | 
             | time series target release by April this year. main
             | challenge is supporting them in the SQL API -- execution
             | engine support is already done
        
       ___________________________________________________________________
       (page generated 2022-12-30 23:00 UTC)