[HN Gopher] Query Engines: Push vs. Pull
       ___________________________________________________________________
        
       Query Engines: Push vs. Pull
        
       Author : jsnell
       Score  : 102 points
       Date   : 2021-05-01 14:12 UTC (8 hours ago)
        
 (HTM) web link (justinjaffray.com)
 (TXT) w3m dump (justinjaffray.com)
        
       | Rusky wrote:
       | I like the use of generator functions for the pull-based example-
       | the syntax closely mirrors the push-based example, which drives
       | home the point that the difference between push and pull lies
       | entirely in whether you compose operators via an iterator API or
       | a callback API. You can also see this in how the two usage
       | examples are inside-out versions of each other.
       | 
       | When you look at it from the perspective of a language designer
       | or compiler, things start to look even more similar:
       | 
       | * Values that live across iterations are stored in a closure
       | (push) or generator object (pull)
       | 
       | * In the first half of an iteration, before producing a result,
       | operators dispatch to their producers via a return address on the
       | call stack (push) or callbacks (pull)
       | 
       | * In the second half of an iteration, while building up a result,
       | operators dispatch to their consumers via callbacks (push) or a
       | return address on the call stack (pull)
       | 
       | The difference the article mentions between "unrolling" the two
       | approaches comes from the level of abstraction where inlining is
       | performed. The push model produces natural-looking output when
       | inlining the closures. The less-natural result of inlining the
       | pull model's `next` functions is where the control flow in loops
       | shows up. (See for example the performance benefits of adding a
       | push-model complement to Rust's usual pull-model Iterator trait:
       | https://github.com/rust-lang/rust/pull/42782#issuecomment-30...)
       | 
       | But, you can get the same natural-looking output from the pull
       | model if you perform inlining at the level of generators! This is
       | a less familiar transformation than function inlining, but it's
       | still fundamentally the same idea, of erasing API boundaries and
       | ABI overhead- which in this case is a more straightforward
       | formulation of the various loop optimizations that can sometimes
       | get rid of the control-flow-in-loops produced by inlining `next`
       | functions.
       | 
       | The difference around DAGs is similar. In the push model, a DAG
       | just means a producer calling into more than one consumer
       | callback. In the pull model, this is tricky because you can't
       | just yield (i.e. return from `next`) to more than one place-
       | return addresses form a stack! (And thus the dual problem for the
       | push model shows up for operations with multiple inputs, like the
       | merge join mentioned in the article.)
       | 
       | Overall I think a more flexible approach might be to generalize
       | from `yield` and iterator APIs, to algebraic effects (or, lower
       | level, delimited continuations)- these let you build both push
       | and pull-like APIs for a single generator-like function, as well
       | as hybrid approaches that combine aspects of both, by removing
       | the requirement for a single privileged call stack.
        
       | rzzzt wrote:
       | LINQ also seems to have a push-based implementation, given its
       | FROM-WHERE-SELECT syntax: https://docs.microsoft.com/en-
       | us/dotnet/csharp/programming-g...
        
         | nathanaldensr wrote:
         | I don't think that is correct. Underlying much of LINQ is
         | _IEnumerable <T>_, which implements the classic _Current_ and
         | _MoveNext()_ iterator pattern. The _foreach_ keyword uses
         | _IEnumerable <T>_ to "pull" elements one at a time.
         | 
         | LINQ's syntax is just syntactic sugar on top of chained method
         | calls to various interface extension methods.
        
           | rzzzt wrote:
           | Ahh, you are right. My eyes somehow drifted away from the
           | _pull_ example with yield above the "basic push-based query
           | engine" header, that is the one which I've found familiar.
        
       | mamcx wrote:
       | What caught my attention is the claim of push based queries to be
       | easier to generate imperative code, where I could learn about how
       | do that?
        
       | slver wrote:
       | I'm new to this terminology, but it looks a bit like lazily
       | computing a data stream vs. eagerly computing it, doesn't it? Or
       | am I wrong.
        
         | Ericson2314 wrote:
         | It is a little a bit like that. But the regular formulations of
         | "eager" and "lazy" assume expressions that are evaluated once
         | (or at at least if evaluated the again will yield the same
         | result).
         | 
         | It's better to think of this stuff with constantly changing
         | underlying data, which adds a lot of "color" to the domain.
        
       | toddh wrote:
       | Nicely explained. I like the idle operators/row dichotomy. At a
       | system level push based systems suffer from the same weakness as
       | event driven systems: incompleteness. You can never be sure you
       | haven't missed an event because it was dropped during system
       | startup, network partition, failover, queue full, buffer
       | overflow, packet drop, etc. This doesn't always matter, but with
       | a pull based system you know the result is based on all the data,
       | but as you point out, that has a cost too.
        
         | cube2222 wrote:
         | Could you please explain at more length how push/pull plays a
         | role here?
         | 
         | With push-based you'd still usually use lossless transports,
         | which could also have a "closing packet" which tells the
         | consumer that "this stream is done", so I don't see how pull-
         | based is better here.
         | 
         | Re queue full: If I understand you correctly, queues are only a
         | problem when you start having multiple parallel stages one
         | after the other, with queues in-between. Then you have queueing
         | both with push-based and pull-based architectures.
        
       | cube2222 wrote:
       | I'm doing a rewrite[0] of OctoSQL[1], partly to make it push-
       | based instead of pull-based, and for now it's been much simpler
       | to work with, won't get into the detailed reasoning in this
       | comment.
       | 
       | However, if somebody is interested, this is a great paper on how
       | to architect a query compiler:
       | https://www.cs.purdue.edu/homes/rompf/papers/tahboub-sigmod1...
       | 
       | I found it very enlightening and simple to implement. The paper
       | also delves into details why the push-based architecture has more
       | room for optimizations.
       | 
       | [0]:https://github.com/cube2222/octosql/tree/redesign
       | 
       | [1]:https://github.com/cube2222/octosql/
        
       | Ericson2314 wrote:
       | The truth is you need almost everything to do push and pull. Just
       | as pull alone does "forks" inefficiently, push alone does "joins"
       | inefficiently.
       | 
       | People shy away from this conclusion because doing everything
       | both ways is quite hard and requires fancier abstractions to make
       | nice than people are familiar with.
        
         | Quekid5 wrote:
         | People have been vacillating on this forever and the truth
         | is... (drum roll) It depends.
        
         | scott0129 wrote:
         | > The truth is you need almost everything to do push and pull.
         | Just as pull alone does "forks" inefficiently, push alone does
         | "joins" inefficiently.
         | 
         | That's really is a simple and intuitive summary. Thank you for
         | that insight!
        
       | legg0myegg0 wrote:
       | The DuckDB folks are migrating from pull to push and put together
       | this interesting documentation of their reasoning! They use a
       | vectorized model instead of a compiled one, so it's another
       | interesting comparison point. It seems like push will simplify
       | how they handle parallelism.
       | 
       | https://github.com/duckdb/duckdb/issues/1583
        
       | catblast01 wrote:
       | > This is the oldest and most well-known query execution model,
       | and was first proposed in 1994.
       | 
       | Is the author younger than 25. Not picturing a world of databases
       | before 1994? Did the world even exist?
       | 
       | I kid but regardless of when a paper describing some process was
       | published the "oldest" query execution model is decidedly a lot
       | older than 26 years.
        
       ___________________________________________________________________
       (page generated 2021-05-01 23:00 UTC)