[HN Gopher] Query Engines: Push vs. Pull ___________________________________________________________________ Query Engines: Push vs. Pull Author : jsnell Score : 102 points Date : 2021-05-01 14:12 UTC (8 hours ago) (HTM) web link (justinjaffray.com) (TXT) w3m dump (justinjaffray.com) | Rusky wrote: | I like the use of generator functions for the pull-based example- | the syntax closely mirrors the push-based example, which drives | home the point that the difference between push and pull lies | entirely in whether you compose operators via an iterator API or | a callback API. You can also see this in how the two usage | examples are inside-out versions of each other. | | When you look at it from the perspective of a language designer | or compiler, things start to look even more similar: | | * Values that live across iterations are stored in a closure | (push) or generator object (pull) | | * In the first half of an iteration, before producing a result, | operators dispatch to their producers via a return address on the | call stack (push) or callbacks (pull) | | * In the second half of an iteration, while building up a result, | operators dispatch to their consumers via callbacks (push) or a | return address on the call stack (pull) | | The difference the article mentions between "unrolling" the two | approaches comes from the level of abstraction where inlining is | performed. The push model produces natural-looking output when | inlining the closures. The less-natural result of inlining the | pull model's `next` functions is where the control flow in loops | shows up. (See for example the performance benefits of adding a | push-model complement to Rust's usual pull-model Iterator trait: | https://github.com/rust-lang/rust/pull/42782#issuecomment-30...) | | But, you can get the same natural-looking output from the pull | model if you perform inlining at the level of generators! This is | a less familiar transformation than function inlining, but it's | still fundamentally the same idea, of erasing API boundaries and | ABI overhead- which in this case is a more straightforward | formulation of the various loop optimizations that can sometimes | get rid of the control-flow-in-loops produced by inlining `next` | functions. | | The difference around DAGs is similar. In the push model, a DAG | just means a producer calling into more than one consumer | callback. In the pull model, this is tricky because you can't | just yield (i.e. return from `next`) to more than one place- | return addresses form a stack! (And thus the dual problem for the | push model shows up for operations with multiple inputs, like the | merge join mentioned in the article.) | | Overall I think a more flexible approach might be to generalize | from `yield` and iterator APIs, to algebraic effects (or, lower | level, delimited continuations)- these let you build both push | and pull-like APIs for a single generator-like function, as well | as hybrid approaches that combine aspects of both, by removing | the requirement for a single privileged call stack. | rzzzt wrote: | LINQ also seems to have a push-based implementation, given its | FROM-WHERE-SELECT syntax: https://docs.microsoft.com/en- | us/dotnet/csharp/programming-g... | nathanaldensr wrote: | I don't think that is correct. Underlying much of LINQ is | _IEnumerable <T>_, which implements the classic _Current_ and | _MoveNext()_ iterator pattern. The _foreach_ keyword uses | _IEnumerable <T>_ to "pull" elements one at a time. | | LINQ's syntax is just syntactic sugar on top of chained method | calls to various interface extension methods. | rzzzt wrote: | Ahh, you are right. My eyes somehow drifted away from the | _pull_ example with yield above the "basic push-based query | engine" header, that is the one which I've found familiar. | mamcx wrote: | What caught my attention is the claim of push based queries to be | easier to generate imperative code, where I could learn about how | do that? | slver wrote: | I'm new to this terminology, but it looks a bit like lazily | computing a data stream vs. eagerly computing it, doesn't it? Or | am I wrong. | Ericson2314 wrote: | It is a little a bit like that. But the regular formulations of | "eager" and "lazy" assume expressions that are evaluated once | (or at at least if evaluated the again will yield the same | result). | | It's better to think of this stuff with constantly changing | underlying data, which adds a lot of "color" to the domain. | toddh wrote: | Nicely explained. I like the idle operators/row dichotomy. At a | system level push based systems suffer from the same weakness as | event driven systems: incompleteness. You can never be sure you | haven't missed an event because it was dropped during system | startup, network partition, failover, queue full, buffer | overflow, packet drop, etc. This doesn't always matter, but with | a pull based system you know the result is based on all the data, | but as you point out, that has a cost too. | cube2222 wrote: | Could you please explain at more length how push/pull plays a | role here? | | With push-based you'd still usually use lossless transports, | which could also have a "closing packet" which tells the | consumer that "this stream is done", so I don't see how pull- | based is better here. | | Re queue full: If I understand you correctly, queues are only a | problem when you start having multiple parallel stages one | after the other, with queues in-between. Then you have queueing | both with push-based and pull-based architectures. | cube2222 wrote: | I'm doing a rewrite[0] of OctoSQL[1], partly to make it push- | based instead of pull-based, and for now it's been much simpler | to work with, won't get into the detailed reasoning in this | comment. | | However, if somebody is interested, this is a great paper on how | to architect a query compiler: | https://www.cs.purdue.edu/homes/rompf/papers/tahboub-sigmod1... | | I found it very enlightening and simple to implement. The paper | also delves into details why the push-based architecture has more | room for optimizations. | | [0]:https://github.com/cube2222/octosql/tree/redesign | | [1]:https://github.com/cube2222/octosql/ | Ericson2314 wrote: | The truth is you need almost everything to do push and pull. Just | as pull alone does "forks" inefficiently, push alone does "joins" | inefficiently. | | People shy away from this conclusion because doing everything | both ways is quite hard and requires fancier abstractions to make | nice than people are familiar with. | Quekid5 wrote: | People have been vacillating on this forever and the truth | is... (drum roll) It depends. | scott0129 wrote: | > The truth is you need almost everything to do push and pull. | Just as pull alone does "forks" inefficiently, push alone does | "joins" inefficiently. | | That's really is a simple and intuitive summary. Thank you for | that insight! | legg0myegg0 wrote: | The DuckDB folks are migrating from pull to push and put together | this interesting documentation of their reasoning! They use a | vectorized model instead of a compiled one, so it's another | interesting comparison point. It seems like push will simplify | how they handle parallelism. | | https://github.com/duckdb/duckdb/issues/1583 | catblast01 wrote: | > This is the oldest and most well-known query execution model, | and was first proposed in 1994. | | Is the author younger than 25. Not picturing a world of databases | before 1994? Did the world even exist? | | I kid but regardless of when a paper describing some process was | published the "oldest" query execution model is decidedly a lot | older than 26 years. ___________________________________________________________________ (page generated 2021-05-01 23:00 UTC)