[HN Gopher] Function pipelines: Building functional programming ...
       ___________________________________________________________________
        
       Function pipelines: Building functional programming into PostgreSQL
        
       Author : oconnor663
       Score  : 252 points
       Date   : 2021-10-19 15:29 UTC (7 hours ago)
        
 (HTM) web link (blog.timescale.com)
 (TXT) w3m dump (blog.timescale.com)
        
       | eska wrote:
       | Cool stuff!
       | 
       | Out of curiosity: how does the composition of the sort function
       | work? Can you avoid knowing all rows at the point it occurs?
        
         | Lockerman wrote:
          | Not yet; right now all datapoints must be known at the time
          | of the sort.
        
       | tytrdev wrote:
       | _laughs in datomic_
        
         | dkersten wrote:
          | Not everyone can use datomic though, for many reasons
          | (including cost). I'd love to use datomic, but it's just not
          | an option for me, while postgres is (and the DigitalOcean
          | hosted postgres is quite affordable and supports TimescaleDB
          | too).
        
           | claytonjy wrote:
           | Would that be the community version of Timescale then?
           | 
           | Looks like this is a whole new extension, does DO allow you
           | to install your own, or do you think it'd be easy to make a
           | case for them adding this one? (I'm not a DO user myself)
        
             | djk447 wrote:
             | (NB: Post author here)
             | 
              | This is in the TimescaleDB Toolkit extension [1], which
              | is licensed under our community license for now and is
              | not available on DO. It is available, fully managed, on
              | our cloud service. You can also install it and run it
              | for free yourself.
             | 
             | [1]: https://github.com/timescale/timescaledb-toolkit
        
         | ithrow wrote:
         | Until you use the very slow and extremely buggy black box in
         | production.
        
       | darksaints wrote:
        | This is really freaking cool. I'm not often impressed by
        | clever hacks in SQL, but this solves a really annoying
        | problem, and the specific way in which you solved it, with
        | just custom functions and operators, is ingenious and
        | reassuring, as opposed to some of the other crazy ideas that
        | I've seen to extend SQL.
       | 
       | I would love to see more about how you implemented the lambdas,
       | as that's the one thing that is still non-obvious to me.
        
         | djk447 wrote:
         | (NB: Post author here)
         | 
         | We're thinking about doing a post on them at some point in the
         | future!
         | 
         | Any specific questions we should answer?
        
           | darksaints wrote:
           | I guess mostly how the syntax works. The `$$` thing I mostly
           | recognize as being a code block in custom functions, but I'm
           | not sure how it works in the context of a lambda. Also, is
           | `$value` just a special identifier for the input?
           | 
           | A separate question...is it possible to have tuple-valued
           | timevectors?
        
             | djk447 wrote:
              | The docs have a bit more on it [1]. The $$ is really
              | just a dollar-quoted string [2]; you can use any sort of
              | string in there, but dollar quoting makes it a lot
              | easier to read.
              | 
              | On your other question: not right now, but if it's
              | something you want, file a github issue. It's something
              | we've thought a bit about, and it gets hairy pretty
              | quickly, but it also increases the power of these things
              | dramatically. We'll probably do something with it, but
              | it may be a little while and we may let this stuff bake
              | for a bit first...
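              | 
              | For example, a sketch using the filter() element: these
              | two should be equivalent, since $$...$$ is just a string
              | literal and $value names the current point's value
              | inside the lambda:
              | 
              |     SELECT timevector(ts, val) -> filter($$ $value > 0 $$)
              |     FROM measurements;
              | 
              |     SELECT timevector(ts, val) -> filter(' $value > 0 ')
              |     FROM measurements;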
             | 
              | [1]: https://docs.timescale.com/timescaledb/latest/how-to-guides/...
              | [2]: https://www.postgresqltutorial.com/dollar-quoted-string-cons...
        
       | ghop02 wrote:
        | While I agree the partition syntax may not be super
        | straightforward, adding a new operator for another way to
        | sort, sum, abs, etc. seems like chaos
        
         | claytonjy wrote:
          | I think "may not be super straightforward" undersells the
          | learning curve here. I took to SQL pretty naturally, but
          | gnarly partitioning is still the easiest way for me to get
          | tripped up. Having taught a lot of data-sciencey folks
          | advanced SQL, I've found that even simple uses of
          | partitioning tend to be a big barrier compared to what
          | comes before.
        
         | djk447 wrote:
         | (NB: Post author here)
         | 
          | Totally understand the concern. For now, it's just for a
          | particular datatype; all of the other operations happen
          | normally inside your query, and you can still use window
          | functions...
          | 
          | And btw, I love window functions and will probably be doing
          | a post on them, but they can be a little bit weird and have
          | some really odd ordering properties with aggregates etc.
          | 
          | The point of this is really much more about providing
          | choices for folks. But yeah, if we find that people are
          | just getting really confused, we could reconsider; that's
          | part of why we released it experimentally.
        
           | akulkarni wrote:
           | Also, the team picked `->` as the operator, which I thought
           | was a very natural and intuitive choice :-)
           | 
            | e.g.,
            | 
            |     SELECT device_id,
            |            timevector(ts, val) -> sort() -> delta()
            |              -> abs() -> sum() as volatility
        
             | OJFord wrote:
             | What if those functions (were something that) returned
             | json{,b}?
             | 
             | Oh or I see timevector returns a custom datatype, so I
             | suppose the answer is it has to be one of a fixed number of
             | provided types?
        
               | djk447 wrote:
               | (NB: post author here)
               | 
                | We're planning on providing various outputs, and json
                | is definitely one format we'd want to support. Want to
                | file a github issue with the layout you'd want, so we
                | can start getting some feedback to implement?
        
               | Lockerman wrote:
               | > Oh or I see timevector returns a custom datatype, so I
               | suppose the answer is it has to be one of a fixed number
               | of provided types?
               | 
               | Exactly! These aren't general arithmetic operators,
               | they're specifically for mapping arithmetic operations
               | over timevectors.
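                | 
                | e.g., a sketch (assuming the experimental map()
                | element): this maps an arithmetic lambda over every
                | point in the timevector, rather than over arbitrary
                | values:
                | 
                |     SELECT timevector(ts, val)
                |         -> map($$ $value * 9.0 / 5.0 + 32.0 $$)  -- C to F
                |     FROM measurements;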
        
       | ecoulthard wrote:
       | I'm looking for a more efficient way of detecting errors in
       | metrics that store running totals, such as Odometer. I use the
       | delta of Odometer to compute Distance. But decreases in Odometer
       | causes incorrect Distance calculations. Odometer should never
       | decrease. It is an error if Odometer decreases. Would function
       | pipelines be able to help me detect decreases in Odometer?
        
         | djk447 wrote:
         | (NB: Post author here)
         | 
          | Yes! You can use the `delta()` element followed by a
          | `filter()` with a lambda looking for negative numbers, I
          | think!
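          | 
          | Something like this sketch (hypothetical table and column
          | names, using the filter() lambda syntax):
          | 
          |     SELECT device_id,
          |            timevector(ts, odometer) -> sort() -> delta()
          |              -> filter($$ $value < 0 $$)  -- keep only decreases
          |              -> sum() as total_decrease
          |     FROM odometer_readings
          |     GROUP BY device_id;
          | 
          | Any device with a negative total_decrease had at least one
          | erroneous decrease.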
        
       | jbylund wrote:
        | What does the explain plan for a pipeline like this look like?
        | 
        | In my experience, one of the biggest things scaring devs away
        | from moving more logic over the wire towards the db is the
        | friction between version control and database state. If I end
        | up defining 10s or 100s of custom pipeline operators in git,
        | how do I keep that in sync with the db? And am I responsible
        | for doing all that legwork myself?
        
       | 1point618 wrote:
       | This is immediately useful. The number of awkwardly nested
       | subqueries that I write because I don't have this kind of
       | functionality in BigQuery is a huge bummer.
       | 
       | I am curious, how many Haskell programmers do you all have? And
       | more seriously, do you have any plans to help provide this
       | functionality to non-Postgres SQLs, or at least help those trying
       | to take inspiration from it?
        
         | fifilura wrote:
          | I wonder if the array functions in the presto engine (used
          | for example in Amazon Athena) would be able to handle this
          | too?
         | 
         | https://prestodb.io/docs/0.217/functions/array.html
        
           | fifilura wrote:
            | Ok, a 15 minute hack. It may be possible to make it look
            | nicer.
            | 
            | And it looked better in TFA, so good job!
            | 
            |     SELECT device_id,
            |            reduce(
            |                zip_with(
            |                    array_sort(zip(array_agg(ts), array_agg(val))),
            |                    slice(array_sort(zip(array_agg(ts), array_agg(val))),
            |                          2, 5000),
            |                    (a, b) -> abs(a.field1 - b.field1)),
            |                0, (s, x) -> s + COALESCE(x, 0), s -> s) as res
            |     FROM (
            |         VALUES
            |         (1, 2, 1),
            |         (1, 3, 0),
            |         (1, 9, 3),
            |         (1, 3, 4),
            |         (2, 2, 1),
            |         (2, 8, 0),
            |         (3, 4, 3)
            |     ) AS t(device_id, val, ts)
            |     GROUP BY device_id
            | 
            |     device_id  res
            |     1          14
            |     3          0
            |     2          6
        
         | jpitz wrote:
          | Minor nitpick: Nested subqueries ARE awkward, which is why
          | I would express
          | 
          |     SELECT device_id, sum(abs_delta) as volatility
          |     FROM (
          |         SELECT device_id,
          |                abs(val - lag(val)
          |                    OVER (PARTITION BY device_id ORDER BY ts)
          |                ) as abs_delta
          |         FROM measurements
          |         WHERE ts >= now() - '1 day'::interval
          |     ) calc_delta
          |     GROUP BY device_id;
          | 
          | this way:
          | 
          |     WITH temperature_delta_past_day AS
          |     (
          |         SELECT device_id,
          |                abs(val - lag(val)
          |                    OVER (PARTITION BY device_id ORDER BY ts)
          |                ) as abs_delta
          |         FROM measurements
          |         WHERE ts >= now() - '1 day'::interval ) calc_delta
          |         -- edit: ") calc_delta" on the line above is a typo
          |     )
          |     SELECT device_id, sum(abs_delta) as volatility
          |     FROM temperature_delta_past_day
          |     GROUP BY device_id;
         | 
         | To me, it is a lot more natural to use SQL's CTE syntax to
         | 'predefine' my projections before making use of them, instead
         | of trying to define them inline - like the difference between
         | when you'd define a lambda directly inline vs defining a
         | separate function for it.
         | 
          | I don't know if trying to embed a DataFrame-esque API
          | inside of SQL is a thing that would benefit me, but it is
          | an interesting idea.
        
           | jpruittsql wrote:
           | CTEs are wonderful for readability, but until recent versions
           | of PostgreSQL they were always "materialized" which can have
           | performance implications vs. subqueries. The NOT MATERIALIZED
           | option to CTEs was added in PostgreSQL 12. (Timescale
           | Engineer)
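            | 
            | e.g., a quick sketch of the PostgreSQL 12+ syntax:
            | 
            |     WITH calc AS NOT MATERIALIZED (
            |         SELECT device_id, val FROM measurements
            |     )
            |     SELECT * FROM calc WHERE device_id = 1;
            |     -- NOT MATERIALIZED lets the planner push the predicate
            |     -- down into the CTE, as it would for a subquery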
        
           | nextos wrote:
           | Came here to say the same thing. Some years back, to pass my
           | CS databases course, we were required to write really complex
           | queries in the computer lab. We had very limited time to do
           | so.
           | 
           | Many people failed because their queries became complex
           | monoliths, hard to debug or optimize when things went wrong.
           | 
           | That's because they limited themselves to SQL-92. We were
           | using Oracle, so there was no reason not to use SQL:1999. I
           | made heavy use of WITH, and it was quite effortless.
        
           | pmontra wrote:
            | I'm also using CTEs a lot, precisely because subqueries
            | are hard to reason about, and these pipelines are in turn
            | even easier.
        
           | fifilura wrote:
            | I agree. There is something magical about CTEs; just
            | moving things around a little bit makes it so much easier
            | to comprehend.
           | 
           | You could even move the val and lag(val) to two different
           | columns and do the abs() in the summary.
           | 
           | That way you can query temperature_delta_past_day and see for
           | yourself what it does.
        
           | djk447 wrote:
           | (NB: Post author here)
           | 
           | Yeah. CTEs definitely make it a bit easier to read, though
           | some people get more confused by them, especially because
           | they don't exist in all SQL variants.
           | 
            | And totally agree with that last bit! We want to see if
            | it's useful for folks; it's released experimentally now
            | and we'll see what folks can do with it. One thing that's
            | fun, and that we may do a post explaining a bit more, is
            | that these pipelines are actually values themselves, so
            | the transforms that you run can be stored in a column in
            | the database.
           | 
           | And that starts offering some really mind-bending stuff. The
           | example I used was building on the one in the post except now
           | you have thermocouples with different calibration curves. You
           | can actually store a polynomial or other calibration curve in
           | a column and apply the correct calibration to each individual
           | thermocouple with a JOIN...which is kinda crazy, but pretty
           | awesome. So we want to figure out how to use these and what
           | people can do with them and see where it takes us.
        
             | jpitz wrote:
              | They've been around quite a long time, but yeah, it
              | seems like only SQL-89 gets taught.
             | 
             | Window Functions and CTEs are both major force multipliers
             | in the language, so I always encourage folks to go learn
             | them.
        
           | EE84M3i wrote:
            | Note that your second example has a misplaced pair of
            | parens:
            | 
            |     WITH temperature_delta_past_day AS
            |     (
            |         SELECT device_id,
            |                abs(val - lag(val)
            |                    OVER (PARTITION BY device_id ORDER BY ts)
            |                ) as abs_delta
            |         FROM measurements
            |         WHERE ts >= now() - '1 day'::interval) calc_delta
            |     )
            |     SELECT device_id, sum(abs_delta) as volatility
            |     FROM temperature_delta_past_day
            |     GROUP BY device_id;
            | 
            | should probably be
            | 
            |     WITH temperature_delta_past_day AS
            |     (
            |         SELECT device_id,
            |                abs(val - lag(val)
            |                    OVER (PARTITION BY device_id ORDER BY ts)
            |                ) as abs_delta
            |         FROM measurements
            |         WHERE ts >= now() - '1 day'::interval
            |     )
            |     SELECT device_id, sum(abs_delta) as volatility
            |     FROM temperature_delta_past_day
            |     GROUP BY device_id;
        
             | jpitz wrote:
             | You're correct - I was in a hurry to express the idea, and
             | I failed to check the where clause.
        
         | djk447 wrote:
         | (NB: Post author here)
         | 
         | We don't have many Haskell programmers, we mostly work in Rust,
         | C and Go, but we're always open to new things...
         | 
          | This is pretty Postgres specific. From the beginning,
          | Postgres has focused on extensibility and allowed this sort
          | of stuff with custom functions/operators/types. Many other
          | SQL variants don't have that. It's one of the main things
          | that sets Postgres apart from other databases; see
          | Stonebraker's great history of this, specifically the stuff
          | on Object Relational databases [1].
          | 
          | We're pretty focused on building on top of Postgres because
          | of that functionality. We do have some other stuff to make
          | Postgres more scalable, and you're welcome to try us out,
          | but if there's something specific that BigQuery offers that
          | you need, feel free to file a github issue around that too.
          | But yeah, no plans to do things like this in other
          | databases; they just don't have the infrastructure...
          | 
          | [1]: https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/pape...
        
           | akulkarni wrote:
           | (TimescaleDB co-founder)
           | 
            | Agree - what would you need to see in PostgreSQL /
            | TimescaleDB to enable you to switch from BigQuery? We are
            | all ears :-)
        
             | methyl wrote:
              | Not the op, but infinite scalability, without any
              | infrastructure work required besides loading the data,
              | is what keeps me on BigQuery
        
               | akulkarni wrote:
               | Thanks for sharing.
               | 
               | "infinite scalability" is something we are aiming for in
               | our new vision for Timescale Cloud:
               | 
                | https://blog.timescale.com/blog/announcing-the-new-timescale...
               | 
               | We are not there today, but we are making progress.
               | 
               | Any other feedback - please let us know :-)
        
       | btown wrote:
       | To do something similar without custom extensions, it's worth
       | knowing about Postgres's lateral joins, which allow you to reuse
       | columns from the left side of your join in the right side. It
       | basically realizes the "enrich my results" promise that SQL
       | should always have had, and makes incredibly flexible analytics
       | queries possible without ever needing to indent more than once!
       | 
       | https://heap.io/blog/postgresqls-powerful-new-join-type-late...
       | 
       | https://ddrscott.github.io/blog/2017/what-the-sql-lateral/
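        | 
        | A quick sketch of the pattern (hypothetical tables): for each
        | device, pull its latest reading, reusing the left side's
        | device_id inside the right side:
        | 
        |     SELECT d.device_id, last_reading.ts, last_reading.val
        |     FROM devices d,
        |     LATERAL (
        |         SELECT m.ts, m.val
        |         FROM measurements m
        |         WHERE m.device_id = d.device_id  -- left-side column reused
        |         ORDER BY m.ts DESC
        |         LIMIT 1
        |     ) last_reading;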
        
       | chrisjc wrote:
       | So does this functional SQL just get "transpiled" into analytical
       | SQL? If so, where does "transpilation" occur, in the client,
       | intercepted before hitting the query engine, etc? Or has the
       | query engine itself been modified to handle functional SQL?
       | 
       | If the query engine itself has been modified, are there any
       | performance gains or losses?
       | 
       | This is a pretty interesting and compelling idea. I do wonder if
       | the functional SQL approach might entice developers to write very
       | bad/expensive/ineffective queries. Then again, nothing is
       | stopping developers from doing the same with analytical SQL.
        
         | ccleve wrote:
         | Duplicating my question to Lockerman below. Sorry for the
         | double post.
         | 
          | Is the -> operator just syntactic sugar? Is this:
          | 
          |     SELECT device_id,
          |            timevector(ts, val) -> sort() -> delta()
          |              -> abs() -> sum() as volatility
          | 
          | the same as this?
          | 
          |     SELECT device_id,
          |            sum(abs(delta(sort(timevector(ts, val))))) as volatility
        
           | djk447 wrote:
            | Sorry I missed this before, I was out grabbing lunch;
            | answered above!
        
           | WireBaron wrote:
          | This is essentially correct. It's a little more complicated
          | inside, and the expansion's really more like:
          | 
          |     SELECT device_id,
          |            arrow_run_pipeline(
          |                timevector(ts, val),
          |                arrow_add_element(sort(),
          |                    arrow_add_element(delta(),
          |                        arrow_add_element(abs(), sum()))));
          | 
          | The notable difference here is that this presents a lot more
          | optimization potential, as the entire pipeline can
          | conceivably be applied in one pass through the table.
        
         | djk447 wrote:
         | (NB: Post author here)
         | 
         | So this is a great question and we may do a whole post on this
         | at some point in the future. There are definitely ways to shoot
         | yourself in the foot, but hopefully we can avoid at least some
         | of them.
         | 
          | The main way this happens now is that the pipeline elements
          | actually return a custom data type that represents the
          | composed set of functions, and we can (for at least part of
          | the pipeline) handle simplification and combination of that
          | into a single set of functions to run over the
          | `timevector`.
          | 
          | With that said, there isn't great support for common
          | subexpression elimination in PG right now. We're hoping we
          | can solve some of that with both subqueries and maybe some
          | fun stuff in the planner/executor, but we're going to try
          | to solve that as we think about moving this from
          | experimental into stable.
          | 
          | Right now we hardly modify the query engine at all; these
          | are just normal functions that are run, and they'll
          | probably remain that way for a while. The optimization here
          | will happen more inside the data structure or within a row,
          | to make sure we're not scanning the same `timevector`
          | multiple times, ideally. We do hope to add in some
          | vectorization and other bits as we keep going, but I'm not
          | 100% sure where that's going to go yet. I'm also going to
          | ask a colleague to respond and clarify this, because he's
          | working more heavily on it; I know some of the bits, but
          | only some of it...
        
           | chrisjc wrote:
           | > return a custom data type that represents the composed set
           | of functions, and we can (for at least part of the pipeline)
           | handle simplification and combination of that into a single
           | set of functions to run over the `timevector`
           | 
           | I'm really having trouble wrapping my head around this idea.
           | I guess it speaks volumes about the potential and power of
           | PostgreSQL in general.
           | 
           | Has PostgreSQL ever been used like this in the past, or is
           | this really thinking outside of the box?
           | 
           | We don't use TimeScale or have any intention of in the
           | immediate future (Snowflake provides 99% of what we need
           | atm), but I'm always interested and impressed to hear what
           | they're doing.
        
             | djk447 wrote:
             | (NB: Post author here)
             | 
              | Yeah, it's a bit mind-bending, I guess. I'll try to
              | explain, but it might just need a longer-format thing;
              | we'll probably do a piece on this in the future.
             | 
             | Essentially, each function returns a special
             | "PipelineElement" datatype that identifies the function,
             | but not what it's acting on. When you string them together,
             | it returns a "PipelineElement" that represents the
             | composition of all the functions with the correct ordering
             | etc. That can then be applied to the incoming `timevector`.
             | 
              | As far as I know, no one's done this before. I don't
              | think it's a completely new idea; Stonebraker thought
              | about how you can treat code as data, and thought that
              | bringing data and code closer together was an important
              | goal of the Postgres project, but this specific way of
              | doing it is pretty new and out there, I think!
             | 
             | I think most people would think that in order to do this
             | you'd need to do something like modify the parser or
             | something more invasive.
             | 
             | (We do use one planner hook on the function call, the
             | expression simplification one that allows us to convert
             | something like `timevector(ts, val) -> sort() -> delta() ->
             | abs() -> sum()` into `timevector(ts, val) -> (sort() ->
             | delta() -> abs() -> sum())` so that all the operations can
             | be done in a single pass. But that's pretty minor. If we
             | need to figure something out around the common
             | subexpression elimination we may need to hook into the
             | planner a little more, we'll see).
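              | 
              | (A toy illustration of the mechanism, not our actual
              | implementation: a custom type can carry "code as data",
              | and a custom operator on that type can compose it.)
              | 
              |     -- hypothetical, minimal version of the idea
              |     CREATE TYPE pipe AS (fns text[]);
              | 
              |     CREATE FUNCTION pipe_compose(a pipe, b pipe)
              |     RETURNS pipe AS $$
              |         SELECT ROW(a.fns || b.fns)::pipe;  -- concatenation = composition
              |     $$ LANGUAGE SQL IMMUTABLE;
              | 
              |     CREATE OPERATOR -> (
              |         LEFTARG = pipe, RIGHTARG = pipe,
              |         FUNCTION = pipe_compose
              |     );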
        
               | [deleted]
        
               | djk447 wrote:
                | Oh yeah, and building on this: this is also the
                | source of the ability to actually store the pipeline
                | element in the database, say in a column, and apply
                | it later, which I mentioned in a comment below.
                | That's one of the cool mind-bending things this
                | allows.
               | 
                | The example I gave below was building on the example
                | in the post, where we had thermometers. Except now
                | imagine you have thermocouples, each with its own
                | calibration curve, and they're different types, so
                | they might even have different kinds of equations
                | providing the calibration curve.
               | 
               | You can now store the pipeline that does the correct
               | adjustments and calculates the actual temperature
               | separately for each `device_id` and JOIN to the
               | "calibrations" table to apply the calculation.
               | 
               | It's kinda crazy, but really cool and I'm hoping to do a
               | post on that whole thing at some point to encourage
               | people to play around with it and see what sorts of
               | things it allows...
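                | 
                | A rough sketch of what that could look like
                | (hypothetical table, column, and type names; the
                | stored pipeline's actual type name is whatever the
                | toolkit exposes):
                | 
                |     CREATE TABLE calibrations (
                |         device_id   int PRIMARY KEY,
                |         calibration pipeline_element  -- placeholder type name
                |     );
                | 
                |     SELECT m.device_id,
                |            timevector(m.ts, m.raw_val) -> c.calibration as calibrated
                |     FROM measurements m
                |     JOIN calibrations c USING (device_id)
                |     GROUP BY m.device_id, c.calibration;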
        
               | derefr wrote:
                | For your next magic trick, I'd suggest using these to
                | build a Postgres equivalent of Lucene's persisted
                | Levenshtein automata :)
        
               | [deleted]
        
         | [deleted]
        
         | Lockerman wrote:
         | (Timescale engineer here)
         | 
          | To summarize a bit of what David said here [1]: there are
          | no modifications to the query engine; this is all using
          | Postgres's custom operator support.
          | 
          | [1]: https://news.ycombinator.com/item?id=28920110
        
           | ccleve wrote:
            | Is this just syntactic sugar, then?
            | 
            | Is this:
            | 
            |     SELECT device_id,
            |            timevector(ts, val) -> sort() -> delta()
            |              -> abs() -> sum() as volatility
            | 
            | the same as this?
            | 
            |     SELECT device_id,
            |            sum(abs(delta(sort(timevector(ts, val))))) as volatility
        
             | djk447 wrote:
             | (NB Post author)
             | 
              | Kinda. It's close to that, except in order to make it
              | do that we had to actually make the functions return
              | special types, so it's more equivalent to something
              | like:
              | 
              |     SELECT device_id,
              |            apply(sum, apply(abs, apply(delta,
              |                apply(sort, timevector(ts, val)))))
              | 
              | where each of the items in there is an instance of our
              | special "PipelineElement" type, which defines the
              | function it's applying.
              | 
              | Does that make any sense at all? Not sure if I'm
              | explaining this well...
        
               | ccleve wrote:
               | It does make sense. Thanks.
        
         | stingraycharles wrote:
          | I assume that this special syntax allows for better
          | optimizations; the SQL variant is much more flexible, which
          | means you can make far fewer assumptions, and thus there's
          | less room for optimizations.
          | 
          | Not sure if they're doing that (especially for the v1
          | version of what looks to be primarily a syntactic selling
          | point, rather than a perf selling point). But that's what I
          | would do.
        
           | djk447 wrote:
           | (NB: post author here)
           | 
            | Yep! You got it exactly right! We're doing it for syntax
            | first, and then hopefully we'll go back and start doing
            | interesting performance optimizations on top. There are
            | definitely some things we're thinking about.
        
           | chrisjc wrote:
           | Wow, great explanation! Now I really like this idea!
        
       | ccleve wrote:
       | It appears that all these extra functions are written in Rust,
       | but the core Timescale product is written in C. Why did you do it
       | that way? Would you do it differently if you were starting from
       | scratch today?
        
         | djk447 wrote:
         | (NB: Post author here!)
         | 
          | This is all true! We have a bit more info on that choice in
          | this post [1]. In general, the Toolkit extension [2] is
          | meant to have a bit of a lighter touch than the core
          | TimescaleDB extension [3]: it's meant for moving faster and
          | working almost entirely with the custom
          | functions/types/operators approach, whereas the core
          | extension is integrated much more deeply into the planner &
          | executor and deals with a number of other hooks. It would
          | be possible to do that in Rust, but because it's so deeply
          | integrated with the main Postgres codebase there would just
          | be a lot of translation going on, and you'd get a lot less
          | of the benefit of working in Rust. So C is more natural for
          | much of the core DB extension.
          | 
          | PGX [4] is also a huge help for writing this sort of
          | extension. The point of Toolkit is really to move much
          | faster and try lots of things, mostly using the more
          | SQL-facing API (CREATE FUNCTION, CREATE AGGREGATE and the
          | like), and PGX helps manage all of that really nicely. The
          | rest of Rust being available is also great and we like
          | working in the language, but without PGX it'd be a lot
          | harder and we couldn't move nearly as fast.
          | 
          | [1]: https://blog.timescale.com/blog/introducing-hyperfunctions-n...
          | [2]: https://github.com/timescale/timescaledb-toolkit
          | [3]: https://github.com/timescale/timescaledb
          | [4]: https://github.com/zombodb/pgx
        
       ___________________________________________________________________
       (page generated 2021-10-19 23:00 UTC)