[HN Gopher] Function pipelines: Building functional programming ...
___________________________________________________________________

Function pipelines: Building functional programming into PostgreSQL

Author : oconnor663
Score  : 252 points
Date   : 2021-10-19 15:29 UTC (7 hours ago)

(HTM) web link (blog.timescale.com)
(TXT) w3m dump (blog.timescale.com)

| eska wrote:
| Cool stuff!
|
| Out of curiosity: how does the composition of the sort function work? Can you avoid knowing all rows at the point it occurs?
| Lockerman wrote:
| Not yet; right now, all datapoints must be known at the time the sort occurs.
| tytrdev wrote:
| _laughs in datomic_
| dkersten wrote:
| Not everyone can use datomic though, for many reasons (including cost). I'd love to use datomic, but it's just not an option for me, while postgres is (and the DigitalOcean hosted postgres is quite affordable and supports TimescaleDB too).
| claytonjy wrote:
| Would that be the community version of Timescale then?
|
| Looks like this is a whole new extension; does DO allow you to install your own, or do you think it'd be easy to make a case for them adding this one? (I'm not a DO user myself)
| djk447 wrote:
| (NB: Post author here)
|
| This is in the TimescaleDB Toolkit extension [1], which is licensed under our community license for now, and it's not available on DO. It is available, fully managed, on our cloud service. You can also install it and run it for free yourself.
|
| [1]: https://github.com/timescale/timescaledb-toolkit
| ithrow wrote:
| Until you use the very slow and extremely buggy black box in production.
| darksaints wrote:
| This is really freaking cool. I'm not often impressed by clever hacks in SQL, but this solves a really annoying problem, and the specific way in which you solved this problem, with just custom functions and operators, is ingenious and reassuring, as opposed to some of the other crazy ideas that I've seen to extend SQL.
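For readers new to the pipeline syntax discussed here, this is roughly what the post's headline pipeline `timevector(ts, val) -> sort() -> delta() -> abs() -> sum()` computes, sketched in plain Python (an illustration of the semantics with made-up data, not TimescaleDB code):

```python
# Volatility = sum() of abs() of delta() over a time-sort()ed series of
# (ts, val) points. Data below is hypothetical.
points = [(2, 5.0), (1, 1.0), (3, 2.0)]  # (ts, val)

vals = [v for _, v in sorted(points)]              # sort() by timestamp
deltas = [b - a for a, b in zip(vals, vals[1:])]   # delta() between neighbors
volatility = sum(abs(d) for d in deltas)           # abs() -> sum()
```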
|
| I would love to see more about how you implemented the lambdas, as that's the one thing that is still non-obvious to me.
| djk447 wrote:
| (NB: Post author here)
|
| We're thinking about doing a post on them at some point in the future!
|
| Any specific questions we should answer?
| darksaints wrote:
| I guess mostly how the syntax works. The `$$` thing I mostly recognize as being a code block in custom functions, but I'm not sure how it works in the context of a lambda. Also, is `$value` just a special identifier for the input?
|
| A separate question: is it possible to have tuple-valued timevectors?
| djk447 wrote:
| The docs have a bit more on it [1]. The $$ is really just a dollar-quoted string [2]; you can use any sort of string in there, but dollar quoting just makes it a lot easier to read.
|
| On your other question, not right now, but if it's something you want, file a github issue. It's something we've thought a bit about, and it gets hairy pretty quickly, but it also increases the power of these things dramatically. We'll probably do something with it, but it may be a little while, and we may let this stuff bake for a bit first...
|
| [1]: https://docs.timescale.com/timescaledb/latest/how-to-guides/...
| [2]: https://www.postgresqltutorial.com/dollar-quoted-string-cons...
| ghop02 wrote:
| While I agree the partition syntax may not be super straightforward, adding a new operator for another way to sort, sum, abs, etc. seems like chaos.
| claytonjy wrote:
| I think "may not be super straightforward" undersells the learning curve here. I took to SQL pretty naturally, but gnarly partitioning is still the easiest way for me to get tripped up. Having taught a lot of data-sciencey folks advanced SQL, even simple uses of partitioning tend to be a big barrier compared to what comes before.
| djk447 wrote:
| (NB: Post author here)
|
| Totally understand the concern. For now, it's just for a particular datatype; all of the other operations are happening normally inside your query, and you can still use window functions...
|
| And btw, I love window functions and will probably be doing a post on them, but they can be a little bit weird and have some really odd ordering properties with aggregates etc.
|
| The point of this is really much more about providing choices for folks. But yeah, if we find that people are just getting really confused we could re-consider; that's part of why we released it experimentally.
| akulkarni wrote:
| Also, the team picked `->` as the operator, which I thought was a very natural and intuitive choice :-)
|
| e.g.,
|
|     SELECT device_id,
|            timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility
| OJFord wrote:
| What if those functions (were something that) returned json{,b}?
|
| Oh, or I see timevector returns a custom datatype, so I suppose the answer is it has to be one of a fixed number of provided types?
| djk447 wrote:
| (NB: post author here)
|
| We're planning on providing various outputs; json is definitely one format we'd want to support. Want to file a github issue with the layout that you'd want, and we can start getting some feedback to implement?
| Lockerman wrote:
| > Oh or I see timevector returns a custom datatype, so I suppose the answer is it has to be one of a fixed number of provided types?
|
| Exactly! These aren't general arithmetic operators; they're specifically for mapping arithmetic operations over timevectors.
| ecoulthard wrote:
| I'm looking for a more efficient way of detecting errors in metrics that store running totals, such as Odometer. I use the delta of Odometer to compute Distance. But decreases in Odometer cause incorrect Distance calculations. Odometer should never decrease. It is an error if Odometer decreases.
Would function pipelines be able to help me detect decreases in Odometer?
| djk447 wrote:
| (NB: Post author here)
|
| Yes! You can use the `delta()` element followed by a `filter()` with a lambda looking for negative numbers, I think!
| jbylund wrote:
| What does the explain plan for a pipeline like this look like?
|
| In my experience, one of the biggest things scaring devs away from moving more logic over the wire towards the db is lubrication between version control and database state. If I end up defining 10s or 100s of custom pipeline operators in git, how do I keep that in sync with the db? And am I responsible for doing all that legwork myself?
| 1point618 wrote:
| This is immediately useful. The number of awkwardly nested subqueries that I write because I don't have this kind of functionality in BigQuery is a huge bummer.
|
| I am curious, how many Haskell programmers do you all have? And more seriously, do you have any plans to help provide this functionality to non-Postgres SQLs, or at least help those trying to take inspiration from it?
| fifilura wrote:
| I wonder if the array functions in the presto engine (used for example in Amazon Athena) would be able to handle this too?
|
| https://prestodb.io/docs/0.217/functions/array.html
| fifilura wrote:
| Ok, a 15 minute hack. It may be possible to make it look nicer.
|
| And it looked better in TFA, so good job!
|     SELECT device_id,
|            reduce(
|              zip_with(
|                array_sort(zip(array_agg(ts), array_agg(val))),
|                slice(array_sort(zip(array_agg(ts), array_agg(val))), 2, 5000),
|                (a, b) -> abs(a.field1 - b.field1)),
|              0, (s, x) -> s + COALESCE(x, 0), s -> s) as res
|     FROM (
|       VALUES (1, 2, 1), (1, 3, 0), (1, 9, 3), (1, 3, 4),
|              (2, 2, 1), (2, 8, 0), (3, 4, 3)
|     ) AS t(device_id, val, ts)
|     GROUP BY device_id
|
|     device_id  res
|     1          14
|     3          0
|     2          6
| jpitz wrote:
| Minor nitpick: Nested subqueries ARE awkward, which is why I would express
|
|     SELECT device id,
|            sum(abs_delta) as volatility
|     FROM (
|       SELECT device_id,
|              abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
|       FROM measurements
|       WHERE ts >= now() - '1 day'::interval) calc_delta
|     ) GROUP BY device_id;
|
| this way:
|
|     WITH temperature_delta_past_day AS (
|       SELECT device_id,
|              abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
|       FROM measurements
|       WHERE ts >= now() - '1 day'::interval  //edit - the remainder of this line is a typo: ) calc_delta
|     )
|     SELECT device id, sum(abs_delta) as volatility
|     FROM temperature_delta_past_day
|     GROUP BY device_id;
|
| To me, it is a lot more natural to use SQL's CTE syntax to 'predefine' my projections before making use of them, instead of trying to define them inline - like the difference between when you'd define a lambda directly inline vs defining a separate function for it.
|
| I don't know if trying to embed a DataFrame-esque api inside of SQL is a thing that would benefit me, but it is an interesting idea.
| jpruittsql wrote:
| CTEs are wonderful for readability, but until recent versions of PostgreSQL they were always "materialized", which can have performance implications vs. subqueries. The NOT MATERIALIZED option to CTEs was added in PostgreSQL 12. (Timescale Engineer)
| nextos wrote:
| Came here to say the same thing.
Some years back, to pass my CS databases course, we were required to write really complex queries in the computer lab. We had very limited time to do so.
|
| Many people failed because their queries became complex monoliths, hard to debug or optimize when things went wrong.
|
| That's because they limited themselves to SQL-92. We were using Oracle, so there was no reason not to use SQL:1999. I made heavy use of WITH, and it was quite effortless.
| pmontra wrote:
| I'm also using CTEs a lot, precisely because subqueries are hard to reason about, and these pipelines are in turn so much easier too.
| fifilura wrote:
| I agree. There is something magic with CTEs; just moving things around a little bit makes it so much easier to comprehend.
|
| You could even move the val and lag(val) to two different columns and do the abs() in the summary.
|
| That way you can query temperature_delta_past_day and see for yourself what it does.
| djk447 wrote:
| (NB: Post author here)
|
| Yeah. CTEs definitely make it a bit easier to read, though some people get more confused by them, especially because they don't exist in all SQL variants.
|
| And totally agree with that last bit! We want to see if it's useful for folks; it's released experimentally now, and we'll see what folks can do with it. One thing that's fun, and that we may do a post explaining a bit more, is that these pipelines are actually values as well, so the transforms that you run can be stored in a column in the database as well.
|
| And that starts offering some really mind-bending stuff. The example I used was building on the one in the post, except now you have thermocouples with different calibration curves. You can actually store a polynomial or other calibration curve in a column and apply the correct calibration to each individual thermocouple with a JOIN... which is kinda crazy, but pretty awesome.
So we want to figure out how to use these and what people can do with them and see where it takes us.
| jpitz wrote:
| They've been around quite a long time, but yeah, it seems like only SQL89 gets taught.
|
| Window Functions and CTEs are both major force multipliers in the language, so I always encourage folks to go learn them.
| EE84M3i wrote:
| Note that your second example has a misplaced pair of parens:
|
|     WITH temperature_delta_past_day AS (
|       SELECT device_id,
|              abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
|       FROM measurements
|       WHERE ts >= now() - '1 day'::interval) calc_delta
|     )
|     SELECT device id, sum(abs_delta) as volatility
|     FROM temperature_delta_past_day
|     GROUP BY device_id;
|
| should probably be
|
|     WITH temperature_delta_past_day AS (
|       SELECT device_id,
|              abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
|       FROM measurements
|       WHERE ts >= now() - '1 day'::interval
|     )
|     SELECT device id, sum(abs_delta) as volatility
|     FROM temperature_delta_past_day
|     GROUP BY device_id;
| jpitz wrote:
| You're correct - I was in a hurry to express the idea, and I failed to check the where clause.
| djk447 wrote:
| (NB: Post author here)
|
| We don't have many Haskell programmers; we mostly work in Rust, C, and Go, but we're always open to new things...
|
| This is pretty Postgres specific. From the beginning, Postgres has focused on extensibility and allowed this sort of stuff with custom functions/operators/types. Many other SQL variants don't have that. It's one of the main things that sets Postgres apart from other databases; see Stonebraker's great history of this, specifically the stuff on Object Relational databases [1].
|
| We're pretty focused on building on top of Postgres because of that functionality.
We do have some other stuff to make Postgres more scalable, and you're welcome to try us out, but if there's something specific that BigQuery offers that you need, feel free to file a github issue around that too. But yeah, no plans to do things like this in other databases; they just don't have the infrastructure...
|
| [1]: https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/pape...
| akulkarni wrote:
| (TimescaleDB co-founder)
|
| Agree - what would you need to see in PostgreSQL / TimescaleDB to enable you to switch from BigQuery? We are all ears :-)
| methyl wrote:
| Not the op, but infinite scalability, without any infrastructure work required besides loading the data, is what holds me on BigQuery.
| akulkarni wrote:
| Thanks for sharing.
|
| "Infinite scalability" is something we are aiming for in our new vision for Timescale Cloud:
|
| https://blog.timescale.com/blog/announcing-the-new-timescale...
|
| We are not there today, but we are making progress.
|
| Any other feedback - please let us know :-)
| btown wrote:
| To do something similar without custom extensions, it's worth knowing about Postgres's lateral joins, which allow you to reuse columns from the left side of your join in the right side. It basically realizes the "enrich my results" promise that SQL should always have had, and makes incredibly flexible analytics queries possible without ever needing to indent more than once!
|
| https://heap.io/blog/postgresqls-powerful-new-join-type-late...
|
| https://ddrscott.github.io/blog/2017/what-the-sql-lateral/
| chrisjc wrote:
| So does this functional SQL just get "transpiled" into analytical SQL? If so, where does "transpilation" occur - in the client, intercepted before hitting the query engine, etc.? Or has the query engine itself been modified to handle functional SQL?
|
| If the query engine itself has been modified, are there any performance gains or losses?
|
| This is a pretty interesting and compelling idea. I do wonder if the functional SQL approach might entice developers to write very bad/expensive/ineffective queries. Then again, nothing is stopping developers from doing the same with analytical SQL.
| ccleve wrote:
| Duplicating my question to Lockerman below. Sorry for the double post.
|
| Is the -> operator just syntactic sugar? Is this:
|
|     SELECT device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility
|
| the same as this?
|
|     SELECT device_id, sum(abs(delta(sort(timevector(ts, val))))) as volatility
| djk447 wrote:
| Sorry I missed this before, I was out grabbing lunch; answered above!
| WireBaron wrote:
| This is essentially correct. It's a little more complicated inside, and the expansion's really more like:
|
|     SELECT device_id, arrow_run_pipeline(timevector(ts, val),
|       arrow_add_element(sort(),
|         arrow_add_element(delta(),
|           arrow_add_element(abs(), sum()))));
|
| The notable difference here is that this presents a lot more optimization potential, as the entire pipeline can conceivably be applied in one pass through the table.
| djk447 wrote:
| (NB: Post author here)
|
| So this is a great question, and we may do a whole post on this at some point in the future. There are definitely ways to shoot yourself in the foot, but hopefully we can avoid at least some of them.
|
| The main way this happens now is that the pipeline elements actually return a custom data type that represents the composed set of functions, and we can (for at least part of the pipeline) handle simplification and combination of that into a single set of functions to run over the `timevector`.
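The "pipeline elements are values representing composed functions" idea described above can be sketched in a few lines of Python. This is a hypothetical illustration of the concept, not the Toolkit's actual Rust internals; all names here are made up.

```python
# Each pipeline "element" is data describing one or more steps; composing
# elements just concatenates their step lists, and nothing runs until the
# composed pipeline is applied to a timevector, in a single pass.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[int, float]  # (ts, val)

@dataclass
class PipelineElement:
    steps: List[Callable[[List[Point]], List[Point]]]

    def __rshift__(self, other: "PipelineElement") -> "PipelineElement":
        # Composition returns another PipelineElement (data, not a result)
        return PipelineElement(self.steps + other.steps)

    def apply(self, tv: List[Point]) -> List[Point]:
        for step in self.steps:
            tv = step(tv)
        return tv

sort  = PipelineElement([lambda tv: sorted(tv)])
delta = PipelineElement([lambda tv: [(b[0], b[1] - a[1]) for a, b in zip(tv, tv[1:])]])
vabs  = PipelineElement([lambda tv: [(t, abs(v)) for t, v in tv]])

# Analogous to timevector(ts, val) -> sort() -> delta() -> abs(), then sum()
pipeline = sort >> delta >> vabs
tv = [(3, 9.0), (1, 2.0), (2, 3.0), (4, 3.0)]
volatility = sum(v for _, v in pipeline.apply(tv))
```

Because the composed pipeline is an ordinary value, it can also be stored and applied later, which mirrors the "store a pipeline in a column and apply it with a JOIN" trick mentioned elsewhere in the thread.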
|
| With that said, there isn't great support for common subexpression elimination in PG right now. We're hoping we can solve some of that with both subqueries and maybe some fun stuff in the planner/executor, but we're going to try to solve that as we think about moving this from experimental into stable.
|
| Right now we hardly modify the query engine at all; these are just normal functions that are run, and they'll probably remain that way for a while. The optimization here will happen more inside the datastructure, or within a row, to make sure we're ideally not scanning the same `timevector` multiple times. We do hope to add in some vectorization and other bits as we keep going, but we're not 100% sure where that's going to go yet. I'm also going to ask a colleague to respond and clarify this, because he's working more heavily on it, and I know some of the bits but only some of it...
| chrisjc wrote:
| > return a custom data type that represents the composed set of functions, and we can (for at least part of the pipeline) handle simplification and combination of that into a single set of functions to run over the `timevector`
|
| I'm really having trouble wrapping my head around this idea. I guess it speaks volumes about the potential and power of PostgreSQL in general.
|
| Has PostgreSQL ever been used like this in the past, or is this really thinking outside of the box?
|
| We don't use TimeScale or have any intention of doing so in the immediate future (Snowflake provides 99% of what we need atm), but I'm always interested and impressed to hear what they're doing.
| djk447 wrote:
| (NB: Post author here)
|
| Yeah. It's a bit mind bending I guess. Will try to explain, but it might just need a longer format thing, and we will probably do a piece on this in the future.
|
| Essentially, each function returns a special "PipelineElement" datatype that identifies the function, but not what it's acting on.
When you string them together, it returns a "PipelineElement" that represents the composition of all the functions with the correct ordering etc. That can then be applied to the incoming `timevector`.
|
| As far as I know, no one's done this before. I don't think it's a completely new idea; Stonebraker thought about how you can treat code as data and thought that bringing data and code closer together was an important goal of the Postgres project, but this specific way of doing it is pretty new and out there, I think!
|
| I think most people would think that in order to do this you'd need to do something like modify the parser or something more invasive.
|
| (We do use one planner hook on the function call, the expression simplification one, that allows us to convert something like `timevector(ts, val) -> sort() -> delta() -> abs() -> sum()` into `timevector(ts, val) -> (sort() -> delta() -> abs() -> sum())` so that all the operations can be done in a single pass. But that's pretty minor. If we need to figure something out around the common subexpression elimination, we may need to hook into the planner a little more; we'll see.)
| [deleted]
| djk447 wrote:
| Oh yeah, and building on this: this is also the source of the ability to actually store the pipeline element in the database, say in a column, and apply it later, which I mentioned in a comment below. That's one of the cool mind-bending things this allows.
|
| The example I gave below was building on the example in the post, where we had thermometers. Except now think about if you have thermocouples, and each thermocouple had its own calibration curve, and they were different types, so they might even have different types of equations providing the calibration curve.
|
| You can now store the pipeline that does the correct adjustments and calculates the actual temperature separately for each `device_id` and JOIN to the "calibrations" table to apply the calculation.
|
| It's kinda crazy, but really cool, and I'm hoping to do a post on that whole thing at some point to encourage people to play around with it and see what sorts of things it allows...
| derefr wrote:
| For your next magic trick, I'd suggest using these to build a Postgres equivalent to Lucene's persisted Levenshtein automata :)
| [deleted]
| [deleted]
| Lockerman wrote:
| (Timescale engineer here)
|
| To summarize a bit on what David said here [1]: there are no modifications to the query engine; this is all using Postgres's custom operator support.
|
| [1]: https://news.ycombinator.com/item?id=28920110
| ccleve wrote:
| Is this just syntactic sugar, then?
|
| Is this:
|
|     SELECT device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility
|
| the same as this?
|
|     SELECT device_id, sum(abs(delta(sort(timevector(ts, val))))) as volatility
| djk447 wrote:
| (NB: Post author)
|
| Kinda. It's close to that, except in order to make it do that we had to actually make the functions return special types, so it's more equivalent to something like
|
|     SELECT device_id, apply(sum, apply(abs, apply(delta, apply(sort, timevector(ts, val)))))
|
| where each of the items in there is an instance of our special "PipelineElement" type, which defines the function it's applying.
|
| Does that make any sense at all? Not sure if I'm explaining this well...
| ccleve wrote:
| It does make sense. Thanks.
| stingraycharles wrote:
| I can assume that this special syntax allows for better optimizations; the SQL variant is much more flexible, which means you can make far fewer assumptions, and thus there's less room for optimizations.
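The optimization opportunity being discussed, that a restricted, inspectable pipeline can be fused and run in one pass, can be sketched as follows. This is a hedged Python illustration of the general idea, not what the Toolkit actually does; all names are hypothetical.

```python
# Because the composed pipeline is inspectable data, adjacent per-point
# "map" steps can be fused into a single function, so the series is
# traversed once with no intermediate results, instead of once per step.
from typing import Callable, List

MapFn = Callable[[float], float]

def fuse(steps: List[MapFn]) -> MapFn:
    """Collapse a chain of per-point steps into one function."""
    def fused(v: float) -> float:
        for step in steps:
            v = step(v)
        return v
    return fused

# e.g. a small calibration chain: offset, then scale, then abs
steps: List[MapFn] = [lambda v: v - 0.5, lambda v: v * 2.0, abs]

data = [1.0, 0.0, 3.0]
naive = [abs((v - 0.5) * 2.0) for v in data]      # conceptually one pass per step
single_pass = list(map(fuse(steps), data))        # one traversal, same result
```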
|
| Not sure if they're doing that (especially for the v1 version of what looks to be primarily a syntactic selling point, rather than a perf selling point). But that's what I would do.
| djk447 wrote:
| (NB: post author here)
|
| Yep! You got it exactly right! We're doing it for syntax first, and then hopefully we'll go back and start doing interesting performance optimizations on top. There are definitely some things we're thinking about.
| chrisjc wrote:
| Wow, great explanation! Now I really like this idea!
| ccleve wrote:
| It appears that all these extra functions are written in Rust, but the core Timescale product is written in C. Why did you do it that way? Would you do it differently if you were starting from scratch today?
| djk447 wrote:
| (NB: Post author here!)
|
| This is all true! We have a bit more info on that choice in this post [1]. But in general, the Toolkit extension [2] is meant to have a bit of a lighter touch than the core TimescaleDB extension [3]; it's meant for moving faster and working almost entirely with the custom functions/types/operators approach, whereas the core extension is integrated much more deeply into the planner & executor and deals with a number of other hooks. It would be possible to do that in Rust, but because it's so deeply integrated with the main Postgres codebase, there would just be a lot of translation going on and you'd get a lot less of the benefits of working in Rust. So C is more natural for much of the core DB extension.
|
| PGX [4] is also a huge help for writing this sort of extension. The point of Toolkit is really to move much faster and try lots of things, mostly using the more SQL-facing API (CREATE FUNCTION, CREATE AGGREGATE, and the like). And PGX helps manage all of that really nicely. The rest of Rust being available is also great, and we like working in the language, but without PGX it'd be a lot harder and we couldn't move nearly as fast.
|
| [1]: https://blog.timescale.com/blog/introducing-hyperfunctions-n...
| [2]: https://github.com/timescale/timescaledb-toolkit
| [3]: https://github.com/timescale/timescaledb
| [4]: https://github.com/zombodb/pgx
___________________________________________________________________
(page generated 2021-10-19 23:00 UTC)