[HN Gopher] Friendlier SQL with DuckDB
___________________________________________________________________
 Friendlier SQL with DuckDB
 Author : hfmuehleisen
 Score  : 271 points
 Date   : 2022-05-12 14:50 UTC (8 hours ago)
 (HTM) web link (duckdb.org)
 (TXT) w3m dump (duckdb.org)
 | aerzen wrote:
 | If anyone is interested in improvements to SQL, check out PRQL
 | (https://github.com/prql/prql), a pipelined relational query
 | language.
 |
 | It supports:
 |
 | - functions,
 |
 | - using an alias in the same `select` that defined it,
 |
 | - trailing commas,
 |
 | - date literals, f-strings, and other small improvements for
 | things we found unpleasant in SQL.
 |
 | https://lang.prql.builders/introduction.html
 |
 | The best part: it compiles into SQL. It's under development,
 | though we will soon be releasing version 0.2, which should be a
 | "you can check it out" version.
   | lijogdfljk wrote:
   | This is neat. Have you found this newfound capability at odds
   | with "good SQL"? E.g., I run a fairly large application that
   | has a huge DB schema, and more often than not, when the SQL
   | gets huge and ugly it means we're asking too much of the DB.
   | "Too much" meaning it's easier to run into poor indexes,
   | giving more chances to pull in an unexpectedly large number
   | of rows, etc.
   |
   | My fear with PRQL is that I'd more easily ask too much of the
   | DB, given how easy it looks to write larger and more complex
   | SQL. Thoughts?
     | aljazmerzen wrote:
     | That's true - by the time you hit your 4th CTE, you are
     | probably doing something wrong.
     |
     | But not always. Some analytical queries may actually need
     | such complexity. Also, during development, you would
     | sometimes pick only the first 50 rows before joining and
     | grouping, with the intention of not overloading the db. To
     | do this you need a CTE (or nested select), but in PRQL you
     | just add a `take 50` transform to the top.
 | eyelidlessness wrote:
 | Often, efforts and articles like this feel like minor
 | affordances that don't immediately jump out as a big deal, even
 | if they eventually turn out to be really useful down the road.
 | Seeing the article title, that's what I expected. I did not
 | expect to read through the whole thing with my inner voice
 | saying, louder and more enthusiastically each time, "yes!
 | this!" Very cool.
 | Oxodao wrote:
 | Came across this a few times but never got to try it out,
 | because the only Golang binding is unofficial and I can't get
 | CGO to work as expected...
 |
 | It would be really neat to have an official one. This article
 | makes me want to try it even more.
   | 1egg0myegg0 wrote:
   | Here is a solved GitHub issue related to CGO for the Go
   | bindings! If you have another issue, please feel free to post
   | it on their GitHub page!
   |
   | https://github.com/marcboeker/go-duckdb/issues/4
     | Oxodao wrote:
     | Thanks! I didn't see that, I'll give it another try!
   | nlittlepoole wrote:
   | Ran into some similar issues with those bindings. We switched
   | to using the ODBC drivers and it's been great. Those are
   | official, and we just use an ODBC library in Go to connect.
 | carlineng wrote:
 | I love these updates. It would be great to see some of the
 | major data warehouse vendors (Snowflake, BigQuery, Redshift)
 | follow suit.
 | eis wrote:
 | I was just exploring DuckDB yesterday and it looked very
 | promising, but I was very surprised to find out that indexes
 | are not persisted (and I assume that means they must fit in
 | RAM).
 |
 | > Unique and primary key indexes are rebuilt upon startup,
 | while user-defined indexes are discarded.
 |
 | The second part - simply discarding previously defined indexes
 | - is super surprising.
 |
 | https://duckdb.org/docs/sql/indexes
 |
 | This was an instant showstopper for me, and I assume for most
 | people whose databases grow to a bigger size - the point at
 | which an OLAP DB becomes interesting in the first place.
 | Also, the numerous issues on GitHub regarding crashes make me
 | hesitant.
 |
 | But I really like the core idea of DuckDB being a very simple
 | codebase with no dependencies while still providing very good
 | performance. I guess I would just like to see more SQLite-esque
 | stability/robustness in the future, and I'll surely revisit it
 | at some point.
   | 1egg0myegg0 wrote:
   | Persistent indexes are being actively worked on! Stay tuned.
   | As for the crashes - DuckDB is very well tested and used in
   | production in many places. The core functionality is very
   | mature! Let us know if you test it out! Happy to help if I
   | can.
   |
   | (disclaimer - on the DuckDB team)
     | eis wrote:
     | Hi, good to hear that you guys care about testing. One
     | thing, apart from the GitHub issues, that led me to believe
     | it might not be super stable yet was the benchmark results
     | on https://h2oai.github.io/db-benchmark/ which make it look
     | like it couldn't handle the 50GB case due to an
     | out-of-memory error. I see that the benchmark and the
     | versions used are about a year old, so maybe things have
     | changed a lot since then. Can you chime in regarding the
     | current story of running bigger DBs, like 1TB, on a machine
     | with just 32GB or so of RAM? Especially regarding data
     | mutations and DDL queries. Thanks!
       | 1egg0myegg0 wrote:
       | Yes, that benchmark result is quite old in Duck years!
       | :-)
       |
       | We actually run that benchmark as a part of our test
       | suite now, so I am certain that there is improvement from
       | that version.
       |
       | The biggest DuckDB I've used so far was about 400 GB on a
       | machine with ~250 GB of RAM.
       |
       | There is ongoing work that we are treating as a high
       | priority for handling larger-than-memory intermediate
       | results within a query.
       | But we can handle larger-than-RAM data in many cases
       | already - we sometimes run into issues today if you are
       | joining two larger-than-RAM tables together (depending on
       | the join), or if you are aggregating a larger-than-RAM
       | table with really high cardinality in one of the columns
       | you are grouping on.
       |
       | Would you be open to testing out your use case and
       | letting us know how it goes? We always appreciate more
       | test cases!
   | pholanda wrote:
   | Hey eis, you are correct. We do not support index storage in
   | our latest release. I am currently implementing this, and it
   | is at a fairly advanced stage, so it should be featured in
   | the next release.
   |
   | This took a little while because we use a fairly modern index
   | structure with no literature on how to buffer-manage it.
     | eis wrote:
     | It's good to hear that persistent indexes are coming soon.
     | I saw it was on the roadmap but didn't know how far out the
     | feature was. Do you have an idea when that release could be
     | out?
     |
     | BTW, do you have some kind of code/docs one can take a look
     | at regarding the index structure? I'm a part-time data
     | structure nerd :)
       | 1egg0myegg0 wrote:
       | Here is a paper that was used when building the DuckDB
       | approach!
       |
       | https://db.in.tum.de/~leis/papers/ART.pdf
       |
       | There are some other papers and details about the
       | architecture here: https://duckdb.org/why_duckdb
 | flakiness wrote:
 | I love the attitude of ergonomics over standards compliance.
 | And you'll see why SQL has never been really portable across
 | databases ;-)
   | aerzen wrote:
   | Well, it's not portable, but you don't have to learn a new
   | language every time you encounter a new DBMS (unlike MongoDB
   | or InfluxDB).
   | db65edfc7996 wrote:
   | Maybe if SQL had been a better language at the start, there
   | would be more incentive to follow the spec.
     | throwawayboise wrote:
     | SQL is an amazingly good language for what it does. It has
     | been with us for coming up on 5 decades.
       | lijogdfljk wrote:
       | Minus everyone having different versions of it to
       | facilitate the missing functionality/needs of users.
   | zasdffaa wrote:
   | IIRC there was an ANSI/FIPS standard which was dropped during
   | the Bill Clinton administration, which is when things started
   | to diverge.
   |
   | (Info from memory, may be wrong or a bit mangled)
 | wenc wrote:
 | This is fantastic. Column aliases are super helpful in reducing
 | verbose messiness.
 |
 | DuckDB has all but replaced Pandas for my use cases. It's much
 | faster than Pandas, even when working with Pandas data frames.
 | I "import duckdb as db" more than I "import pandas as pd" these
 | days.
 |
 | The only thing I need now is a parallelized APPLY syntax in
 | DuckDB.
   | 1egg0myegg0 wrote:
   | Fugue has a DuckDB back end, and I believe they can use Dask
   | and DuckDB in combination for something similar to what you
   | are looking for! There is also a way to map Python functions
   | in DuckDB using the relational (dataframe-like) API.
   |
   | https://fugue-tutorials.readthedocs.io/tutorials/integration...
   |
   | https://github.com/duckdb/duckdb/pull/1569
 | projektfu wrote:
 | I find the examples very confusing, because they use names that
 | sound like rows or tables (jar_jar_binks, planets) as fields.
   | 1egg0myegg0 wrote:
   | Ah, well, that was a risk that I took... Thank you for the
   | feedback though! The Star Wars puns were too hard to
   | resist...
   |
   | If you have a specific question, definitely post it here and
   | I will clarify!
     | petepete wrote:
     | I'd agree. Took me a couple of reads to make sense of it.
     |
     | Keep the puns; I just think with a bit of adjustment the
     | examples would be easier to understand.
 | mattrighetti wrote:
 | `EXCLUDE`
 |
 | Extremely useful. Is there a reason this wasn't implemented in
 | SQL in the first place? I often find myself writing very long
 | queries just to select basically all columns except for two or
 | three of them.
   | heyda wrote:
   | Because columns can be added to tables in production
   | databases, so any time you use select * you run the risk of
   | the number of columns changing and breaking anything you
   | wrote.
     | wruza wrote:
     | This seems reasonable on its own, but then you can add a
     | compound index and forget to join on the second part, or
     | refactor a column into two and only collect one value into
     | an aggregation. This spotted babysitting is just stupid. If
     | you're anxious about query integrity, get some tooling and
     | check your SQL/DDL against some higher-level schema.
     |
     | Even if that turns out to be a constant source of trouble
     | worth _not_ having, then why can't SQL provide column sets
     | at least, so that queries could include, group, or join on
     | these predefined sets of columns instead of repeating tens
     | of columns and/or expressions and/or aggregations many
     | times across a single query? You had
     | employees.bio_set=(name, dob); now you add `edu` to it and
     | it just works everywhere, because you think in sets rather
     | than in specific columns. Even group by bio_set works.
     | Heck, I bet most ORMs partially exist only to generate SQL,
     | because it's sometimes unbearable as is.
     | justin_oaks wrote:
     | That's a problem with select * in general, not a problem
     | with using EXCLUDE with select *. So that still doesn't
     | explain why it's not in SQL to begin with.
     | munk-a wrote:
     | I've always viewed SELECT * as a convenience for schema
     | discovery and a huge bonus for subqueries - our shop
     | excludes its use at the top level in production due to the
     | danger of table definitions changing underneath... but we
     | happily allow subqueries to use SELECT * so long as that
     | column list is clearly defined before we leave the
     | database.
     |
     | Worse, by far, than a column you didn't expect being added
     | is a column you did expect being _removed_.
     | Depending on how thorough your integration tests are (and
     | ideally they should be pretty thorough), you could suddenly
     | start getting strange array key access (or object key not
     | found) errors somewhere on the other side of the codebase.
       | throwawayboise wrote:
       | Yeah, I tend to use "select *" in interactive queries
       | when I'm working out what I want, but then write explicit
       | column names in anything going into production. This
       | helps with the column-being-removed case, as the query
       | will fail immediately when selecting a nonexistent
       | column, whereas "select *" will not fail and the error
       | will happen somewhere else.
     | heyda wrote:
     | "A traditional SQL SELECT query requires that requested
     | columns be explicitly specified, with one notable
     | exception: the * wildcard. SELECT * allows SQL to return
     | all relevant columns. This adds tremendous flexibility,
     | especially when building queries on top of one another.
     | However, we are often interested in almost all columns. In
     | DuckDB, simply specify which columns to EXCLUDE:"
     |
     | It appears that how this works is that it selects all
     | columns and then EXCLUDES only the columns specified. The
     | reason this doesn't exist in normal SQL is that it is a
     | terrible idea. This is something that will break at many
     | companies with large technical debt if it is ever used.
   | gmueckl wrote:
   | That's why good database wrappers support referencing columns
   | in result sets by column name. It's good practice.
 | wolf550e wrote:
 | That 750KB PNG can probably be a 50KB PNG. Even without
 | resizing, it compresses to less than half its size.
 |
 | https://duckdb.org/images/blog/duck_chewbacca.png
   | 1egg0myegg0 wrote:
   | Thanks! Can you tell that my SQL-fu is stronger than my
   | HTML-fu? :-) Much appreciated!
   | andai wrote:
   | In this case it should probably be a JPEG? (Unless it has a
   | transparent background _and_ the site responds to the user's
   | dark-mode setting?
   | :) Also, this image looks like it almost certainly _was_ a
   | JPEG at some point!)
     | tracker1 wrote:
     | The color palette is pretty limited, so using something
     | that can have a specific/limited palette (say 48-64 colors
     | in this case) is probably going to give a better result
     | than JPEG. Also, optimizing for display size would take it
     | further still. Alpha transparency support is also a bonus
     | for PNG over JPEG.
 | getravi wrote:
 | I would go even further and say that "GROUP BY ALL" and "ORDER
 | BY ALL" should be implied if not provided in the query.
 |
 | EDIT: Typo
 | foxbee wrote:
 | This is awesome, and I would love to chat about building an
 | integration with the low-code platform Budibase:
 | https://github.com/Budibase/budibase
 | parentheses wrote:
 | Though many of the queries don't make complete sense, the
 | mapping of queries to Star Wars is :chefkiss:
 | Timpy wrote:
 | Wow, this was definitely a pessimistic click for me. I was
 | thinking "trying to replace SQL? How stupid!" But it just looks
 | like SQL with all the stuff you wish SQL had, and some more
 | stuff you didn't even know you wanted.
 | zasdffaa wrote:
 | This looks a little odd:
 |
 |     SELECT age, sum(civility) as total_civility
 |     FROM star_wars_universe
 |     ORDER BY ALL -- ORDER BY age, total_civility
 |
 | there's no GROUP BY?
 |
 | edit: (removed edit, I blew it, sorry)
   | 1egg0myegg0 wrote:
   | Good catch! Fixed! Needed a group by all in there.
 | _raoulcousins wrote:
 | I love love love DuckDB. When I can use DuckDB + pyarrow and
 | not import pandas, it makes my day.
 | pxtail wrote:
 | Wow, so much nice database-related news recently - feels like
 | database week or something! :)
 | roncohen wrote:
 | Lots of great additions. I will just highlight two:
 |
 | _Column selection_ :
 |
 | When you have tons of columns these become useful.
 | ClickHouse takes it to the next level and supports APPLY and
 | COLUMN in addition to the EXCEPT and REPLACE that DuckDB
 | supports:
 |
 | - APPLY: apply a function to a set of columns
 |
 | - COLUMN: select columns by matching a regular expression (!)
 |
 | Details here:
 | https://clickhouse.com/docs/en/sql-reference/statements/sele...
 |
 | _Allow trailing commas_ :
 |
 | I can't count how many times I've run into a problem with a
 | trailing comma. There's a whole convention developed to
 | overcome this: the prefix-comma convention, where you'd write:
 |
 |     SELECT first_column
 |     ,second_column
 |     ,third_column
 |
 | which lets you easily comment out a line without worrying about
 | trailing-comma errors. That's no longer necessary in DuckDB.
 | Allowing trailing commas should get included in the SQL spec.
   | 1egg0myegg0 wrote:
   | Thank you for the feedback! I will check those ClickHouse
   | features out. I totally agree on the trailing commas, and I
   | use comma-first syntax for that same reason! But maybe not
   | anymore... :-)
 | go_prodev wrote:
 | EXCEPT columns would get my vote for ANSI-standard SQL
 | adoption. So much time is spent selecting all but a few
 | columns.
 | gregmac wrote:
 | Yes, trailing commas should work everywhere!
 |
 | JSON is the other place where this annoys me, but luckily I
 | rarely hand-write any JSON anymore (and there are
 | semi-solutions for this like JSON5).
 |
 | In code I always add trailing commas to anything
 | comma-separated. It makes editing simpler (you can shuffle
 | lines without thinking about commas), and in a diff or blame it
 | doesn't show adding a comma as a change.
 |
 | SQL is the one spot where this doesn't work, and it's a
 | constant foot-gun, as I often don't remember until I run the
 | query and get a syntax error.
   | skrtskrt wrote:
   | JSONC allows comments and trailing commas, but adoption seems
   | to be low.
   | VSCode uses it for configuration, but when I wanted to use
   | it in Python (to add context to source-controlled
   | Elasticsearch schemas) there were only a couple of old,
   | barely-maintained libraries for parsing it.
     | yunohn wrote:
     | > there were only a couple of old, barely-maintained
     | libraries for parsing it.
     |
     | Do they work, though? If it's a mostly stable standard, it
     | doesn't seem like you'd need a frequently updated parser.
   | nicoburns wrote:
   | > Allowing trailing commas should get included in the SQL
   | spec.
   |
   | Yep! That would be my #1 request for SQL. Seems ridiculous
   | that it's not supported already.
   | _dark_matter_ wrote:
   | BigQuery also supports trailing commas!
   | skeeter2020 wrote:
   | You can do the same thing with your WHERE clause and ANDs by
   | always starting it with WHERE 1=1.
   |
   | >> Allowing trailing commas should get included in the SQL
   | spec.
   |
   | So, there is no "SQL spec" per se; there's an ANSI
   | specification with decades of convention and
   | provider-specific customizations piled on top. This support
   | for trailing commas is the best you're going to get.
 | snidane wrote:
 | Allowing references to columns defined previously in the same
 | query would make DuckDB competitive for data analytics. Without
 | that, one has to chain WITH statements for just the tiniest
 | operations.
 |
 |     select 1 as x, x + 2 as y, y/x as z;
   | zasdffaa wrote:
   | (nothing to do with DuckDB but..) SQL is complex enough, and
   | allowing this (and acyclically, as mentioned below) would do
   | my $%^& nut implementing it.
   |
   | But I know a user requirement when I hear one, so can you
   | give me a large, real example of where allowing this would
   | make things easier? That would be mega helpful, ta
     | wruza wrote:
     | _SQL is complex enough_
     |
     | No, it is not. I mean it is, but not in the parts where
     | that could be seen as useful and/or convenient. [A]cyclic
     | graph traversal/etc. is one of the basic tests in a modern
     | interview at any CRUD studio.
     | How come it could do $%^& to any part of yours?
       | zasdffaa wrote:
       | > How come it could do $%^& to any part of yours?
       |
       | Because just implementing the standard stuff nearly did
       | my $^&% nut. Also, I know about graphs & posets, and it's
       | potentially a little more complex than it seems. The
       | variables in
       |
       |     select x * x as y, 1 as x
       |
       | are meh, but what about
       |
       |     select
       |       (select tbl.z from tbl where tbl.y = y) as subq,
       |       x * yy as y,
       |       xx + 1 as x,
       |       subq + yy as zzz
       |     from (
       |       select xx, yy
       |       from ...
       |     )
       |
       | I just don't fancy supporting that.
         | wruza wrote:
         | _what about_
         |
         | That's no different from the first snippet, if you
         | aren't parsing it with regexps, of course. The
         | resulting AST identifiers would simply refer not only
         | to column names but also to other defined expressions.
         | This is the case for both snippets. It's either cyclic
         | or not, and when not, it is easy to
         | substitute/cse/etc. as usual. The complexity of these
         | expressions is irrelevant.
   | yread wrote:
   | For example,
   |
   |     select id, count(...something complicated) as
   |     complicated_count
   |     from ....
   |     order by complicated_count
   |
   | would help.
     | jsmith99 wrote:
     | 'ORDER BY 2' would work here, but using the named column is
     | a lot nicer.
       | tomjakubowski wrote:
       | Wow, TIL. Great tip for those random one-off queries you
       | have to bash out when investigating a problem.
     | 1egg0myegg0 wrote:
     | Yes, good thought! That is listed at the bottom of the
     | article as something we are looking at for the future.
   | flakiness wrote:
   | There is a bug for that, and it looks like someone is even
   | working on it. https://github.com/duckdb/duckdb/issues/1547
   | karmakaze wrote:
   | There's also no need to make it left-to-right usage, as long
   | as it's acyclic:
   |
   |     select y-2 as x, 3 as y, y/x as z;
     | sonthonax wrote:
     | Would this be compiled into a graph of subqueries and
     | window statements?
       | karmakaze wrote:
       | I'm not following; the original could be written as
       |
       |     select x + 2 as y, 1 as x, y/x as z;
       |
       | with the same column values in a different order. Order
       | of arguments shouldn't matter is all I was saying.
   | gigatexal wrote:
   | Yep! Agreed!
 | IshKebab wrote:
 | Matching columns by regular expression sounds like a _terrible_
 | feature. Talk about bug-prone!
 | tosh wrote:
 | How does DuckDB compare to SQLite? (e.g. which workloads are a
 | good fit for which? Would it be a good idea to use both?)
 |
 | I found https://duckdb.org/why_duckdb but I'm sure someone here
 | can share some real-world lessons learned?
   | eatonphil wrote:
   | DuckDB: embedded OLAP; SQLite: embedded OLTP. For small
   | datasets (<1M rows, let's say) either would be similar in
   | performance.
   |
   | I need to do the benchmarks to substantiate this, but this is
   | my intuition.
   | enjalot wrote:
   | One thing I love about DuckDB is that it supports Parquet
   | files, which means you can get great compression on the data.
   | Here's an example getting a 1-million-row CSV under 50MB,
   | with interactive querying in the browser:
   | https://observablehq.com/@observablehq/bandcamp-sales-data?c...
   |
   | The other big thing is better native data types, especially
   | dates. With SQLite, if you want to work with timeseries you
   | need to do your own date/time casting.
     | 1egg0myegg0 wrote:
     | Yes, it is always difficult to use dates in SQLite...
     | DuckDB makes dates easier - like they should be!
   | 1egg0myegg0 wrote:
   | Excellent question! I'll jump in - I am a part of the DuckDB
   | team, though, so if other users have thoughts it would be
   | great to get other perspectives as well.
   |
   | First things first - we really like quite a lot about the
   | SQLite approach. DuckDB is similarly easy to install and is
   | built without dependencies, just like SQLite. It also runs in
   | the same process as your application, just like SQLite does.
   | SQLite is excellent as a transactional database - lots of
   | very specific inserts, updates, and deletes (called OLTP
   | workloads). DuckDB can also read directly out of SQLite
   | files, so you can mix and match them!
   | (https://github.com/duckdblabs/sqlitescanner)
   |
   | DuckDB is much faster than SQLite when doing analytical
   | queries (OLAP), like when calculating summaries or trends
   | over time, or joining large tables together. It can use all
   | of your CPU cores for a sometimes ~100x speedup over SQLite.
   |
   | DuckDB also has some enhancements with respect to data
   | transfer in and out of it. It can natively read Pandas, R,
   | and Julia dataframes, and can also read Parquet files
   | directly (meaning without inserting first!).
   |
   | Does that help? Happy to add more details!
     | OskarS wrote:
     | One of SQLite's most appealing aspects to me is using it as
     | an application file format, as described in this article:
     | https://www.sqlite.org/appfileformat.html
     |
     | How does DuckDB compare in that aspect? Does it have the
     | same kind of guarantees of robustness, incorruptibility,
     | and performance (especially reading/writing binary blobs)
     | that SQLite does?
     |
     | In any case: DuckDB looks great, nice work! Good to have
     | more players in this space!
       | 1egg0myegg0 wrote:
       | It is a goal of ours to become a standard multi-table
       | storage format! However, today we are still in beta and
       | have made some breaking changes in the last few releases.
       | (Exporting from the old version, then reimporting your DB
       | in the new one, allows you to upgrade!) Those should
       | happen less often as we move forward (the storage format
       | was genericized a bit and is more resilient to future
       | enhancements now), and locking in our format and
       | guaranteeing backwards compatibility will occur when we
       | go to 1.0!
       | 1egg0myegg0 wrote:
       | I should add that we can read/write Apache Arrow as well!
     | tosh wrote:
     | Thanks, it does help! I understand SQLite might be
     | better/ideal for OLTP (?)
     | but would DuckDB also work for use cases where I query for
     | specific records (e.g. based on primary key), or would I
     | rather use SQLite for the OLTP stuff and then read the
     | SQLite file from DuckDB for analytical workloads?
     |
     | Basically I'm wondering: if I go all in on DuckDB instead
     | of SQLite, would I notice? Do I have to keep anything in
     | mind?
     |
     | I know, probably difficult to answer without a concrete
     | example of data, schema, queries, and so on.
     |
     | The SQL query features in the article seem really neat.
     | Kudos @ shipping.
       | 1egg0myegg0 wrote:
       | Good questions! You are correct that it depends. We do
       | have indexes to help with point queries, but they are not
       | going to be quite as fast as SQLite's, because DuckDB
       | stores data in a columnar format. (Soon they will be
       | persistent - see comments above!) That columnar format is
       | really great for scanning many items, but not optimal for
       | grabbing all of a single row.
       |
       | With DuckDB, bulk inserts are your friend and are
       | actually super fast.
       |
       | Definitely let us know what you find! Just open up a
       | discussion on GitHub if you'd like to share what you find
       | out: https://github.com/duckdb/duckdb/discussions
   | ergocoder wrote:
   | SQLite's SQL is severely limited.
   |
   | If you want to do something a bit more complex, you will have
   | a bad time. Hello, WITH RECURSIVE.
   | IshKebab wrote:
   | I haven't used DuckDB yet, but the biggest differences I've
   | discovered, if you aren't working on huge datasets where the
   | column/row thing makes a difference (you're probably not),
   | are:
   |
   | 1. SQLite has a great GUI and is really, really widely
   | supported.
   |
   | 2. DuckDB is properly statically typed with a much wider
   | range of types than SQLite, which is dynamically typed and
   | only _just_ added support for any kind of type checking at
   | all.
     | 1egg0myegg0 wrote:
     | If you'd like to work with DuckDB in a SQL IDE/GUI, we
     | recommend DBeaver! It uses the DuckDB JDBC connector.
     | A quick how-to guide is here:
     | https://duckdb.org/docs/guides/sql_editors/dbeaver
 | chrisjc wrote:
 | What are some potential long-term liabilities we might see in
 | choosing to adopt DuckDB today?
 |
 | Obviously there will be a desire to monetize this project, if
 | only for the very simple reason of subsidizing the cost of its
 | development and maintenance. I love everything I hear and see
 | about this project, but it makes me nervous to recommend it
 | internally, not only because it is at such an early stage, but
 | also because of any unforeseen costs and liabilities that it
 | might introduce in the future.
   | 1egg0myegg0 wrote:
   | Let me see if I can assuage some of your concerns!
   |
   | First off - DuckDB is MIT licensed, so you are welcome to use
   | and enhance it essentially however you please!
   |
   | DuckDB Labs is a commercial entity that offers commercial
   | support and custom integrations (https://duckdblabs.com/). If
   | the MIT-licensed DuckDB works for what you need, then you are
   | all set no matter what!
   |
   | However, much of the IP for DuckDB is owned by a foundation,
   | so it is independent of that commercial entity.
   | (https://duckdb.org/foundation/)
   |
   | Does that help? Happy to answer any other questions!
 | learndeeply wrote:
 | Since the DuckDB people are here, I just want to say that what
 | you're doing is going to be a complete game-changer in the next
 | few years, much like SQLite changed the game. Thanks for making
 | it open source!
___________________________________________________________________
(page generated 2022-05-12 23:00 UTC)