[HN Gopher] DuckDB-Wasm: Efficient analytical SQL in the browser
___________________________________________________________________
 
DuckDB-Wasm: Efficient analytical SQL in the browser
 
Author : ankoh
Score  : 172 points
Date   : 2021-10-29 14:58 UTC (8 hours ago)
 
(HTM) web link (duckdb.org)
(TXT) w3m dump (duckdb.org)
 
| typingmonkey wrote:
| Does DuckDB support multi-tab usage? How big is the wasm file
| that must be loaded?
| ankoh wrote:
| The WebAssembly module is 1.6 - 1.8 MB brotli-compressed,
| depending on the Wasm feature set. We're currently
| investigating ways to reduce this to around 1 MB. We further
| use streaming instantiation, which means that the WebAssembly
| module will be compiled while downloading it. But still, it
| will hurt a bit more than a 40 KB library.
|
| Regarding multi-tab usage: Not today. The available filesystem
| APIs make it difficult to implement this right now. We're
| looking into ways to make DuckDB-Wasm persistent, but we can
| only read in this release.
| domoritz wrote:
| On https://shell.duckdb.org/versus, we have a comparison with
| related libraries. The WASM bundle is currently 1.8 MB, but it
| can be instantiated while it's streaming in.
|
| The size probably makes it prohibitive to use DuckDB when your
| dataset is small and download size matters, but we hope that
| future improvements in WebAssembly can get the size down.
| [deleted]
| timwis wrote:
| Interesting... Would this be effective at loading a remote CSV
| file with a million rows, then performing basic GROUP BY COUNTs
| on it so I can render bar charts?
|
| I've been thinking of using absurd-sql for it since I saw
| https://news.ycombinator.com/item?id=28156831 last week
| texodus wrote:
| I contribute to https://perspective.finos.org/ , which supports
| all of this and quite a lot more.
|
| Here's a 1,000,000-row example I just threw together for you
|
| https://bl.ocks.org/texodus/3802a8671fa77399c7842fd0deffe925
|
| and a CSV example, so you can try yours right now
|
| https://bl.ocks.org/texodus/02d8fd10aef21b19d6165cf92e43e668
| ankoh wrote:
| It depends.
|
| Querying CSV files is particularly painful over the network,
| since we still have to read everything for a full scan.
|
| With Parquet, you would at least only have to read the columns
| of the group-by keys and aggregate arguments.
|
| Try it out and share your experiences with us!
| elmolino89 wrote:
| Not really a DuckDB-Wasm question, but DuckDB:
|
| I've got a data set probably not suitable for loading into an
| in-memory table (a CSV with close to 1000M rows). I split it
| into 20M-row chunks, read them one by one into a DuckDB
| temporary table, and exported each as Parquet.
|
| SELECT using the glob prefix.*.parquet where mycolumn=foobar
| does work, but it could be a bit faster. Apart from sorting the
| input CSVs before converting to Parquet, what can be done? The
| CSV chunks were already sorted.
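A minimal sketch of that chunked CSV-to-Parquet workflow with the
DuckDB Python API; the file names, chunk count, and column here are
hypothetical, and the glob scan assumes a reasonably recent DuckDB
build:

    import duckdb

    con = duckdb.connect()  # in-memory database, used only as a converter

    # Convert each CSV chunk into its own Parquet file.
    for i in range(50):
        con.execute(f"""
            COPY (SELECT * FROM read_csv_auto('chunk_{i}.csv'))
            TO 'chunk_{i}.parquet' (FORMAT PARQUET)
        """)

    # Query all chunks at once with a glob; the columnar scan reads
    # only the columns the query actually touches.
    print(con.execute(
        "SELECT count(*) FROM 'chunk_*.parquet' WHERE mycolumn = 'foobar'"
    ).fetchall())
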
| fnord77 wrote:
| This looks so cool. Is there a demo page pre-loaded with tables
| so people can try out queries right away?
| 1egg0myegg0 wrote:
| If you head over to the shell demo, you can run a query like
| the one below!
|
| https://shell.duckdb.org/
|
| select * from 'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parqu...'
| limit 10;
| aynyc wrote:
| I wish it supported ORC.
| crimsoneer wrote:
| I'm still not sure I "get" the use case for DuckDB. From what I
| understand, it's a nifty in-memory SQL engine, but why is that
| better than just running Postgres or Microsoft SQL Server
| locally, where your data structures and tables and stuff have a
| lot more permanence?
|
| Like, my workflow is either I query an existing remote
| corporate DB and do my initial data munging there, or I get
| given a data dump that I either work on directly in Pandas or
| add to a local DB and do a little more cleaning there. Not at
| all clear how DuckDB would help.
| nicoburns wrote:
| DuckDB is columnar, so in theory a lot faster than Postgres or
| SQL Server for analytical workloads.
|
| DuckDB is to ClickHouse, TimescaleDB, Redshift, etc. as SQLite
| is to Postgres, MySQL, SQL Server.
| deadliftpro wrote:
| Where do you get the idea that SQL Server does not support
| columnar? That claim is wrong.
| wenc wrote:
| I don't think the OP said that SQL Server doesn't support
| columnar, only that by analogy SQL Server is primarily a row
| store (which for most of its history was true).
|
| Columnar technology in SQL Server only became usable in SQL
| 2016 (it existed in 2012 but was too restrictive -- I know
| because I tried to use it).
|
| In 2016 and above, you can either create a columnar index
| (non-clustered column store) or convert an entire table into a
| columnar table (clustered column store). The technology is
| actually pretty impressive, and I've used it in production
| where I have row stores coexisting with column stores within
| the same database.
| Dayshine wrote:
| The simple answer no one else seems to have mentioned: SQLite
| has quite a low limit on the number of columns it supports,
| which is a problem for data analytics, which often prefers wide
| over long.
| 1egg0myegg0 wrote:
| Check out this post for some comparisons with Pandas.
|
| https://duckdb.org/2021/05/14/sql-on-pandas.html
|
| DuckDB is often faster than Pandas, and it can handle larger-
| than-memory data. Plus, if you already know SQL, you don't have
| to become a Pandas expert to be productive in Python data
| munging. Pandas is still good, but now you can mix and match
| with SQL!
| catawbasam wrote:
| Not just in-memory. It's pretty convenient if you have a set of
| Parquet files with a common schema. Fairly snappy, and the data
| doesn't have to fit in memory.
| jamesrr39 wrote:
| I'm using DuckDB for querying Parquet files as well. It's an
| awesome tool; so nice to just "look into" Parquet files with
| SQL.
| deshpand wrote:
| Many enterprises are coming up with patterns where they
| replicate the data from the database (say Redshift) into
| Parquet files (a data lake?) and direct more traffic, including
| analytical workloads, onto the Parquet files.
|
| DuckDB will be very useful here, instead of having to use
| Redshift Spectrum or whatever.
| deshpand wrote:
| I work heavily with pandas and dask (when you want to use
| multiple cores), using Parquet files for storage. We see a lot
| of benefits in selectively bringing DuckDB into the mix. For
| instance, joins are extremely slow with both pandas and dask
| and require a lot of memory. That's a situation where using
| DuckDB reduces the memory needs and speeds things up a lot.
|
| And we may not want to upload the data into Postgres or another
| database. We can just work with Parquet files and run
| in-process queries.
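The join point above can be tried directly: DuckDB can run SQL over
Pandas DataFrames without importing them first (see the sql-on-pandas
post linked earlier). A small sketch; the DataFrames and column names
are made up:

    import duckdb
    import pandas as pd

    orders = pd.DataFrame({"customer_id": [1, 2, 1],
                           "amount": [10.0, 20.0, 5.0]})
    customers = pd.DataFrame({"customer_id": [1, 2],
                              "name": ["acme", "globex"]})

    con = duckdb.connect()
    con.register("orders", orders)        # expose the DataFrames to
    con.register("customers", customers)  # SQL without copying them

    # The join and aggregation run in DuckDB's vectorized engine
    # rather than in Pandas.
    print(con.execute("""
        SELECT c.name, sum(o.amount) AS total
        FROM orders o JOIN customers c USING (customer_id)
        GROUP BY c.name
    """).df())
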
| mytherin wrote:
| DuckDB developer here. DuckDB is a regular RDBMS that has
| persistent ACID storage but is tuned towards analytical
| workloads, i.e. read-heavy workloads with aggregates that
| require full scans of the data. Any data you write to tables is
| stored persistently on disk, and not all your data needs to fit
| in memory either.
|
| Our tagline is "SQLite for analytics", as DuckDB is an
| in-process database system similar to SQLite that is geared
| towards these types of workloads.
|
| DuckDB has a flexible query engine and also supports directly
| running SQL queries (in parallel!) on top of Pandas [1] and
| Parquet [2] without requiring the data to be imported into the
| system.
|
| [1] https://duckdb.org/2021/05/14/sql-on-pandas.html
|
| [2] https://duckdb.org/2021/06/25/querying-parquet.html
| brandmeyer wrote:
| Maybe this is a silly question: why is the A/B choice between a
| row-major database and a column-major database, instead of
| between row-major tables and column-major tables within a
| flexible database?
|
| What's stopping the other leading brands from implementing
| columnar storage, queries, and such with a COLUMN MAJOR table
| attribute?
| dgudkov wrote:
| SQL calculations on columnar data are quite different from
| those in row-based databases, so it's effectively a different
| database engine. A columnar data store has several advantages
| because it usually employs a form of vocabulary compression.
| For instance, obtaining the distinct values of a field in a
| columnar DB is much faster because they are typically just the
| vocabulary of the field, so it doesn't even require a full
| table scan. Many other columnar computations, such as filtering
| or aggregation, can be done on compressed data without
| decompression.
| mytherin wrote:
| Some databases do offer both, but it is much more involved than
| just changing the storage model. The entire query execution
| model needs to adapt to columnar execution. You can simulate a
| column-store model in a row-store database by splitting a table
| into a series of single-column tables, but the performance
| benefits you will capture are much smaller than in a system
| that is designed and optimized for column-store execution.
| loxias wrote:
| To me, the use case is really obvious: when you reached for
| SQLite but now want something with Moar Powah.
|
| Now I've reduced it to a previously answered question: what's
| the use case for SQLite? ;)
| https://www.sqlite.org/whentouse.html
|
| That being said, I don't see the point, and shudder at the idea
| of a web page's javascript doing anything which _needs_
| noticeable amounts of CPU, but I'm a non-standard user...
| kf6nux wrote:
| I don't fully get the use case either, but it's in a different
| category than Postgres or Microsoft SQL because it runs in the
| browser and can be made part of your web app.
| ankoh wrote:
| DuckDB-Wasm is targeting the browser, so it's not directly
| competing with Pandas (that's the job of native DuckDB).
|
| It's targeting use cases where you want to push analytical
| computation away from servers into the client (browser).
|
| Let me sketch two examples:
|
| A) You have your data sitting in S3 and the user-specific data
| is in a browser-manageable area.
|
| (E.g. this paper from Tableau research actually states dataset
| sizes that should fall into that category:
| https://research.tableau.com/sites/default/files/get_real.pd...)
|
| In that scenario, you could eliminate your central server if
| your clients are smart enough.
|
| B) You are talking about larger dataset sizes (GB) and want to
| explore them ad hoc in your browser.
|
| Uploading them is unrealistic, and installing additional
| software is no longer ad hoc enough.
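For comparison, the native-DuckDB side of example B, ad-hoc
exploration of a large Parquet file, is a one-liner in Python; a
sketch with a hypothetical file and TPC-H-style column names:

    import duckdb

    # Aggregate directly over a Parquet file; the columnar scan reads
    # only the two columns the query references, not whole rows.
    print(duckdb.query("""
        SELECT l_returnflag, sum(l_extendedprice) AS revenue
        FROM 'lineitem.parquet'
        GROUP BY l_returnflag
    """).df())
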
| sedatesteak wrote:
| I got excited about DuckDB recently too. Used it yesterday for
| a new project at work and immediately ran into a not-
| implemented exception for my (awful) column naming structure,
| and discovered there is no pivot function.
|
| Otherwise, it's great, but obviously still a WIP.
|
| For those wondering: I have a helper function for SOQL queries
| for Salesforce that follows the structure object.field.
| Referring to a tablealias.[column.name], or with quotes instead
| of brackets, was a no-go.
| sgarrity wrote:
| How does this compare/relate to
| https://jlongster.com/future-sql-web (if at all)?
| ankoh wrote:
| The author outlines many problems that you'll run into when
| implementing a persistent storage backend using the current
| browser APIs.
|
| We faced many of them ourselves but paused any further work on
| an IndexedDB backend due to the lack of synchronous IndexedDB
| APIs (e.g. check the warning here:
| https://developer.mozilla.org/en-US/docs/Web/API/IDBDatabase...).
| He bypasses this issue using SharedArrayBuffers, which would
| lock DuckDB-Wasm to cross-origin-isolated sites. (See the
| "Multithreading" section in our blog post.)
|
| We might be able to lift this limitation in the future, but
| this has some far-reaching implications affecting the query
| execution of DuckDB itself.
|
| To the best of my knowledge, there's just no way to do
| synchronous persistency efficiently right now that won't lock
| you to a browser or to cross-origin isolation. But this will be
| part of our ongoing research.
| munro wrote:
| Cool! This is the first time I'm hearing about DuckDB, exciting
| as I heavily use SQLite. And these benchmarks show it's 6-15
| times faster than sql.js (SQLite) [1], along with another small
| benchmark I found [2]. I usually just slather on indexes in
| SQLite though, so indexed queries may not stand up as well; and
| it might not be as fast on storage, as this is comparing
| in-memory performance (I think?), but I'll give it a spin!
|
| Gonna throw this out there: the main thing I'm looking for from
| an embedded DB is better on-disk compression; I've been toying
| with RocksDB, but it's hard to tune optimally & it's really too
| low-level for my needs.
|
| [1] > ipython
|     import numpy as np
|     duckdb = [0.855, 0.179, 0.151, 0.197, 0.086, 0.319, 0.236,
|               0.351, 0.276, 0.194, 0.086, 0.137, 0.377]
|     sqlite = [8.441, 1.758, 0.384, 1.965, 1.294, 2.677, 4.126,
|               1.238, 1.080, 5.887, 1.194, 0.453, 1.272]
|     print((np.quantile(sqlite, q=[0.1, 0.9]) /
|            np.quantile(duckdb, q=[0.1, 0.9])).round())
|
| [2] https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html
| sitkack wrote:
| Do you need compression to get more bandwidth, or are you
| trying to save money on storage costs?
|
| DuckDB is a column-based DB, so you are going to see a
| throughput increase for queries that only use a handful of
| columns.
| [deleted]
| obeliskora wrote:
| There was a neat post a while ago
| (https://news.ycombinator.com/item?id=27016630) about using
| SQLite on static pages with large datasets that don't have to
| be loaded entirely. Does DuckDB do something similar with
| Arrow/Parquet files or its own format?
| ankoh wrote:
| Yes, we do! DuckDB-Wasm can read files using HTTP range
| requests, very similar to the sql.js-httpvfs from phiresky.
|
| The blog post contains a few examples of how this can be used,
| for example, to partially query Parquet files over the network.
|
| E.g. just visit shell.duckdb.org and enter:
|
| select * from 'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parqu...'
| limit 10;
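The same partial-read trick also works outside the browser: native
DuckDB has an httpfs extension (assuming your build ships it) that
issues HTTP range requests against remote Parquet files. A sketch;
the URL below is a placeholder, not the real demo file:

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")

    # Only the Parquet footer and the pages needed for ten rows are
    # fetched over HTTP, not the whole file.
    print(con.execute(
        "SELECT * FROM 'https://example.com/orders.parquet' LIMIT 10"
    ).fetchall())
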
| 1egg0myegg0 wrote:
| NIIIICE! Data Twitter was pretty excited about that cool SQLite
| trick - now you can turn it up a notch!
| tomrod wrote:
| Is Data Twitter == #datatwitter, like Econ Twitter is
| #econtwitter?
|
| If so, I have another cool community to follow!
| obeliskora wrote:
| It would be really cool to load DuckDB files too. sql.js-
| httpvfs seems convenient because it works on everything in the
| database, so you don't have to create indexes or set up keys
| and constraints in the client.
| ankoh wrote:
| I agree! DuckDB-Wasm can already open DuckDB database files in
| the browser the very same way.
| obeliskora wrote:
| That's really neat! Can you control the cache too?
| ankoh wrote:
| DuckDB-Wasm uses a traditional buffer manager and evicts pages
| using a combination of FIFO + LRU (to distinguish sequential
| scans from hot pages like the Parquet metadata).
| joos2010kj wrote:
| Awesome!
| pantsforbirds wrote:
| DuckDB is one of my favorite projects I've stumbled on
| recently. I've had multiple use cases pop up where I wanted to
| do some Pandas-type work but SQLite was a better fit, so it's
| really come in handy for me.
| xnx wrote:
| Similar(?): https://sql.js.org/ (SQLite in wasm)
| domoritz wrote:
| Yes, but DuckDB is optimized for analytics (columnar data and
| vectorized computation). Take a look at the comparison at
| https://shell.duckdb.org/versus.
| [deleted]
| pantsforbirds wrote:
| Anyone have a good benchmark comparing DuckDB to
| Parquet/Avro/ORC etc.? Super curious to see how some of those
| workflows might compare. Obviously at scale it's going to be
| different, but using a single Parquet file/dataset as a DB
| replacement isn't an uncommon thing in DS/ML work.
| mytherin wrote:
| Why compare DuckDB to Parquet when you can use DuckDB and
| Parquet [1] :)
|
| [1] https://duckdb.org/2021/06/25/querying-parquet.html
| tomnipotent wrote:
| Does DuckDB also use a PAX-like format like Parquet? Without
| going into the code, the best I could find with a little
| google-fu is the HyPer/Data Blocks paper - is this a relevant
| read?
| mytherin wrote:
| DuckDB's storage format has similar advantages to the Parquet
| storage format (e.g. individual columns can be read, partitions
| can be skipped, etc.), but it is different because DuckDB's
| format is designed to do more than Parquet files.
|
| Parquet files are intended to store data from a single table,
| and they are intended to be write-once: you write the file and
| then never change it again. If you want to change anything in a
| Parquet file, you rewrite the file.
|
| DuckDB's storage format is intended to store an entire database
| (multiple tables, views, sequences, etc.) and is intended to
| support ACID operations on those structures, such as
| insertions, updates, deletes, and even altering tables in the
| form of adding/removing columns or altering the types of
| columns without rewriting the entire table or the entire file.
|
| Tables are partitioned into row groups much like Parquet, but
| unlike Parquet, the individual columns of those row groups are
| divided into fixed-size blocks so that individual columns can
| be fetched from disk. The fixed-size blocks ensure that the
| file will not suffer from fragmentation as the database is
| modified.
|
| The storage format is still a work in progress, and we are
| currently actively working on adding more support for
| compression and other goodies, as well as stabilizing the
| format so that we can maintain backwards compatibility between
| versions.
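A small sketch of the persistence side of this from Python; the file
name is made up, and ALTER TABLE support depends on your DuckDB
version:

    import duckdb

    # Passing a path instead of ':memory:' opens the single-file,
    # multi-table database described above.
    con = duckdb.connect("analytics.duckdb")
    con.execute(
        "CREATE TABLE IF NOT EXISTS events (ts TIMESTAMP, user_id INTEGER)"
    )
    con.execute(
        "INSERT INTO events VALUES (TIMESTAMP '2021-10-29 14:58:00', 42)"
    )

    # Per the comment above, adding a column does not rewrite the
    # entire table or file.
    con.execute("ALTER TABLE events ADD COLUMN country VARCHAR")

    con.close()  # the data survives the process; reopen the file to query it
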
___________________________________________________________________ (page generated 2021-10-29 23:00 UTC)