[HN Gopher] DuckDB-Wasm: Efficient analytical SQL in the browser
       ___________________________________________________________________
        
       DuckDB-Wasm: Efficient analytical SQL in the browser
        
       Author : ankoh
       Score  : 172 points
       Date   : 2021-10-29 14:58 UTC (8 hours ago)
        
 (HTM) web link (duckdb.org)
 (TXT) w3m dump (duckdb.org)
        
       | typingmonkey wrote:
       | Does DuckDB support multi-tab usage? How big is the Wasm file
       | that must be loaded?
        
         | ankoh wrote:
         | The WebAssembly module is 1.6 - 1.8 MB brotli-compressed,
         | depending on the Wasm feature set. We're currently
         | investigating ways to reduce this to around 1 MB. We also use
         | streaming instantiation, which means that the WebAssembly
         | module is compiled while it is being downloaded. But it will
         | still hurt a bit more than a 40 KB library would.
         | 
         | Regarding multi-tab usage: not today. The available filesystem
         | APIs make it difficult to implement right now. We're looking
         | into ways to make DuckDB-Wasm persistent, but we can only read
         | in this release.
        
         | domoritz wrote:
         | On https://shell.duckdb.org/versus, we have a comparison with
         | related libraries. The Wasm bundle is currently 1.8 MB, but it
         | can be instantiated while it is streaming in.
         | 
         | The size probably makes it prohibitive to use DuckDB when your
         | dataset is small and download size matters, but we hope that
         | future improvements in WebAssembly can get the size down.
        
       | [deleted]
        
       | timwis wrote:
       | Interesting... Would this be effective at loading a remote CSV
       | file with a million rows, then performing basic GROUP BY COUNTs
       | on it so I can render bar charts?
       | 
       | I've been thinking of using absurd-sql for it since I saw
       | https://news.ycombinator.com/item?id=28156831 last week
        
         | texodus wrote:
         | I contribute to https://perspective.finos.org/ , which
         | supports all of this and quite a lot more.
         | 
         | Here's a 1,000,000-row example I just threw together for you:
         | 
         | https://bl.ocks.org/texodus/3802a8671fa77399c7842fd0deffe925
         | 
         | and a CSV example, so you can try yours right now:
         | 
         | https://bl.ocks.org/texodus/02d8fd10aef21b19d6165cf92e43e668
        
         | ankoh wrote:
         | It depends.
         | 
         | Querying CSV files is particularly painful over the network
         | since we still have to read everything for a full scan.
         | 
         | With Parquet, you would only have to read the columns of the
         | GROUP BY keys and the aggregate arguments.
         | 
         | Try it out and share your experiences with us!
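         | 
         | To make the column-pruning point concrete, here is a rough
         | sketch with the native DuckDB Python package (the file and
         | column names are made up for illustration):
         | 
         |       # A GROUP BY COUNT over a Parquet file: only the needed
         |       # column chunks are read, unlike a CSV full scan that
         |       # has to parse every row and every column.
         |       import duckdb
         |       con = duckdb.connect()
         |       counts = con.execute("""
         |           SELECT category, COUNT(*) AS n
         |           FROM 'events.parquet'
         |           GROUP BY category
         |           ORDER BY n DESC
         |       """).fetchdf()
         |       print(counts)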
        
       | elmolino89 wrote:
       | Not really DuckDB-Wasm question but DuckDB:
       | 
       | I have a dataset that is probably not suitable for loading into
       | an in-memory table (a CSV with close to 1000M rows). I split it
       | into 20M-row chunks, read them one by one into a DuckDB
       | temporary table, and exported them as Parquet.
       | 
       | A SELECT using the glob prefix.*.parquet WHERE mycolumn=foobar
       | does work but could be a bit faster. Apart from sorting the
       | input CSVs before converting to Parquet, what can be done? The
       | CSV chunks were already sorted.
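       | 
       | For reference, roughly what I'm doing, sketched with DuckDB's
       | Python API (file names, the glob pattern and the filter value
       | are placeholders):
       | 
       |       import duckdb
       |       con = duckdb.connect()
       |       # Convert one CSV chunk straight to Parquet, no temporary
       |       # table needed.
       |       con.execute("""
       |           COPY (SELECT * FROM read_csv_auto('prefix.000.csv'))
       |           TO 'prefix.000.parquet' (FORMAT 'PARQUET')
       |       """)
       |       # Query across all chunks with a glob; only the referenced
       |       # columns are read from each file.
       |       rows = con.execute("""
       |           SELECT * FROM 'prefix.*.parquet'
       |           WHERE mycolumn = 'foobar'
       |       """).fetchall()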
        
       | fnord77 wrote:
       | This looks so cool. Is there a demo page pre-loaded with tables
       | so people can try out queries right away?
        
         | 1egg0myegg0 wrote:
         | If you head over to the shell demo, you can run a query like
         | the one below!
         | 
         | https://shell.duckdb.org/
         | 
         | select * from 'https://shell.duckdb.org/data/tpch/0_01/parquet/
         | orders.parqu...' limit 10;
        
       | aynyc wrote:
       | I wish it supported ORC.
        
       | crimsoneer wrote:
       | I'm still not sure I "get" the use case for DuckDB. From what I
       | understand, it's like a nifty in-memory SQL engine, but why is
       | that better than just running Postgres or Microsoft SQL Server
       | locally, where your data structures and tables and such have a
       | lot more permanence?
       | 
       | Like, my workflow is that I either query an existing remote
       | corporate DB and do my initial data munging there, or get given
       | a data dump that I either work on directly in Pandas or add to a
       | local DB and do a little more cleaning there. It's not at all
       | clear how DuckDB would help.
        
         | nicoburns wrote:
         | DuckDB is columnar, so in theory it is a lot faster than
         | Postgres or SQL Server for analytical workloads.
         | 
         | DuckDB is to ClickHouse, TimescaleDB, Redshift, etc. as SQLite
         | is to Postgres, MySQL, and SQL Server.
        
           | deadliftpro wrote:
           | Where do you get the idea that SQL Server does not support
           | columnar storage? That claim is wrong.
        
             | wenc wrote:
             | I don't think the OP said that SQL Server doesn't support
             | columnar, only that by analogy SQL Server is primarily a
             | row store (which for most of its history was true).
             | 
             | Columnar technology in SQL Server only became usable in SQL
             | 2016 (it existed in 2012 but was too restrictive -- I know
             | because I tried to use it).
             | 
             | In 2016 and above, you can either create a columnar index
             | (a nonclustered columnstore) or convert an entire table
             | into a columnar table (a clustered columnstore). The
             | technology is actually pretty impressive and I've used it
             | in production where I have row stores coexisting with
             | column stores within the same database.
        
         | Dayshine wrote:
         | The simple answer no one else seems to have mentioned: SQLite
         | has quite a low limit on the number of columns it supports,
         | which is a problem for data analytics, which often prefers
         | wide over long formats.
        
         | 1egg0myegg0 wrote:
         | Check out this post for some comparisons with Pandas.
         | 
         | https://duckdb.org/2021/05/14/sql-on-pandas.html
         | 
         | DuckDB is often faster than Pandas, and it can handle
         | larger-than-memory data. Plus, if you already know SQL, you
         | don't have to become a Pandas expert to be productive in
         | Python data munging. Pandas is still good, but now you can mix
         | and match with SQL!
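         | 
         | As a small sketch of what that mixing looks like (the
         | DataFrame and its columns are made up here):
         | 
         |       import duckdb
         |       import pandas as pd
         |       orders = pd.DataFrame({
         |           "customer": ["a", "b", "a", "c"],
         |           "amount":   [10.0, 20.0, 5.0, 7.5],
         |       })
         |       # DuckDB runs SQL directly against the in-scope
         |       # DataFrame and hands the result back as a DataFrame.
         |       top = duckdb.query("""
         |           SELECT customer, SUM(amount) AS total
         |           FROM orders
         |           GROUP BY customer
         |           ORDER BY total DESC
         |       """).to_df()
         |       print(top)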
        
         | catawbasam wrote:
         | Not just in-memory. It's pretty convenient if you have a set of
         | Parquet files with common schema. Fairly snappy and doesn't
         | have to fit in memory.
        
           | jamesrr39 wrote:
           | I'm using duckdb for querying parquet files as well. It's an
           | awesome tool, so nice to just "look into" parquet files with
           | SQL.
        
             | deshpand wrote:
             | Many enterprises are coming up with patterns where they
             | replicate data from the database (say, Redshift) into
             | Parquet files (a data lake?) and direct more traffic,
             | including analytical workloads, onto the Parquet files.
             | 
             | DuckDB will be very useful here, instead of having to use
             | Redshift Spectrum or the like.
        
         | deshpand wrote:
         | I work heavily with pandas and dask (when you want to use
         | multiple cores), using parquet files for storage. We see a lot
         | of benefits in selectively bringing in duckdb into the mix. For
         | instance, the joins are extremely slow with both pandas and
         | dask and require a lot of memory. That's a situation where
         | using duckdb reduces the memory needs and speeds things up a
         | lot.
         | 
         | And we may not want to upload the data into postgres or another
         | database. We can just work with parquet files and run in-
         | process queries.
        
         | mytherin wrote:
         | DuckDB developer here. DuckDB is a regular RDBMS that has
         | persistent ACID storage, but is tuned towards analytical
         | workloads, i.e. read-heavy workloads with aggregates that
         | require full scans of the data. Any data you write to tables is
         | stored persistently on disk, and not all your data needs to fit
         | in memory either.
         | 
         | Our tagline is "SQLite for analytics", as DuckDB is an in-
         | process database system similar to SQLite that is geared
         | towards these types of workloads.
         | 
         | DuckDB has a flexible query engine, and also has support for
         | directly running SQL queries (in parallel!) on top of Pandas
         | [1] and Parquet [2] without requiring the data to be imported
         | into the system.
         | 
         | [1] https://duckdb.org/2021/05/14/sql-on-pandas.html
         | 
         | [2] https://duckdb.org/2021/06/25/querying-parquet.html
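         | 
         | As a minimal sketch of the in-process, persistent side in
         | Python (the database file and table here are hypothetical):
         | 
         |       import duckdb
         |       # Connecting to a path instead of ':memory:' gives a
         |       # persistent, ACID database file on disk.
         |       con = duckdb.connect("analytics.duckdb")
         |       con.execute("""
         |           CREATE TABLE IF NOT EXISTS events (
         |               ts TIMESTAMP, kind VARCHAR, value DOUBLE)
         |       """)
         |       con.execute("""
         |           INSERT INTO events
         |           VALUES (TIMESTAMP '2021-10-29 14:58:00', 'click', 1.0)
         |       """)
         |       # Read-heavy, full-scan style aggregate over the table.
         |       print(con.execute("""
         |           SELECT kind, COUNT(*) AS n, AVG(value) AS avg_value
         |           FROM events GROUP BY kind
         |       """).fetchdf())
         |       con.close()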
        
           | brandmeyer wrote:
           | Maybe this is a silly question: Why is the A/B choice between
           | a row-major database and a column-major database, instead of
           | between row-major tables and column-major tables within a
           | flexible database?
           | 
           | What's stopping the other leading brands from implementing
           | columnar storage, queries, and such with a COLUMN MAJOR table
           | attribute?
        
             | dgudkov wrote:
             | SQL calculations on columnar data are quite different from
             | those in row-based databases, so it's effectively a
             | different database engine. A columnar data store brings
             | multiple advantages, because it usually employs a form of
             | dictionary (vocabulary) compression. For instance,
             | obtaining the distinct values of a field in a columnar DB
             | is much faster because they are typically just the field's
             | dictionary, so it doesn't even require a full table scan.
             | Many other columnar computations, such as filtering or
             | aggregation, can be done on compressed data without
             | decompression. (See the toy sketch below.)
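             | 
             | A toy illustration of the idea in plain Python (not how
             | DuckDB or any particular engine actually implements it):
             | 
             |       values = ["red", "blue", "red", "green", "blue", "red"]
             |       # Encode: store each distinct value once, plus a
             |       # small integer code per row.
             |       dictionary = sorted(set(values))  # ['blue', 'green', 'red']
             |       codes = [dictionary.index(v) for v in values]
             |       # DISTINCT is just the dictionary; no scan over the
             |       # (possibly huge) codes column.
             |       distinct_values = dictionary
             |       # A filter like value = 'red' compares integer codes
             |       # and never decompresses the column.
             |       red = dictionary.index("red")
             |       matches = [i for i, c in enumerate(codes) if c == red]
             |       print(distinct_values, matches)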
        
             | mytherin wrote:
             | Some databases do offer both, but it is much more involved
             | than just changing the storage model. The entire query
             | execution model needs to adapt to columnar execution. You
             | can simulate a column store model in a row-store database
             | by splitting a table into a series of single-column tables,
              | but the performance benefits you will capture are much
              | smaller than with a system that is designed and optimized
              | for column-store execution.
        
         | loxias wrote:
         | To me, the use case is really obvious: when you reached for
         | SQLite but now want something with Moar Powah.
         | 
         | Now I've reduced it to a previously answered question: what's
         | the use case for SQLite? ;)
         | https://www.sqlite.org/whentouse.html
         | 
         | That being said, I don't see the point, and shudder at the idea
         | of a web page's JavaScript doing anything which _needs_
         | noticeable amounts of CPU, but I'm a non-standard user...
        
         | kf6nux wrote:
         | I don't fully get the use case either, but it's in a different
         | category than Postgres or Microsoft SQL because it runs in the
         | browser and can be made part of your web app.
        
           | ankoh wrote:
           | DuckDB-wasm is targeting the browser so it's not directly
           | competing with Pandas (that's the job of native DuckDB).
           | 
           | It's targeting use cases where you want to push analytical
           | computation away from servers into the client (browser).
           | 
           | Let me sketch two examples:
           | 
           | A) You have your data sitting in S3 and the user-specific
           | data is in a browser-manageable area.
           | 
           | (E.g. this paper from Tableau research actually states
           | dataset sizes that should fall into that category: https://re
           | search.tableau.com/sites/default/files/get_real.pd...)
           | 
           | In that scenario, you could eliminate your central server if
           | your clients are smart enough.
           | 
           | B) You are talking about larger dataset sizes (GB) and want
           | to explore them ad-hoc in your browser.
           | 
           | Uploading them is unrealistic and installing additional
           | software is no longer ad-hoc enough.
        
       | sedatesteak wrote:
       | I got excited about DuckDB recently too. I used it yesterday for
       | a new project at work and immediately ran into a "not
       | implemented" exception for my (awful) column naming structure,
       | and discovered there is no pivot function.
       | 
       | Otherwise, it's great, but obviously still a WIP.
       | 
       | For those wondering, I have a helper function for SOQL queries
       | for Salesforce that follows the structure object.field.
       | 
       | Referring to a tablealias.[column.name], or with quotes instead
       | of brackets, was a no-go.
        
       | sgarrity wrote:
       | How does this compare/relate to https://jlongster.com/future-sql-
       | web (if at all)?
        
         | ankoh wrote:
         | The author outlines many problems that you'll run into when
         | implementing a persistent storage backend using the current
         | browser APIs.
         | 
         | We faced many of them ourselves but paused any further work on
         | an IndexedDB backend due to the lack of synchronous IndexedDB
         | APIs (e.g. check the warning here
         | https://developer.mozilla.org/en-
         | US/docs/Web/API/IDBDatabase...). He bypasses this issue using
         | SharedArrayBuffers which would lock DuckDB-Wasm to cross-
         | origin-isolated sites. (See the "Multithreading" section in our
         | blog post)
         | 
         | We might be able to lift this limitation in the future but this
         | has some far-reaching implications affecting the query
         | execution of DuckDB itself.
         | 
         | To the best of my knowledge, there's just no way to do
         | synchronous persistence efficiently right now that won't lock
         | you to a specific browser or to cross-origin isolation. But
         | this will be part of our ongoing research.
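         | 
         | For context: "cross-origin isolated" means the page is served
         | with COOP/COEP headers, which is what unlocks
         | SharedArrayBuffer. A tiny illustrative dev server (not part of
         | DuckDB-Wasm) that sets them:
         | 
         |       from http.server import (SimpleHTTPRequestHandler,
         |                                ThreadingHTTPServer)
         |       class COIHandler(SimpleHTTPRequestHandler):
         |           def end_headers(self):
         |               # These two headers opt the page into
         |               # cross-origin isolation.
         |               self.send_header("Cross-Origin-Opener-Policy",
         |                                "same-origin")
         |               self.send_header("Cross-Origin-Embedder-Policy",
         |                                "require-corp")
         |               super().end_headers()
         |       # Serve the current directory on http://localhost:8000.
         |       ThreadingHTTPServer(("localhost", 8000),
         |                           COIHandler).serve_forever()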
        
       | munro wrote:
       | Cool! This is the first time I'm hearing about DuckDB, which is
       | exciting as I heavily use SQLite. These benchmarks show it's
       | 6-15 times faster than sql.js (SQLite) [1], along with another
       | small benchmark I found [2]. I usually just slather on indexes
       | in SQLite though, so indexed queries may not stand up as well;
       | and it might not be as fast on storage, as this is comparing
       | in-memory performance (I think?), but I'll give it a spin!
       | 
       | Gonna throw this out there: the main thing I'm looking for from
       | an embedded DB is better on-disk compression; I've been toying
       | with RocksDB, but it's hard to tune optimally & it's really too
       | low-level for my needs.
       | 
       | [1] > ipython
       | 
       |       import numpy as np
       |       duckdb = [0.855, 0.179, 0.151, 0.197, 0.086, 0.319, 0.236,
       |                 0.351, 0.276, 0.194, 0.086, 0.137, 0.377]
       |       sqlite = [8.441, 1.758, 0.384, 1.965, 1.294, 2.677, 4.126,
       |                 1.238, 1.080, 5.887, 1.194, 0.453, 1.272]
       |       print((np.quantile(sqlite, q=[0.1, 0.9]) /
       |              np.quantile(duckdb, q=[0.1, 0.9])).round())
       | 
       | [2] https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html
        
         | sitkack wrote:
         | Do you need compression to get more bandwidth or are you trying
         | to save money on storage costs?
         | 
         | DuckDB is a column based db, so you are going to see a
         | throughput increase for queries that only use a handful of
         | columns.
        
       | [deleted]
        
       | obeliskora wrote:
       | There was a neat post
       | https://news.ycombinator.com/item?id=27016630 a while ago about
       | using SQLite on static pages with large datasets that don't have
       | to be loaded entirely. Does DuckDB do something similar with
       | Arrow/Parquet files or its own format?
        
         | ankoh wrote:
         | Yes we do! DuckDB-Wasm can read files using HTTP range
         | requests, very similarly to phiresky's sql.js-httpvfs.
         | 
         | The blog post contains a few examples of how this can be used,
         | for example to partially query Parquet files over the network.
         | 
         | E.g. just visit shell.duckdb.org and enter:
         | 
         | select * from 'https://shell.duckdb.org/data/tpch/0_01/parquet/
         | orders.parqu...' limit 10;
        
           | 1egg0myegg0 wrote:
           | NIIIICE! Data twitter was pretty excited about that cool
           | SQLite trick - now you can turn it up a notch!
        
             | tomrod wrote:
             | Is data twitter == #datatwitter, like Econ Twitter is
             | #econtwitter?
             | 
             | If so, I have another cool community to follow!
        
           | obeliskora wrote:
            | It would be really cool to load DuckDB files too. sql.js-
            | httpvfs seems convenient because it works on everything in
            | the database, so you don't have to create indexes or set up
            | keys and constraints in the client.
        
             | ankoh wrote:
             | I agree! DuckDB-Wasm can already open DuckDB database files
             | in the browser the very same way.
        
           | obeliskora wrote:
           | That's really neat! Can you control the cache too?
        
             | ankoh wrote:
             | DuckDB-Wasm uses a traditional buffer manager and evicts
             | pages using a combination of FIFO + LRU (to distinguish
             | sequential scans from hot pages like the Parquet metadata).
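             | 
             | (As a toy sketch of that general idea, not DuckDB-Wasm's
             | actual implementation: pages enter a FIFO "probation"
             | queue on first use and are promoted to an LRU section when
             | touched again, so one-off sequential scans can't flush hot
             | pages.)
             | 
             |       from collections import OrderedDict
             | 
             |       class ToyBufferPool:
             |           def __init__(self, capacity):
             |               self.capacity = capacity
             |               self.fifo = OrderedDict()  # seen once
             |               self.lru = OrderedDict()   # re-referenced
             | 
             |           def access(self, page_id, load_page):
             |               if page_id in self.lru:    # hot: refresh
             |                   self.lru.move_to_end(page_id)
             |                   return self.lru[page_id]
             |               if page_id in self.fifo:   # promote to LRU
             |                   self.lru[page_id] = self.fifo.pop(page_id)
             |                   return self.lru[page_id]
             |               data = load_page(page_id)  # miss: probation
             |               self.fifo[page_id] = data
             |               self._evict()
             |               return data
             | 
             |           def _evict(self):
             |               while (len(self.fifo) + len(self.lru)
             |                      > self.capacity):
             |                   victim = self.fifo if self.fifo else self.lru
             |                   victim.popitem(last=False)  # oldest first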
        
       | joos2010kj wrote:
       | Awesome!
        
       | pantsforbirds wrote:
       | DuckDB is one of my favorite projects I've stumbled on recently.
       | I've had multiple use cases pop up where I wanted to do some
       | pandas-type work but SQLite was a better fit, so it's really
       | come in handy for me.
        
       | xnx wrote:
       | Similar(?): https://sql.js.org/ (SQLite in wasm)
        
         | domoritz wrote:
         | Yes but DuckDB is optimized for analytics (columnar data and
         | vectorized computation). Take a look at the comparison in
         | https://shell.duckdb.org/versus.
        
         | [deleted]
        
       | pantsforbirds wrote:
       | Anyone have a good benchmark comparing DuckDB to
       | Parquet/Avro/ORC, etc.? Super curious to see how some of those
       | workflows might compare. Obviously at scale it's going to be
       | different, but using a single Parquet file/dataset as a DB
       | replacement isn't an uncommon thing in DS/ML work.
        
         | mytherin wrote:
         | Why compare DuckDB to Parquet when you can use DuckDB and
         | Parquet [1] :)
         | 
         | [1] https://duckdb.org/2021/06/25/querying-parquet.html
        
           | tomnipotent wrote:
           | Does DuckDB also use a PAX-like format like Parquet? Without
           | going into code, the best I could find with a little googlefu
           | is the HyPer/Data Blocks paper - is this a relevant read?
        
             | mytherin wrote:
              | DuckDB's storage format has similar advantages to the
              | Parquet storage format (e.g. individual columns can be
              | read, partitions can be skipped, etc.) but it is different
             | because DuckDB's format is designed to do more than Parquet
             | files.
             | 
              | Parquet files are intended to store data from a single
              | table and they are intended to be write-once: you write
              | the file and then never change it again. If you want to
              | change anything in a Parquet file, you rewrite the file.
             | 
             | DuckDB's storage format is intended to store an entire
             | database (multiple tables, views, sequences, etc), and is
             | intended to support ACID operations on those structures,
             | such as insertions, updates, deletes, and even altering
             | tables in the form of adding/removing columns or altering
             | types of columns without rewriting the entire table or the
             | entire file.
             | 
             | Tables are partitioned into row groups much like Parquet,
             | but unlike Parquet the individual columns of those row
             | groups are divided into fixed-size blocks so that
             | individual columns can be fetched from disk. The fixed-size
             | blocks ensure that the file will not suffer from
             | fragmentation as the database is modified.
             | 
             | The storage is still a work in progress, and we are
             | currently actively working on adding more support for
             | compression and other goodies, as well as stabilizing the
             | storage format so that we can maintain backwards
             | compatibility between versions.
        
       ___________________________________________________________________
       (page generated 2021-10-29 23:00 UTC)