[HN Gopher] Show HN: Easily Convert WARC (Web Archive) into Parq...
___________________________________________________________________

Show HN: Easily Convert WARC (Web Archive) into Parquet, Then
Query with DuckDB

Author : llambda
Score  : 61 points
Date   : 2022-06-24 18:26 UTC (4 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| mritchie712 wrote:
| Nice! I've been considering using DuckDB for our product (to
| speed up joins and aggregates of in-memory data); it's an
| incredible technology.

| wahnfrieden wrote:
| How does this compare with the SQLite approaches shared recently?

| infogulch wrote:
| Well, there's a virtual table extension to read Parquet files in
| SQLite. I've not tried it myself.
| https://github.com/cldellow/sqlite-parquet-vtable

| westurner wrote:
| Could this work with Datasette (which is a flexible interface
| to SQLite with a web-based query editor)?

| llambda wrote:
| It's a great question: fundamentally, the Parquet format offers
| columnar orientation. With datasets like these, there's some
| research[0] indicating this is a preferable way of storing and
| querying WARC.
|
| DuckDB, like SQLite, is serverless. DuckDB has a leg up on
| SQLite, though, when it comes to Parquet: Parquet is supported
| directly in DuckDB, and this makes dealing with these datasets a
| breeze.
|
| [0] https://www.researchgate.net/figure/Comparing-WARC-CDX-
| Parqu...

| 1egg0myegg0 wrote:
| Good question! As a disclaimer, I work for DuckDB Labs.
|
| There are two big benefits to working with Parquet files in
| DuckDB, and both relate to speed!
|
| DuckDB can query Parquet right where it sits, so there is no
| need to insert it into the database first. This is typically
| much faster. Also, DuckDB's engine is columnar (SQLite's is row
| based), so it can run faster analytical queries over that
| format. I have seen 20-100x speed improvements over SQLite in
| analytical workloads.
|
| Happy to answer any questions!
| arpinum wrote:
| Do you see DuckDB as a possible replacement for AWS Athena?
| Where would Athena still be better than DuckDB + Parquet +
| Lambda?

| wenc wrote:
| DuckDB user here. As far as I can tell, DuckDB doesn't support
| distributed computation, so you have to set that up yourself,
| whereas Athena is essentially Presto -- it handles that detail
| for you. It also doesn't support Avro or ORC yet.
|
| DuckDB excels at single-machine compute where everything fits
| in memory or is streamable (data can be local or on S3) -- it's
| lightweight and vectorized. I use it in Jupyter notebooks and
| in Python code.
|
| But it may not be the right tool if you need distributed
| compute over a very large dataset.

| wenc wrote:
| DuckDB has SQLite semantics but is natively built around
| columnar formats (Parquet, in-memory Arrow) and strong types
| (including dates). It also supports very complex SQL.
|
| SQLite is a row store built around row-based transactional
| workloads. DuckDB is built around analytics workloads (lots of
| filtering, aggregation, and transformation), and for these
| workloads DuckDB is just way, way faster. Source: personal
| experience.
___________________________________________________________________
(page generated 2022-06-24 23:00 UTC)