[HN Gopher] Show HN: Easily Convert WARC (Web Archive) into Parq...
       ___________________________________________________________________
        
       Show HN: Easily Convert WARC (Web Archive) into Parquet, Then Query
       with DuckDB
        
       Author : llambda
       Score  : 61 points
       Date   : 2022-06-24 18:26 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | mritchie712 wrote:
        | Nice! I've been considering using DuckDB for our product (to
        | speed up joins and aggregates of in-memory data); it's an
        | incredible technology.
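        | 
        | For what it's worth, a minimal sketch of that use case in
        | Python (assuming pandas DataFrames named orders and
        | customers; DuckDB picks them up by variable name):
        | 
        |     import duckdb
        |     import pandas as pd
        | 
        |     # Two in-memory DataFrames (illustrative example data)
        |     orders = pd.DataFrame({"customer_id": [1, 1, 2],
        |                            "amount": [10.0, 20.0, 5.0]})
        |     customers = pd.DataFrame({"customer_id": [1, 2],
        |                               "name": ["Ada", "Grace"]})
        | 
        |     # DuckDB joins and aggregates the DataFrames in place,
        |     # referencing them by their Python variable names.
        |     result = duckdb.query("""
        |         SELECT c.name, SUM(o.amount) AS total
        |         FROM orders o
        |         JOIN customers c USING (customer_id)
        |         GROUP BY c.name
        |     """).to_df()
        |     print(result)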
        
       | wahnfrieden wrote:
       | How does this compare with SQLite approaches shared recently?
        
         | infogulch wrote:
          | Well, there's a virtual table extension to read Parquet
          | files in SQLite. I've not tried it myself.
         | https://github.com/cldellow/sqlite-parquet-vtable
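          | 
          | From that repo's README, usage is roughly as follows (a
          | sketch via Python's sqlite3; the extension path and Parquet
          | filename are placeholders):
          | 
          |     import sqlite3
          | 
          |     conn = sqlite3.connect(":memory:")
          |     conn.enable_load_extension(True)
          |     # Placeholder path; build the extension per the README.
          |     conn.load_extension("./libparquet")
          | 
          |     # Expose a Parquet file as a virtual table, then query
          |     # it with plain SQL.
          |     conn.execute(
          |         "CREATE VIRTUAL TABLE demo "
          |         "USING parquet('example.parquet')")
          |     for row in conn.execute("SELECT COUNT(*) FROM demo"):
          |         print(row)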
        
           | westurner wrote:
           | Could this work with datasette (which is a flexible interface
           | to sqlite with a web-based query editor)?
        
         | llambda wrote:
          | It's a great question: fundamentally, the Parquet format
          | offers columnar orientation. With datasets like these,
          | there's some research[0] indicating this is a preferable
          | way of storing and querying WARC.
          | 
          | DuckDB, like SQLite, is serverless. DuckDB has a leg up on
          | SQLite when it comes to Parquet, though: Parquet is
          | supported natively, which makes dealing with these datasets
          | a breeze.
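          | 
          | Concretely, once the WARC is converted, querying looks like
          | this (a sketch; the filename and column names are
          | illustrative, not the project's exact schema):
          | 
          |     import duckdb
          | 
          |     # DuckDB scans the WARC-derived Parquet file in place.
          |     con = duckdb.connect()
          |     rows = con.execute("""
          |         SELECT content_type, COUNT(*) AS n
          |         FROM 'crawl.parquet'
          |         GROUP BY content_type
          |         ORDER BY n DESC
          |     """).fetchall()
          |     print(rows)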
         | 
         | [0] https://www.researchgate.net/figure/Comparing-WARC-CDX-
         | Parqu...
        
         | 1egg0myegg0 wrote:
         | Good question! As a disclaimer, I work for DuckDB Labs.
         | 
          | There are two big benefits to working with Parquet files in
          | DuckDB, and both relate to speed!
          | 
          | DuckDB can query Parquet right where it sits, so there is
          | no need to load it into the database first. This is
          | typically much faster. Also, DuckDB's engine is columnar
          | (SQLite's is row-based), so it can run analytical queries
          | against that format much faster. I have seen 20-100x speed
          | improvements over SQLite in analytical workloads.
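          | 
          | For example (a minimal sketch; the filename and columns are
          | placeholders):
          | 
          |     import duckdb
          | 
          |     # No load step: DuckDB scans the Parquet file directly,
          |     # reading only the columns the query touches.
          |     df = duckdb.query("""
          |         SELECT user_id, AVG(latency_ms) AS avg_latency
          |         FROM read_parquet('events.parquet')
          |         GROUP BY user_id
          |     """).to_df()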
         | 
         | Happy to answer any questions!
        
           | arpinum wrote:
           | Do you see DuckDB as a possible replacement for AWS Athena?
           | Where would Athena still be better than DuckDB + Parquet +
           | Lambda?
        
             | wenc wrote:
              | DuckDB user here. As far as I can tell, DuckDB doesn't
              | support distributed computation, so you have to set
              | that up yourself, whereas Athena is essentially Presto
              | -- it handles that detail for you. DuckDB also doesn't
              | support Avro or ORC yet.
             | 
              | DuckDB excels at single-machine compute where
              | everything fits in memory or is streamable (data can be
              | local or on S3) -- it's lightweight and vectorized. I
              | use it in Jupyter notebooks and in Python code.
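              | 
              | E.g., reading straight from S3 (a sketch; bucket, key,
              | and credentials are placeholders, and it assumes the
              | httpfs extension):
              | 
              |     import duckdb
              | 
              |     con = duckdb.connect()
              |     # httpfs adds HTTP/S3 filesystem support.
              |     con.execute("INSTALL httpfs;")
              |     con.execute("LOAD httpfs;")
              |     con.execute("SET s3_region='us-east-1';")
              |     con.execute("SET s3_access_key_id='...';")
              |     con.execute("SET s3_secret_access_key='...';")
              | 
              |     # Stream the Parquet file from S3; no local copy.
              |     print(con.execute(
              |         "SELECT COUNT(*) FROM "
              |         "read_parquet('s3://my-bucket/data.parquet')"
              |     ).fetchall())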
             | 
             | But it may not be the right tool if you need distributed
             | compute over a very large dataset.
        
         | wenc wrote:
          | DuckDB has SQLite semantics but is natively built around
          | columnar formats (Parquet, in-memory Arrow) and strong
          | types (including dates). It also supports very complex SQL.
          | 
          | SQLite is a row store built around row-based transactional
          | workloads. DuckDB is built around analytical workloads
          | (lots of filtering, aggregations, and transformations), and
          | for these workloads DuckDB is just way, way faster. Source:
          | personal experience.
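          | 
          | As a small illustration (assuming a pyarrow table named tbl
          | in scope; DuckDB scans Arrow data in place by variable
          | name):
          | 
          |     from datetime import date
          | 
          |     import duckdb
          |     import pyarrow as pa
          | 
          |     # In-memory Arrow table with a typed date column.
          |     tbl = pa.table({
          |         "day": pa.array([date(2022, 6, 1),
          |                          date(2022, 6, 2)]),
          |         "hits": pa.array([10, 42]),
          |     })
          | 
          |     # No copy into the database: DuckDB queries the Arrow
          |     # table directly, with dates staying strongly typed.
          |     print(duckdb.query(
          |         "SELECT day, hits FROM tbl WHERE hits > 20").to_df())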
        
       ___________________________________________________________________
       (page generated 2022-06-24 23:00 UTC)