[HN Gopher] Show HN: WarcDB: Web crawl data as SQLite databases
       ___________________________________________________________________
        
       Show HN: WarcDB: Web crawl data as SQLite databases
        
       Author : fforflo
       Score  : 124 points
       Date   : 2022-06-19 13:26 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | jbverschoor wrote:
        | I like that you're logging responses instead of just the result /
        | payload. Reminds me of an idea I had of using something like a
        | queue as an intermediary between a webserver and the backend. I
        | don't exactly remember my reasoning anymore.
        
       | uniqueuid wrote:
       | Since this is pretty new, some background:
       | 
        | WARC is a file format written and read by a rather small but
        | specialized set of web crawler tools, most notably the Internet
        | Archive's tooling. For example, its Java-based crawler Heritrix
        | produces WARC files.
       | 
        | There are a couple of other very cool tools, such as warcprox,
        | which can create web archives from web browser activity by acting
        | as a proxy, pywb, which plays those archives back to make them
        | browsable, and related libraries [1] (shoutout to Noah Levitt,
        | Ilya Kreymer and collaborators for building all this).
       | 
        | The file format itself is an ISO standard, and it's very simple:
        | capture all HTTP headers sent and received, and all HTTP bodies
        | sent and received, and do simple de-duplication based on digests
        | of the body.
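        | 
        | For a concrete sense of that structure, here's a minimal sketch
        | of iterating over records with the warcio library (the filename
        | is just a placeholder):
        | 
        |     from warcio.archiveiterator import ArchiveIterator
        | 
        |     # Walk the concatenated records of a (possibly gzipped) WARC.
        |     with open('example.warc.gz', 'rb') as stream:
        |         for rec in ArchiveIterator(stream):
        |             if rec.rec_type == 'response':
        |                 uri = rec.rec_headers.get_header('WARC-Target-URI')
        |                 body = rec.content_stream().read()
        |                 print(uri, len(body))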
       | 
       | There is a companion format, CDX, which builds indexes from warc
       | files (which in turn are just concatenated records, so rather
       | robust).
       | 
       | Although all of this is great, I worry a bit about where we're
       | heading with QUIC / udp-based protocols, websockets and other
       | very involved protocols which ultimately make archival much
       | harder.
       | 
       | If there's anything you can do to help these (or other) tools to
       | keep our web's archival records alive and flowing, please do so.
       | 
       | [1] https://github.com/internetarchive/warcprox
        
         | lijogdfljk wrote:
          | Would the WARC format reduce the effort needed to make Reader-
          | like programs? I.e. strip pages of HTML cruft, leaving you with
          | text, images, etc. - the content?
        
           | simonw wrote:
           | I've found the Readability.js library to be really good for
           | that - here's my recipe for running it as a CLI:
           | https://til.simonwillison.net/shot-scraper/readability
        
           | uniqueuid wrote:
           | Not at all.
           | 
           | A WARC file gives you exactly what you would see on the wire,
           | or in the network inspector tab of your browser. It does
           | nothing to the content, and that's the point.
           | 
           | The only thing you gain (and that's very important for other
           | reasons as well) is an immutable ground truth to work from
           | when creating the reader view of a given article.
        
             | lijogdfljk wrote:
              | Gotcha - yeah, I was hoping maybe it snapshotted the HTML
              | or some such, sidestepping some issues with long dynamic
              | text or JS shenanigans.
        
         | fforflo wrote:
          | The web archiving community is surprisingly small and
          | fragmented (in terms of tools) given its impact. Thankfully the
          | .warc format looks pretty powerful and standard for the web we
          | have so far (which is a lot!).
          | 
          | Now with the new protocols, dunno, maybe it's too soon to
          | worry? Then again, maybe it's an IPv4 / IPv6 analogy.
         | 
         | > There is a companion format, CDX, which builds indexes from
         | warc files (which in turn are just concatenated records, so
         | rather robust).
         | 
          | Good point. I was planning on combining this fact with the
          | ATTACH option that SQLite has, which allows querying across
          | multiple database files [0].
         | 
         | [0] https://www.sqlite.org/lang_attach.html
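          | 
          | As a rough sketch of that idea (the file names and the
          | 'response' table below are placeholders, not necessarily the
          | actual WarcDB schema):
          | 
          |     import sqlite3
          | 
          |     # Open one archive file and attach a second one.
          |     con = sqlite3.connect('crawl_part1.warcdb')
          |     con.execute("ATTACH DATABASE 'crawl_part2.warcdb' AS part2")
          | 
          |     # One query can now span both files.
          |     rows = con.execute("""
          |         SELECT count(*) FROM response
          |         UNION ALL
          |         SELECT count(*) FROM part2.response
          |     """).fetchall()
          |     print(rows)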
        
           | uniqueuid wrote:
           | Oh hi, thanks for building this! I haven't had the chance to
           | play with it, but my hunch is that sqlite for warc can fill a
           | great niche and would be much more portable (and probably
           | performant).
           | 
           | Allowing multiple DB files is a great idea, since that
           | fundamentally enables large archives, cold-storage and so on.
        
         | [deleted]
        
         | ma2rten wrote:
         | Common Crawl is also in WARC format.
        
       | tepitoperrito wrote:
       | It'd be neat to extend the warc format and tooling to support
       | cached http responses for things like REST endpoints. Then you
       | could make sure everything you did in a session is recorded for
       | later use.
       | 
       | From the specification it would appear fairly straightforward
       | once an approach was chosen...
       | 
       | Here's the relevant extract from the warc spec that informed my
       | difficulty estimate - "The WARC (Web ARChive) file format offers
       | a convention for concatenating multiple resource records (data
       | objects), each consisting of a set of simple text headers and an
       | arbitrary data block into one long file."
       | 
       | Edit: Upon 2 minutes of reflection I think the way to go for what
       | I'm envisioning is some kind of browser session recording ->
       | replayable archive solution.
        
         | uniqueuid wrote:
         | Although this is a nice idea, it's extremely difficult to get
         | it completely right.
         | 
         | Consider a SPA where navigation happens via xhr or similar
         | requests and updates are json that's patched into the DOM. Even
         | browsers have a hard time figuring out how to make this a
         | coherent session.
         | 
          | Now with warc, you get a single record per _transfer_, i.e.
         | every json file, every image, every css is an individual
         | record. It's completely up to the client/downstream tech to re-
         | assemble this into a coherent page.
         | 
         | If you want to go down that road, my best suggestion would be
         | to start with a browser's history - that's probably the most
         | solid version of a session that we have right now.
        
       | fforflo wrote:
        | I built this as a small utility within a larger project I'm
        | working on these days. (Contact me if you're curious or want to
        | support it.)
        | 
        | The WARC format is extremely simple and yet so powerful. Most
        | importantly though, there are already pebibytes of crawled
        | archives out there.
       | 
       | This is a fairly straightforward mapping of a .warc file to a
       | .sqlite database. The goal is to make such archives SQL-able even
       | in smaller pieces.
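        | 
        | For a sense of what that mapping involves (this is only an
        | illustration with made-up table names, not WarcDB's actual code
        | or schema):
        | 
        |     import sqlite3
        |     from warcio.archiveiterator import ArchiveIterator
        | 
        |     con = sqlite3.connect('archive.sqlite')
        |     con.execute("""CREATE TABLE IF NOT EXISTS record
        |                    (warc_type TEXT, target_uri TEXT, payload BLOB)""")
        | 
        |     # Roughly one row per WARC record.
        |     with open('example.warc.gz', 'rb') as stream:
        |         for rec in ArchiveIterator(stream):
        |             uri = rec.rec_headers.get_header('WARC-Target-URI')
        |             con.execute("INSERT INTO record VALUES (?, ?, ?)",
        |                         (rec.rec_type, uri,
        |                          rec.content_stream().read()))
        |     con.commit()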
       | 
        | The schema I've come up with is tailored to my requirements, but
        | comment if you can spot any obvious pitfalls.
       | 
        | PS: I do believe that at some point .sqlite will become the de
        | facto standard for such initiatives. Sure, it's not text... but
        | it's pretty close.
        
         | marginalia_nu wrote:
          | > PS: I do believe that at some point .sqlite will become the
          | > de facto standard for such initiatives. Sure, it's not
          | > text... but it's pretty close.
         | 
         | What is the advantage of moving around .sqlite-files, over just
         | loading the (compressed) WARCs into sqlite databases when you
         | need them?
         | 
         | I've been messing around with different formats for my own
         | search engine crawls, and ended up with the conclusion that
         | WARC is a pretty amazing intermediary format that weighs both
         | the needs of the producer and consumer very well. I don't use
         | WARCs now, instead something similar, but I probably will
         | migrate toward that format eventually.
         | 
          | WARC's real selling point is that it's such an extremely
          | portable format.
        
           | uniqueuid wrote:
           | Anecdotal evidence, but I produced a medium-size crawl in the
           | past (~20TB compressed). I used distributed resources with
           | off-the-shelf libraries (i.e. warcprox etc.) and managed to
           | get corrupted data in some cases where neither the length-
           | delimited (i.e. offset + payload length) nor the newline-
            | delimited (triple newlines between records) framing was valid
            | any longer. It took me some time to build a repair tool for
            | that.
           | 
            | SQLite has an amazing set of well-understood and documented
            | guarantees on top of performance; there's a host of potential
            | validation tools to choose from, and you can even use
            | transactions etc. So that alone seems like a great idea.
           | 
           | What's more, you can potentially skip CDX files if you have
           | sqlite databases (or build your own meta sqlite database for
           | the others quickly).
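            | 
            | A small sketch of that "meta database" idea, assuming warcio
            | and a made-up table layout, essentially a CDX-style index
            | kept in SQLite:
            | 
            |     import sqlite3
            |     from warcio.archiveiterator import ArchiveIterator
            | 
            |     con = sqlite3.connect('index.sqlite')
            |     con.execute("CREATE TABLE IF NOT EXISTS cdx "
            |                 "(uri TEXT, filename TEXT, warc_offset INTEGER)")
            | 
            |     path = 'example.warc.gz'
            |     with open(path, 'rb') as stream:
            |         it = ArchiveIterator(stream)
            |         for rec in it:
            |             if rec.rec_type == 'response':
            |                 uri = rec.rec_headers.get_header('WARC-Target-URI')
            |                 con.execute("INSERT INTO cdx VALUES (?, ?, ?)",
            |                             (uri, path, it.get_record_offset()))
            |     con.commit()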
        
             | rengler33 wrote:
              | Is there a forum or somewhere that web crawler folks hang
              | out online? I'd love to learn about more sophisticated
              | projects like this.
        
               | uniqueuid wrote:
                | In GitHub issues of said projects, and at scientific web
               | archival conferences.
               | 
               | Although I'd absolutely welcome some sort of channel!
        
               | smcnally wrote:
                | GitHub Topics aggregate issues and activity across
                | projects. These topics are quite active, e.g.
               | 
               | https://github.com/topics/crawling
               | https://github.com/topics/web-scraping
               | https://github.com/topics/web-archiving
        
           | mynameismon wrote:
           | > What is the advantage of moving around .sqlite-files, over
           | just loading the (compressed) WARCs into sqlite databases
           | when you need them?
           | 
            | I suppose a case could be made for easy extension: you don't
            | need to change the entire spec to add another table to the
            | SQLite database, maybe containing other metadata.
           | 
            | > WARC's real selling point is that it's such an extremely
            | > portable format.
           | 
            | I mean, so is SQLite: it is also approved as a Library of
            | Congress archival format. (See SQLite Archive [0].)
           | 
           | [0]: https://www.sqlite.org/sqlar.html
        
           | fforflo wrote:
           | > What is the advantage of moving around .sqlite-files, over
           | just loading the (compressed) WARCs into sqlite databases
           | when you need them?
           | 
           | The .warc spec is ideal. I'm not saying we replace it (ref.
           | xkcd: standards).
           | 
            | On top of what uniqueuid said, "loading" is much slower and
           | more cumbersome than it sounds. I'm not saying SQLite will
           | replace text (maybe my aphorism sounded too firm). I'm saying
           | that maybe along with the .warc.gz archives at rest, one
           | could have .sql.gz files at rest as well.
           | 
            | In other words: why not have ACID-compliant archives moving
            | around?
        
             | marginalia_nu wrote:
              | Seems like the benefit of sqlite is the sort of use cases
              | where you maybe don't want to load everything you've
              | crawled into a search engine, but want to be able to
              | cherry-pick the data and retrieve specific documents for
              | further processing. Which is certainly a use case that
              | exists, and indeed not really what WARC is designed for.
        
               | traverseda wrote:
               | Great for most end user facing applications though.
        
               | marginalia_nu wrote:
               | End-user facing applications usually don't consume
               | website crawls, do they? That's impractical for many
               | reasons, the sheer size alone being perhaps the biggest
               | obstacle.
               | 
               | If you want to do something like have an offline copy of
               | a website, ZIM[1] is a far more suitable format as it's
               | extremely space-efficient and also fast.
               | 
               | [1] https://docs.fileformat.com/compression/zim/
        
         | nlohmann wrote:
          | Have you ever played with SQLite virtual tables
          | (https://sqlite.org/vtab.html)? They could allow you to provide
          | an SQLite interface while keeping the same structure on disk.
          | Though it requires a bit of work (implementing the interface
          | can be tedious), it would avoid the conversion in the first
          | place.
        
           | fforflo wrote:
           | Then again, do you need virtual tables? The .warc structure
           | won't change, so the tables won't change. But you can have
           | SQL views defined instead for common queries.
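            | 
            | For example, a sketch of such a view (table and column names
            | here are placeholders, not the actual WarcDB schema):
            | 
            |     import sqlite3
            | 
            |     con = sqlite3.connect('archive.warcdb')
            |     # A reusable view for a common query.
            |     con.execute("""
            |         CREATE VIEW IF NOT EXISTS html_responses AS
            |         SELECT target_uri, payload
            |         FROM response
            |         WHERE content_type LIKE 'text/html%'
            |     """)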
        
           | fforflo wrote:
            | Good point. Actually Common Crawl provides Parquet files for
            | their archives too.
            | 
            | And there's this virtual-table extension for Parquet:
            | https://github.com/cldellow/sqlite-parquet-vtable
            | 
            | But for my use case virtual tables would be too complicated.
        
       ___________________________________________________________________
       (page generated 2022-06-19 23:00 UTC)