[HN Gopher] Show HN: WarcDB: Web crawl data as SQLite databases ___________________________________________________________________ Show HN: WarcDB: Web crawl data as SQLite databases Author : fforflo Score : 124 points Date : 2022-06-19 13:26 UTC (9 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | jbverschoor wrote: | I like that you're logging responses instead of just the result / | payload. Reminds me of an idea I had of using something like a | queue as an intermediary between a webserver and the backend. I | don't exactly remember my reasoning anymore. | uniqueuid wrote: | Since this is pretty new, some background: | | WARC is a file format written and read by a rather small but | specialized set of web crawler tools, most notably the Internet | Archive's tooling. For example, its Java-based crawler Heritrix | produces WARC files. | | There are a couple of other very cool tools, such as warcprox, | which can create web archives from web browser activity by acting | as a proxy, and pywb, which plays them back to make archived | versions browsable, plus related libraries [1] (shoutout to Noah | Levitt, Ilya Kreymer and collaborators for building all this). | | The file format itself is an ISO standard, and it's very simple: | capture all HTTP headers sent and received, and all HTTP bodies | sent and received, and do simple de-duplication based on digests | of the body. | | There is a companion format, CDX, which builds indexes from WARC | files (which in turn are just concatenated records, so rather | robust). | | Although all of this is great, I worry a bit about where we're | heading with QUIC / UDP-based protocols, websockets and other | very involved protocols which ultimately make archival much | harder. | | If there's anything you can do to help these (or other) tools to | keep our web's archival records alive and flowing, please do so.
| | [1] https://github.com/internetarchive/warcprox | lijogdfljk wrote: | Would the WARC format reduce the effort needed to make Reader-like | programs? I.e. strip pages of HTML cruft, leaving you with text, | images, etc. - the content? | simonw wrote: | I've found the Readability.js library to be really good for | that - here's my recipe for running it as a CLI: | https://til.simonwillison.net/shot-scraper/readability | uniqueuid wrote: | Not at all. | | A WARC file gives you exactly what you would see on the wire, | or in the network inspector tab of your browser. It does | nothing to the content, and that's the point. | | The only thing you gain (and that's very important for other | reasons as well) is an immutable ground truth to work from | when creating the reader view of a given article. | lijogdfljk wrote: | Gotcha - yeah, I was hoping maybe it snapshotted the HTML or | some such, sidestepping issues with dynamic text or JS | shenanigans | fforflo wrote: | The web archiving community is surprisingly small and | fragmented (in terms of tools) given its impact. Thankfully the | .warc format looks pretty powerful and standard for the web we | have so far (which is a lot!). | | Now with the new protocols, dunno, maybe it's too soon to worry? | Then again, maybe it's an IPv4 / IPv6 analogy. | | > There is a companion format, CDX, which builds indexes from | warc files (which in turn are just concatenated records, so | rather robust). | | Good point. I'm planning on combining this fact with SQLite's | ATTACH option, which allows querying multiple database files [0] | | [0] https://www.sqlite.org/lang_attach.html | uniqueuid wrote: | Oh hi, thanks for building this! I haven't had the chance to | play with it, but my hunch is that SQLite for WARC can fill a | great niche and would be much more portable (and probably | performant). | | Allowing multiple DB files is a great idea, since that | fundamentally enables large archives, cold storage and so on.
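The ATTACH idea mentioned above can be sketched with plain sqlite3. The file names, table, and columns here are hypothetical stand-ins for illustration, not WarcDB's actual schema:

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, "crawl_a.db"), os.path.join(tmp, "crawl_b.db")]

# Two hypothetical per-crawl databases with an illustrative schema.
rows_by_file = {
    paths[0]: [("https://example.com/", 200)],
    paths[1]: [("https://example.org/", 404)],
}
for path, rows in rows_by_file.items():
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE response (url TEXT, status INTEGER)")
    con.executemany("INSERT INTO response VALUES (?, ?)", rows)
    con.commit()
    con.close()

# ATTACH lets a single connection query several database files at once.
con = sqlite3.connect(paths[0])
con.execute(f"ATTACH DATABASE '{paths[1]}' AS b")
merged = con.execute(
    "SELECT url, status FROM response"
    " UNION ALL SELECT url, status FROM b.response"
).fetchall()
```

Each attached file keeps its own schema; a query simply prefixes the attached name (`b.response`), which is what would make per-chunk archive databases composable.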
| [deleted] | ma2rten wrote: | Common Crawl is also in WARC format. | tepitoperrito wrote: | It'd be neat to extend the WARC format and tooling to support | cached HTTP responses for things like REST endpoints. Then you | could make sure everything you did in a session is recorded for | later use. | | From the specification it would appear fairly straightforward | once an approach was chosen... | | Here's the relevant extract from the WARC spec that informed my | difficulty estimate - "The WARC (Web ARChive) file format offers | a convention for concatenating multiple resource records (data | objects), each consisting of a set of simple text headers and an | arbitrary data block into one long file." | | Edit: Upon 2 minutes of reflection I think the way to go for what | I'm envisioning is some kind of browser session recording -> | replayable archive solution. | uniqueuid wrote: | Although this is a nice idea, it's extremely difficult to get | it completely right. | | Consider a SPA where navigation happens via XHR or similar | requests and updates are JSON that's patched into the DOM. Even | browsers have a hard time figuring out how to make this a | coherent session. | | Now with WARC, you get a single record per _transfer_ , i.e. | every JSON file, every image, every CSS file is an individual | record. It's completely up to the client/downstream tech to | reassemble this into a coherent page. | | If you want to go down that road, my best suggestion would be | to start with a browser's history - that's probably the most | solid version of a session that we have right now. | fforflo wrote: | I built this as a small utility within a larger project I'm | working on these days. (Contact me if you're curious or want to | support it.) | | The WARC format is extremely simple and yet so powerful. Most | importantly, though, there are already pebibytes of crawled | archives. | | This is a fairly straightforward mapping of a .warc file to a | .sqlite database.
The goal is to make such archives SQL-able even | in smaller pieces. | | The schema I've come up with is tailored to my | requirements, but comment if you can spot any obvious pitfalls. | | PS: I do believe that at some point .sqlite will become the | de facto standard for such initiatives. Sure, it's not text... but | it's pretty close. | marginalia_nu wrote: | > PS: I do believe that at some point .sqlite will become the | de facto standard for such initiatives. Sure, it's not text... | but it's pretty close. | | What is the advantage of moving around .sqlite files, over just | loading the (compressed) WARCs into SQLite databases when you | need them? | | I've been messing around with different formats for my own | search engine crawls, and ended up with the conclusion that | WARC is a pretty amazing intermediary format that weighs both | the needs of the producer and consumer very well. I don't use | WARCs now, instead something similar, but I probably will | migrate toward that format eventually. | | WARC's real selling point is that it's such an extremely | portable format. | uniqueuid wrote: | Anecdotal evidence, but I produced a medium-size crawl in the | past (~20TB compressed). I used distributed resources with | off-the-shelf libraries (i.e. warcprox etc.) and managed to | get corrupted data in some cases where neither the | length-delimited (i.e. offset + payload length) nor the | newline-delimited (triple newlines between records) framing was | valid any longer. Took me some time to build a repair tool for | that. | | SQLite has an amazing set of well-understood and documented | guarantees on top of performance, there's a host of potential | validation tools to choose from and you can even use | transactions etc. So that alone seems like a great idea. | | What's more, you can potentially skip CDX files if you have | SQLite databases (or build your own meta SQLite database for | the others quickly).
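The length-delimited framing described above, and the warc-to-SQLite mapping the project performs, can be illustrated with a toy sketch. The record, parser, and schema here are simplified inventions for demonstration; a real pipeline would use a proper WARC library such as warcio, and WarcDB's own schema differs:

```python
import sqlite3

# A tiny hand-rolled WARC-style record (real files come from crawlers
# such as Heritrix); the payload and URI are made up.
payload = b"hello, archive"
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://example.com/\r\n"
    b"Content-Length: " + str(len(payload)).encode() + b"\r\n"
    b"\r\n" + payload + b"\r\n\r\n"
)

def parse_records(data):
    """Yield (headers, body) pairs using length-delimited framing:
    headers end at a blank line, the body spans Content-Length bytes."""
    pos = 0
    while pos < len(data):
        head_end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:head_end].decode().split("\r\n")
        headers = dict(l.split(": ", 1) for l in lines[1:])  # skip WARC/1.0
        body_start = head_end + 4
        length = int(headers["Content-Length"])
        yield headers, data[body_start:body_start + length]
        pos = body_start + length + 4  # skip the record separator

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE record (uri TEXT, type TEXT, body BLOB)")
with db:  # one transaction per batch; rolled back on error
    db.executemany(
        "INSERT INTO record VALUES (?, ?, ?)",
        ((h["WARC-Target-URI"], h["WARC-Type"], b)
         for h, b in parse_records(record)),
    )
row = db.execute("SELECT uri, body FROM record").fetchone()
```

Wrapping the batch insert in a transaction is what buys the durability guarantees discussed above: a partial load rolls back instead of leaving a truncated archive.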
| rengler33 wrote: | Is there a forum or somewhere web crawlers hang out online? | I'd love to learn about more sophisticated projects like | this. | uniqueuid wrote: | In the GitHub issues of said projects, and at scientific web | archival conferences. | | Although I'd absolutely welcome some sort of channel! | smcnally wrote: | Topics include issues and activities across projects. | These topics are quite active, e.g. | | https://github.com/topics/crawling | https://github.com/topics/web-scraping | https://github.com/topics/web-archiving | mynameismon wrote: | > What is the advantage of moving around .sqlite-files, over | just loading the (compressed) WARCs into sqlite databases | when you need them? | | I suppose a case could be made for easy extensibility: you | don't need to change the entire spec to add another table to | the SQLite database, maybe containing other metadata. | | > WARC's real selling point is that it's such an extremely | portable format. | | I mean, so is SQLite: it is also approved as a LoC archival | method. (See SQLite Archive [0]) | | [0]: https://www.sqlite.org/sqlar.html | fforflo wrote: | > What is the advantage of moving around .sqlite-files, over | just loading the (compressed) WARCs into sqlite databases | when you need them? | | The .warc spec is ideal. I'm not saying we replace it (ref. | xkcd: standards). | | On top of what uniqueuid said, "loading" is much slower and | more cumbersome than it sounds. I'm not saying SQLite will | replace text (maybe my aphorism sounded too firm). I'm saying | that maybe along with the .warc.gz archives at rest, one | could have .sql.gz files at rest as well. | | In other words: why not move ACID-compliant archives around? | marginalia_nu wrote: | Seems like the benefit of SQLite is the sort of use cases | where you maybe don't want to load everything you've | crawled into a search engine, but want to be able to | cherry-pick the data and retrieve specific documents for | further processing.
Which is certainly a use case that | exists, and indeed not really what WARC is designed for. | traverseda wrote: | Great for most end-user facing applications though. | marginalia_nu wrote: | End-user facing applications usually don't consume | website crawls, do they? That's impractical for many | reasons, the sheer size alone being perhaps the biggest | obstacle. | | If you want to do something like have an offline copy of | a website, ZIM[1] is a far more suitable format as it's | extremely space-efficient and also fast. | | [1] https://docs.fileformat.com/compression/zim/ | nlohmann wrote: | Have you ever played with SQLite virtual tables | (https://sqlite.org/vtab.html)? They could provide an | SQLite interface while keeping the same structure on disk. | Though it requires a bit of work (implementing the interface | can be tedious), it can avoid the conversion in the first | place. | fforflo wrote: | Then again, do you need virtual tables? The .warc structure | won't change, so the tables won't change. But you can have | SQL views defined instead for common queries. | fforflo wrote: | Good point. Actually Common Crawl provides Parquet files for | their archives too. | | And there's this virtual table extension for Parquet: | https://github.com/cldellow/sqlite-parquet-vtable | | But for my use case virtual tables would be too complicated. ___________________________________________________________________ (page generated 2022-06-19 23:00 UTC)
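The views-instead-of-virtual-tables suggestion in the thread can be sketched like this; the table, columns, and view name are invented for illustration and are not WarcDB's actual schema:

```python
import sqlite3

# Illustrative stand-in for a warc-derived response table.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE response (uri TEXT, status INTEGER, content_type TEXT)"
)
db.executemany("INSERT INTO response VALUES (?, ?, ?)", [
    ("https://example.com/", 200, "text/html"),
    ("https://example.com/missing", 404, "text/html"),
    ("https://example.com/logo.png", 200, "image/png"),
])

# Because the underlying schema is stable, a view can package a common
# query once and expose it like a table, with no virtual-table code.
db.execute("""
    CREATE VIEW html_ok AS
    SELECT uri FROM response
    WHERE status = 200 AND content_type = 'text/html'
""")
uris = [r[0] for r in db.execute("SELECT uri FROM html_ok")]
```

A view costs nothing at rest (it is just stored SQL) and travels with the database file, which fits the "archives at rest" use case discussed above.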