[HN Gopher] Craziest thing I ever used SQLite for: partial file ...
___________________________________________________________________

Craziest thing I ever used SQLite for: partial file deduplication

Author : ics
Score  : 194 points
Date   : 2023-03-26 18:00 UTC (4 hours ago)

(HTM) web link (sqlite.org)
(TXT) w3m dump (sqlite.org)

| hultner wrote:
| Very cool and probably a fun exercise, but I would probably put
| the data on a ZFS volume with dedupe instead, which from reading
| this implementation is pretty much the same thing to my layman's
| eye. I could also add compression to the same dataset.
| traceroute66 wrote:
| > I would probably put the data on a ZFS volume with dedupe
| > instead
|
| But doesn't ZFS dedupe eat RAM for breakfast? Double-digit GB of
| RAM per TB of data, IIRC?
| giantrobot wrote:
| It's usually along the lines of 1GB per TB. Some factors can
| affect that number but I've found it pretty accurate. Note
| that's 1GB of RAM that ends up wired to ZFS and never usable by
| the rest of the system.
| tripleo1 wrote:
| ZFS is fun but it locks up for unexplained reasons sometimes.
| [deleted]
| rurban wrote:
| Using a proper hash table with a bloom filter would save you the
| useless pass through a b-tree though. Much faster and much less
| memory.
| pornel wrote:
| With a b-tree you can do even better. Instead of hashing entire
| files ahead of time, you can avoid hashing more than the minimum
| prefix required to conclude a file is unique, by building the
| b-tree lazily. I don't think it can be done with sqlite though,
| as it requires hashing the files while traversing the b-tree.
|
| https://lib.rs/crates/dupe-krill#readme-nerding-out-about-th...
| [deleted]
| JackSlateur wrote:
| Looks like https://github.com/markfasheh/duperemove
| philsnow wrote:
| Oh, hey Mark
|
| (Not (purely) a reference to the "oh hey Mark" meme, I'm vaguely
| acquainted with Mark from LUG days)
| [deleted]
| evmar wrote:
| Coincidentally, I just saw that the "USENIX Test of Time Award"
| for 2023 went to an analysis of real-world files for
| deduplication purposes. I found Figure 1 particularly
| interesting in that partial deduplication didn't save much over
| whole-file deduplication in practice.
|
| https://www.usenix.org/conferences/test-of-time-awards
| emmelaich wrote:
| You'd expect that in general, but for developers updating
| content, partial deduplication would often win.
| anyfoo wrote:
| Oh sweet. I've definitely used sqlite (in a zsh script) for
| _file_ deduplication. Very simple, mostly just rows consisting
| of paths and file content hashes.
|
| But _partial_ file deduplication is something else...
| [deleted]
| MithrilTuxedo wrote:
| This sounds like it could be bolted onto/into rsync on the
| server side to present filesystems larger than the server can
| actually store.
| fbdab103 wrote:
| Does anyone else have any other unorthodox use cases? I love
| SQLite, and am always happy to ram this square peg into a round
| hole.
| mastax wrote:
| That's really cool. There are a bunch of tools that will
| deduplicate files by replacing them with symlinks or hard links,
| but being able to do block-level dedupes simply by leaning on
| the filesystem is nice.
|
| It sometimes feels like games are made to thwart this type of
| thing. They often use packfiles, basically filesystems within
| files optimized to look up assets quickly. They also perhaps
| allowed an optimized data layout back when consoles had slow
| spinning hard drives.
| The upshot is that a tiny patch inserting a line of code in a
| script may shift hundreds of megabytes of other data in the
| packfiles, causing the block hashes to no longer match up. Do
| any filesystems model inserts in some way? I'm pretty sure Steam
| updates can handle situations like that. I frequently see
| updates which download a tiny amount (kilobytes) but write a
| huge amount to disk (gigabytes), and I can't think of any other
| cause. (Assuming developers aren't using hilariously
| un-compressed assets.)
| mikepurvis wrote:
| Yes, Steamworks encourages developers to align assets to certain
| boundaries to help ensure efficient patching later; see the
| documentation here:
| https://partner.steamgames.com/doc/sdk/uploading
| mastax wrote:
| Interesting, it looks like Steam doesn't have any special
| handling for inserts/offsets and just handles 1MB blocks. The
| extra disk IO probably just comes from rewriting large files:
|
| > Next - to update the pack file on a client device, SteamPipe
| > builds the new version alongside the old version. When all new
| > files are built, it then "commits" the update by deleting old
| > files and moving the new files in. What this means is that to
| > update a 25 GB pack file, SteamPipe will always build a new,
| > 25GB file. If the update only requires 10 bytes of changes to
| > that file, SteamPipe will need to copy almost the entire 25GB
| > from the old file to the new file. Depending on the client
| > storage hardware, this can be a very slow process.
| bob1029 wrote:
| I think SQLite + WAL could be a compelling solution for the game
| patching scenario.
|
| There are already SQLite clustering technologies that utilize
| WAL frame shipping in order to replicate data. I don't see why a
| Steam patch couldn't just be a range of WAL frames to be applied
| to a client's database.
| beecafe wrote:
| What about using content-defined chunking? Then inserting a few
| lines shouldn't cause all the subsequent chunks to shift.
| viraptor wrote:
| That's not quite right. The contents of the file would still
| shift. You'd just be able to still deduplicate most chunks
| between multiple files... but only in your own copies, because
| the filesystem still uses constant-size chunks.
| riceart wrote:
| But I think they're talking about doing content-defined chunking
| of a pack or archive file - inserting data should only affect
| the chunk(s) containing the insertion, plus one. And since the
| chunking is content-defined, those chunks are necessarily not
| constant size.
|
| I.e. this is how backup tools like Arq or duplicacy work.
| panzi wrote:
| Some games just add an update pak that overrides the changed
| files, without changing the original pak file. So the
| installation size of such games might grow a lot on updates,
| wasting a lot of space by keeping the old files, but the files
| are immutable at least and rollbacks of updates are
| theoretically possible.
| sgtnoodle wrote:
| That's neat! In retrospect, did the database make the problem
| more tractable from an algorithmic or practical standpoint, or
| was it mostly just using the tools you're familiar with? If I
| were to approach the same problem, I likely would have tried to
| keep all the data in RAM and serialize it out to a file for
| incremental runs.
| nonethewiser wrote:
| I have used SQLite to deduplicate CSVs when they couldn't all
| fit into memory.
|
| I imagine Dask would be a better tool for this job, but I wasn't
| familiar with it at the time, and SQLite let me deduplicate
| using Python with no external dependencies.
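
A minimal sketch of what that kind of SQLite-backed, larger-than-memory
de-duplication can look like. The file names, the single-table schema,
and the choice to hash each row are illustrative assumptions, not the
commenter's actual code; the point is that the "already seen" set lives
in an on-disk index rather than in RAM:

    # Stream a CSV, writing each row only the first time it is seen.
    # SQLite's unique index tracks seen rows on disk, so memory use
    # stays small even for inputs that don't fit in RAM.
    import csv
    import hashlib
    import sqlite3

    con = sqlite3.connect("dedupe.db")   # a real file, not :memory:
    con.execute("CREATE TABLE IF NOT EXISTS seen (row_hash BLOB PRIMARY KEY)")

    with open("input.csv", newline="") as src, \
         open("deduped.csv", "w", newline="") as dst:
        reader, writer = csv.reader(src), csv.writer(dst)
        for row in reader:
            # Hash the whole row; duplicates collide on the primary key.
            key = hashlib.sha256("\x1f".join(row).encode()).digest()
            cur = con.execute("INSERT OR IGNORE INTO seen VALUES (?)", (key,))
            if cur.rowcount:             # 1 = first time this row appeared
                writer.writerow(row)
    con.commit()

The same shape scales up to the article's use case by keying on
per-block hashes instead of whole rows.
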
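Going back to the content-defined chunking subthread above (beecafe,
viraptor, riceart): a toy sketch of the idea, assuming a simple
polynomial rolling hash, not how any particular tool implements it.
A chunk boundary is declared wherever the hash of the trailing window
of bytes matches a bit pattern, so boundaries depend on content rather
than absolute offsets, and a small insertion only disturbs the chunk(s)
around the edit while fixed-size blocks all shift and stop matching:

    # Toy content-defined chunking vs. fixed-size blocks.
    # Real tools use stronger rolling hashes (buzhash, Gear) and
    # enforce minimum/maximum chunk sizes.
    import hashlib
    import random

    WINDOW = 48                  # bytes covered by the rolling hash
    B, M = 257, (1 << 31) - 1    # polynomial base and modulus
    MASK = (1 << 11) - 1         # cut when low 11 bits set (~2 KiB chunks)

    def cdc_chunks(data):
        h, start, bw = 0, 0, pow(B, WINDOW, M)
        for i, byte in enumerate(data):
            h = (h * B + byte) % M
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * bw) % M  # drop byte leaving window
            if i >= WINDOW - 1 and (h & MASK) == MASK:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    def digests(chunks):
        return {hashlib.sha256(c).digest() for c in chunks}

    def fixed_blocks(data, size=4096):
        return (data[i:i + size] for i in range(0, len(data), size))

    rng = random.Random(0)
    a = bytes(rng.getrandbits(8) for _ in range(1 << 20))  # 1 MiB "pack file"
    b = a[:100_000] + b"tiny patch" + a[100_000:]          # 10-byte insertion

    cdc_a, cdc_b = digests(cdc_chunks(a)), digests(cdc_chunks(b))
    fix_a, fix_b = digests(fixed_blocks(a)), digests(fixed_blocks(b))
    print(f"content-defined: {len(cdc_a & cdc_b)}/{len(cdc_a)} chunks shared")
    print(f"fixed 4 KiB:     {len(fix_a & fix_b)}/{len(fix_a)} blocks shared")

Only the chunks around the insertion change, which is the property the
backup tools mentioned above rely on; fixed, offset-aligned blocks lose
every match downstream of the edit.
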
| qsort wrote:
| IME a database does not algorithmically improve anything, but it
| does put at your disposal some very efficient data structures
| that are definitely nontrivial to replicate. This is especially
| the case if you have to work with datasets that don't fit in
| memory!
|
| An added bonus is that you can use SQL.
| zerkten wrote:
| A friend works for a financial company that gets applicants from
| .NET and Linux backgrounds. He observed that the .NET devs
| almost always thought about solving these problems with a
| database and SQL, whereas the Linux people would solve them with
| pipelines of commands.
|
| Both were valid in his environment, since they were a big Oracle
| and SQL Server shop while also having high Linux usage. I must
| ask him if PowerShell has changed things at all.
| theamk wrote:
| Definitely the latter.
|
| There are many other products which implement "keep track of
| partial file contents to deduplicate", for example casync, borg,
| and restic.
|
| None of them use sqlite for metadata storage, which kinda makes
| sense - in the author's schema, each block uses a (fileid,
| block_index, hash) tuple, which is 2/3 wasted space, as fileids
| are mostly the same (if files are big) and block_index is always
| increasing. A custom data structure would be both faster and
| more efficient.
| [deleted]
| maxyurk wrote:
| https://en.m.wikipedia.org/wiki/Content-addressable_storage
| groestl wrote:
| Also relevant in the context of partial file deduplication (to
| get robustness w.r.t. inserts and deletions):
| https://en.wikipedia.org/wiki/Rolling_hash
| megous wrote:
| I used a similar trick to stuff 14 Linux distros onto a 6 GiB SD
| card image using the btrfs filesystem. The distros share a lot
| of common files, so this works well to save space. (btrfs also
| supports the same CoW data-block-sharing feature as APFS.)
|
| https://xnux.eu/p-boot-demo/
|
| I had to write a custom tool to do it efficiently during image
| creation.
| itsthecourier wrote:
| You, savage. Kudos.
| tripleo1 wrote:
| Slightly off topic: there was a project I saw on GitHub that
| claimed to support system administration using relational
| tables. Something like "everything in SQL". I thought it might
| be a cool idea.
| coppsilgold wrote:
| osquery <https://github.com/osquery/osquery>
| leetbulb wrote:
| "There is no file path, because macOS lets you look up files by
| id. The hash values are cryptographic hashes truncated to 64
| bits and reinterpreted as integers."
|
| Is the author implying that APFS or HFS uses this method to
| calculate the file ID? I am unable to find any information
| regarding this. From what I understand, w.r.t. APFS, the file ID
| is a combination of the inode OID and genID.
| cerved wrote:
| I assumed they are talking about inodes.
| [deleted]
___________________________________________________________________
(page generated 2023-03-26 23:00 UTC)