[HN Gopher] Craziest thing I ever used SQLite for: partial file ...
___________________________________________________________________

Craziest thing I ever used SQLite for: partial file deduplication

Author : ics
Score  : 194 points
Date   : 2023-03-26 18:00 UTC (4 hours ago)

(HTM) web link (sqlite.org)
(TXT) w3m dump (sqlite.org)

| hultner wrote:
| Very cool and probably a fun exercise, but I would probably put
| the data on a ZFS volume with dedupe instead, which from reading
| this implementation is pretty much the same thing to my layman's
| eye. I could also add compression to the same dataset.
| traceroute66 wrote:
| > I would probably put the data on a ZFS volume with dedupe
| > instead
|
| But doesn't ZFS dedupe eat RAM for breakfast? Double-digit GB of
| RAM per TB of data, IIRC?
| giantrobot wrote:
| It's usually along the lines of 1GB per TB. Some factors can
| affect that number but I've found it pretty accurate. Note
| that's 1GB of RAM that ends up wired to ZFS and never usable by
| the rest of the system.
| tripleo1 wrote:
| ZFS is fun but it locks up for unexplained reasons sometimes.
| [deleted]
| rurban wrote:
| Using a proper hash table with a bloom filter would save you the
| useless pass through a b-tree though. Much faster and much less
| memory.
| pornel wrote:
| With a b-tree you can do even better. Instead of hashing entire
| files ahead of time, you can avoid hashing more than the minimum
| prefix required to conclude a file is unique, by building the
| b-tree lazily. I don't think it can be done with sqlite though,
| as it requires hashing the files while traversing the b-tree.
|
| https://lib.rs/crates/dupe-krill#readme-nerding-out-about-th...
| [deleted]
| JackSlateur wrote:
| Looks like https://github.com/markfasheh/duperemove
| philsnow wrote:
| Oh, hey Mark
|
| (Not (purely) a reference to the "oh hey Mark" meme, I'm vaguely
| acquainted with Mark from LUG days)
| [deleted]
| evmar wrote:
| Coincidentally, I just saw that the "USENIX Test of Time Award"
| for 2023 went to an analysis of real-world files for
| deduplication purposes. I found Figure 1 particularly
| interesting in that partial deduplication didn't save much over
| whole-file deduplication in practice.
|
| https://www.usenix.org/conferences/test-of-time-awards
| emmelaich wrote:
| You'd expect that in general, but for developers updating
| content, partial deduplication would often win.
| anyfoo wrote:
| Oh sweet. I've definitely used sqlite (in a zsh script) for
| _file_ deduplication. Very simple, mostly just rows consisting
| of paths and file content hashes.
|
| But _partial_ file deduplication is something else...
| [deleted]
| MithrilTuxedo wrote:
| This sounds like it could be bolted onto/into rsync on the
| server side to present filesystems larger than the server can
| actually store.
| fbdab103 wrote:
| Does anyone else have any other unorthodox use cases? I love
| SQLite, and am always happy to ram this square peg into a round
| hole.
| mastax wrote:
| That's really cool. There are a bunch of tools that will
| deduplicate files by replacing them with symlinks or hard links,
| but being able to do block-level dedupes simply by leaning on
| the filesystem is nice.
|
| It sometimes feels like games are made to thwart this type of
| thing. They often use packfiles, basically filesystems within
| files optimized to look up assets quickly. They also perhaps
| allowed an optimized data layout back when consoles had slow
| spinning hard drives.
| The upshot is that a tiny patch inserting a line of code in a
| script may shift hundreds of megabytes of other data in the
| packfiles, causing the block hashes to no longer match up. Do
| any filesystems model inserts in some way? I'm pretty sure Steam
| updates can handle situations like that. I frequently see
| updates which download a tiny amount (kilobytes) but write a
| huge amount to disk (gigabytes), and I can't think of any other
| cause. (Assuming developers aren't using hilariously
| un-compressed assets.)
| mikepurvis wrote:
| Yes, Steamworks encourages developers to align assets to certain
| boundaries to help ensure efficient patching later; see the
| documentation here:
| https://partner.steamgames.com/doc/sdk/uploading
| mastax wrote:
| Interesting, it looks like Steam doesn't have any special
| handling for inserts/offsets and just handles 1MB blocks. The
| extra disk IO probably just comes from rewriting large files:
|
| > Next - to update the pack file on a client device, SteamPipe
| > builds the new version alongside the old version. When all new
| > files are built, it then "commits" the update by deleting old
| > files and moving the new files in. What this means is that to
| > update a 25 GB pack file, SteamPipe will always build a new,
| > 25GB file. If the update only requires 10 bytes of changes to
| > that file, SteamPipe will need to copy almost the entire 25GB
| > from the old file to the new file. Depending on the client
| > storage hardware, this can be a very slow process.
| bob1029 wrote:
| I think SQLite + WAL could be a compelling solution for the game
| patching scenario.
|
| There are already SQLite clustering technologies that utilize
| WAL frame shipping in order to replicate data. I don't see why a
| Steam patch couldn't just be a range of WAL frames to be applied
| to a client's database.
| beecafe wrote:
| What about using content-defined chunking? Then inserting a few
| lines shouldn't cause all the subsequent chunks to shift.
| viraptor wrote:
| That's not quite right. The contents of the file would still
| shift. You'd just be able to still deduplicate most chunks
| between multiple files... but only in your own copies, because
| the filesystem still uses constant-size chunks.
| riceart wrote:
| But I think they're talking about doing content-defined chunking
| of a pack or archive file - inserting data should only affect
| the chunk(s) containing the insertion, plus one. And since the
| chunking is content-defined, those chunks are necessarily not
| constant size.
|
| I.e. this is how backup tools like Arq or duplicacy work.
| panzi wrote:
| Some games just add an update pak that overrides the changed
| files, without changing the original pak file. So the
| installation size of such games might grow a lot on updates,
| wasting a lot of space by keeping the old files, but the files
| are immutable at least and rollbacks of updates are
| theoretically possible.
| sgtnoodle wrote:
| That's neat! In retrospect, did the database make the problem
| more tractable from an algorithmic or practical standpoint, or
| was it mostly just using the tools you're familiar with? If I
| were to approach the same problem, I likely would have tried to
| keep all the data in RAM and serialize it out to a file for
| incremental runs.
| nonethewiser wrote:
| I have used SQLite to deduplicate CSVs when they couldn't all
| fit into memory.
|
| I imagine Dask would be a better tool for this job, but I wasn't
| familiar with it at the time, and SQLite let me deduplicate
| using Python with no external dependencies.
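
A minimal sketch of what that kind of SQLite-backed, larger-than-memory
de-duplication can look like. The file names, the single-table schema,
and the choice to hash each row are illustrative assumptions, not the
commenter's actual code; the point is that the "already seen" set lives
in an on-disk index rather than in RAM:

    # Stream a CSV, writing each row only the first time it is seen.
    # SQLite's unique index tracks seen rows on disk, so memory use
    # stays small even for inputs that don't fit in RAM.
    import csv
    import hashlib
    import sqlite3

    con = sqlite3.connect("dedupe.db")   # a real file, not :memory:
    con.execute("CREATE TABLE IF NOT EXISTS seen (row_hash BLOB PRIMARY KEY)")

    with open("input.csv", newline="") as src, \
         open("deduped.csv", "w", newline="") as dst:
        reader, writer = csv.reader(src), csv.writer(dst)
        for row in reader:
            # Hash the whole row; duplicates collide on the primary key.
            key = hashlib.sha256("\x1f".join(row).encode()).digest()
            cur = con.execute("INSERT OR IGNORE INTO seen VALUES (?)", (key,))
            if cur.rowcount:             # 1 = first time this row appeared
                writer.writerow(row)
    con.commit()

The same shape scales up to the article's use case by keying on
per-block hashes instead of whole rows.
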
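Going back to the content-defined chunking subthread above (beecafe,
viraptor, riceart): a toy sketch of the idea, assuming a simple
polynomial rolling hash, not how any particular tool implements it.
A chunk boundary is declared wherever the hash of the trailing window
of bytes matches a bit pattern, so boundaries depend on content rather
than absolute offsets, and a small insertion only disturbs the chunk(s)
around the edit while fixed-size blocks all shift and stop matching:

    # Toy content-defined chunking vs. fixed-size blocks.
    # Real tools use stronger rolling hashes (buzhash, Gear) and
    # enforce minimum/maximum chunk sizes.
    import hashlib
    import random

    WINDOW = 48                  # bytes covered by the rolling hash
    B, M = 257, (1 << 31) - 1    # polynomial base and modulus
    MASK = (1 << 11) - 1         # cut when low 11 bits set (~2 KiB chunks)

    def cdc_chunks(data):
        h, start, bw = 0, 0, pow(B, WINDOW, M)
        for i, byte in enumerate(data):
            h = (h * B + byte) % M
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * bw) % M  # drop byte leaving window
            if i >= WINDOW - 1 and (h & MASK) == MASK:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    def digests(chunks):
        return {hashlib.sha256(c).digest() for c in chunks}

    def fixed_blocks(data, size=4096):
        return (data[i:i + size] for i in range(0, len(data), size))

    rng = random.Random(0)
    a = bytes(rng.getrandbits(8) for _ in range(1 << 20))  # 1 MiB "pack file"
    b = a[:100_000] + b"tiny patch" + a[100_000:]          # 10-byte insertion

    cdc_a, cdc_b = digests(cdc_chunks(a)), digests(cdc_chunks(b))
    fix_a, fix_b = digests(fixed_blocks(a)), digests(fixed_blocks(b))
    print(f"content-defined: {len(cdc_a & cdc_b)}/{len(cdc_a)} chunks shared")
    print(f"fixed 4 KiB:     {len(fix_a & fix_b)}/{len(fix_a)} blocks shared")

Only the chunks around the insertion change, which is the property the
backup tools mentioned above rely on; fixed, offset-aligned blocks lose
every match downstream of the edit.
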
| qsort wrote:
| IME a database does not algorithmically improve anything, but it
| does put at your disposal some very efficient data structures
| that are definitely nontrivial to replicate. This is especially
| the case if you have to work with datasets that don't fit in
| memory!
|
| An added bonus is that you can use SQL.
| zerkten wrote:
| A friend works for a financial company that gets applicants from
| .NET and Linux backgrounds. He observed that the .NET devs
| almost always thought about solving these problems with a
| database and SQL, whereas the Linux people would solve them with
| pipelines of commands.
|
| Both were valid in his environment, since they were a big Oracle
| and SQL Server shop while also having high Linux usage. I must
| ask him if PowerShell has changed things at all.
| theamk wrote:
| Definitely the latter.
|
| There are many other products which implement "keep track of
| partial file contents to deduplicate", for example casync, borg,
| and restic.
|
| None of them use sqlite for metadata storage, which kinda makes
| sense - in the author's schema, each block uses a (fileid,
| block_index, hash) tuple, which is 2/3 wasted space, as fileids
| are mostly the same (if files are big) and block_index is always
| increasing. A custom data structure would be both faster and
| more efficient.
| [deleted]
| maxyurk wrote:
| https://en.m.wikipedia.org/wiki/Content-addressable_storage
| groestl wrote:
| Also relevant in the context of partial file deduplication (to
| get robustness w.r.t. inserts and deletions):
| https://en.wikipedia.org/wiki/Rolling_hash
| megous wrote:
| I used a similar trick to stuff 14 Linux distros onto a 6 GiB SD
| card image using the btrfs filesystem. The distros share a lot
| of common files, so this works well to save space. (btrfs also
| supports the same CoW data-block-sharing feature as APFS.)
|
| https://xnux.eu/p-boot-demo/
|
| I had to write a custom tool to do it efficiently during image
| creation.
| itsthecourier wrote:
| You, savage. Kudos.
| tripleo1 wrote:
| Slightly off topic: there was a project I saw on GitHub that
| claimed to support system administration using relational
| tables. Something like "everything in SQL". I thought it might
| be a cool idea.
| coppsilgold wrote:
| osquery <https://github.com/osquery/osquery>
| leetbulb wrote:
| "There is no file path, because macOS lets you look up files by
| id. The hash values are cryptographic hashes truncated to 64
| bits and reinterpreted as integers."
|
| Is the author implying that APFS or HFS uses this method to
| calculate the file ID? I am unable to find any information
| regarding this. From what I understand, w.r.t. APFS, the file ID
| is a combination of the inode OID and genID.
| cerved wrote:
| I assumed they are talking about inodes.
| [deleted]
___________________________________________________________________
(page generated 2023-03-26 23:00 UTC)