[HN Gopher] CDC File Transfer
       ___________________________________________________________________
        
       CDC File Transfer
        
       Author : cglong
       Score  : 131 points
       Date   : 2023-01-08 21:46 UTC (1 hour ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | teraflop wrote:
       | TL;DR they rediscovered the context-dependent variable block size
       | technique that Tarsnap uses:
       | https://www.tarsnap.com/download/EuroBSDCon13.pdf
        
         | crazygringo wrote:
          | They don't claim to have invented CDC or FastCDC; they just
          | made, and are sharing, a useful implementation of it.
         | 
         | And if that Tarsnap presentation is from 2013, and FastCDC was
         | published in 2016 [1] according to Wikipedia [2], then
         | presumably Tarsnap didn't invent FastCDC either.
         | 
         | [1]
         | https://www.usenix.org/system/files/conference/atc16/atc16-p...
         | 
         | [2]
         | https://en.wikipedia.org/wiki/Rolling_hash#Gear_fingerprint_...
        
           | jyxent wrote:
           | Another well-cited predecessor is "A low-bandwidth network
           | file system."
           | (https://dl.acm.org/doi/abs/10.1145/502034.502052), which was
           | published in 2001. It uses Rabin fingerprinting to define
           | chunk boundaries.
        
         | kccqzy wrote:
          | The Tarsnap technique is interesting, but that presentation
          | is a bit hard to follow. What are example values for alpha
          | and p?
        
       | elromulous wrote:
        | I took a quick glance at the repo; it wasn't totally clear to
        | me whether the client is Windows-only. Anyone know more?
        
         | Moissanite wrote:
         | I thought the same; seems to be perfectly fine as a Linux-to-
         | Linux tool.
        
       | m000 wrote:
        | It would be interesting to also try this on more underpowered
        | systems, e.g. aging armv5-based NAS boxes. Would we see
        | similar performance improvements, or would any gains be
        | hampered by poor I/O performance and architectural
        | inefficiencies?
        
       | zX41ZdbW wrote:
       | Content Defined Chunking is one of my favorite algorithms because
       | it has some "magic" similar to HyperLogLogs, Bloom filters,
       | etc... This algorithm is good to explain to people, to get them
       | inspired by computer science. I usually explain the simplest
       | variant with rolling hashes.
       | 
        | It would be interesting to see what the result would be (the
        | average savings from deduplication) if it were applied
        | globally to a large-scale blob storage service such as Amazon
        | S3 or Google Drive (we would need metadata storage about the
        | chunks, and the chunks could then be deduplicated).
       | 
       | PS. I don't use this algorithm in ClickHouse, but it always
       | remains tempting.
        
         | piceas wrote:
          | Implementing FastCDC (2016) is fun.
         | 
         | Do you have a suggestion on what to read on the topic since
         | then?
         | 
         | I don't keep up with these things. A quick search came up with
         | the following but I haven't read it yet.
         | 
         | Fan Ni and Song Jiang, "RapidCDC: Leveraging Duplicate Locality
         | to Accelerate Chunking in CDC-based Deduplication Systems", in
         | Proceedings of 2019 ACM Symposium on Cloud Computing (ACM
         | SoCC'19), Santa Cruz, CA, November, 2019.
         | 
         | https://ranger.uta.edu/~sjiang/index.html
        
           | windowsworkstoo wrote:
            | Microsoft has this for DFS-R and the spec is open:
            | https://learn.microsoft.com/en-
            | us/openspecs/windows_protocol... - it's pretty
            | straightforward to implement.
        
       | sigmonsays wrote:
       | can we gut rsync and implement this?
       | 
       | rsync doesn't even have a great protocol specification... maybe
       | it's a new tool...
        
         | progbits wrote:
         | > Overall, cdc_rsync syncs files about 3 times faster than
         | Cygwin rsync.
         | 
         | Sounds promising.
        
         | mjevans wrote:
         | I'd like additional info about the Content Defined Chunking
         | part of the specification.
         | 
          | How is the content defined? Where I'd try to begin is with
          | a single pass that looks for runs of null ('\0') bytes,
          | even runs of length one, as potential boundary ends. During
          | that pass, also look for 'magic signatures' of known stream
          | types, such as already-compressed content streams (all the
          | more reason not to try compressing them again). The CDC
          | might also be aware of some file structures (zip, 7z, tar,
          | etc.) and have a dedicated segment-creation algorithm for
          | them. At a low level, the two ends should exchange a list
          | of segment offsets, lengths, checksums (partial?), and
          | maybe some short fragments of bytes to check (e.g. 4-byte
          | chunks at various powers-of-2 offsets or at major chunk
          | starts). Where the two ends differ in which chunks exist,
          | they might also expend some minor additional effort to
          | check whether the chunks identified on the other side exist
          | locally, in case the two versions are using different
          | filters or happened to reach different conclusions.
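(One way to sketch the segment-list exchange imagined above. Everything here is hypothetical: the function names, the use of SHA-256, and the manifest layout are illustrative choices, not part of any existing protocol.)

```python
# Hypothetical manifest exchange: each side describes its chunks as
# (offset, length, hash) triples; the receiver then requests only the
# chunks whose hashes it does not already hold locally.
import hashlib

def manifest(chunks):
    """Build (offset, length, sha256) triples for a list of chunk payloads."""
    entries, offset = [], 0
    for payload in chunks:
        entries.append((offset, len(payload),
                        hashlib.sha256(payload).hexdigest()))
        offset += len(payload)
    return entries

def missing(local_chunks, remote_manifest):
    """Return remote manifest entries whose content is absent locally."""
    have = {hashlib.sha256(c).hexdigest() for c in local_chunks}
    return [entry for entry in remote_manifest if entry[2] not in have]
```

Only the manifest crosses the wire; chunk payloads are fetched for the `missing` entries alone, which is where the bandwidth savings come from.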
        
       | 5440 wrote:
       | I'm currently working on a required CDC (Center for Disease
       | Control) reporting function for a COVID test. For a second I
       | thought this article was going to be extremely helpful.
        
       | brendoncarroll wrote:
       | FastCDC is the same chunking algorithm used in Got.
       | 
       | https://github.com/gotvc/got
        
         | wahern wrote:
          | Well, that's confusing--there's another Git follow-on
          | project from some OpenBSD developers, also called Got:
          | http://gameoftrees.org/
         | 
         | They both seem like very cool projects, so may the best Got
         | win!
        
         | jws wrote:
          | To elaborate: rsync chunks at fixed sizes, so inserting or
          | deleting a few bytes changes every chunk from that point
          | onward.
          | 
          | If instead you chunk based on local content (conceptually
          | like chunking text into sentences at periods, but as a
          | binary operation with upper and lower size limits; I
          | couldn't find the exact algorithm specification), then
          | shortly after an insertion or deletion you start getting
          | the same chunks as before.
         | 
         | This drastically reduces the cost of identifying unmodified
         | chunks.
        
           | brendoncarroll wrote:
           | I don't think rsync uses fixed sized chunks. The algorithm is
           | described here; it's a rolling hash.
           | 
           | https://rsync.samba.org/tech_report/node3.html
           | 
           | Your description of content defined chunking is exactly right
           | though. There are a number of techniques for doing it.
           | FastCDC is one of them, although not the one used in rsync.
           | 
           | https://en.wikipedia.org/wiki/Rolling_hash
           | 
            | EDIT: Corrected in the comments below. Fixed-size chunks,
            | searched for at any offset with a rolling hash. The rsync
            | algorithm description is here.
           | 
           | https://rsync.samba.org/tech_report/node2.html
        
             | teraflop wrote:
              | rsync does use fixed-size chunks, but the rolling hash
              | allows them to be identified even at offsets that
              | aren't multiples of the chunk size.
             | 
             | So a change partway through the file doesn't force rsync to
             | actually re-transfer all of the subsequent unmodified
             | chunks, but it does incur a computational cost to find them
             | since it has to search through all possible offsets.
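(The weak rolling checksum from the rsync tech report linked above can be sketched as follows. The update rule mirrors the published formulas, but this is an illustration rather than rsync's actual code, and the modulus is just the conventional choice.)

```python
# rsync-style weak rolling checksum: a fixed-size block hash that can
# slide forward one byte in O(1), enabling matches at any byte offset.
M = 1 << 16  # modulus from the rsync tech report

def weak_sum(block: bytes):
    """Compute the (a, b) checksum pair for a block from scratch."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, blocksize):
    """Slide the window one byte: drop `out_byte`, add `in_byte`."""
    a = (a - out_byte + in_byte) % M
    b = (b - blocksize * out_byte + a) % M
    return a, b
```

Sliding the window costs O(1) per byte, which is what lets rsync test a candidate block at every offset; only on a weak-sum match does it compute the expensive strong checksum.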
        
       | Waterluvian wrote:
       | Maybe I'm projecting (probably) but I detect some subtext in the
       | README's language. Like a "developers who are irate that Stadia
       | was shut down, wanting to free anything they can for the
       | community" type feeling.
        
         | elromulous wrote:
         | Google / abc has a solid track record of open sourcing shutdown
         | projects. See Makani[1] and iirc also Loon.
         | 
         | [1] https://news.ycombinator.com/item?id=24456613
        
           | Waterluvian wrote:
            | Aha! I wasn't at all aware of that. Okay, this reads more
            | like that to me the second time around. Just excitement
            | to open source a cool piece of tech that came from
            | Stadia.
        
       | [deleted]
        
       | encryptluks2 wrote:
        | This is great for syncing, but what about separating
        | Courgette from Chromium so that we can finally have a decent
        | delta-diff program? I'm tired of the Windows community
        | relying on SmartVersion to create ISO diff files just because
        | you need to be a master at compiling Chromium just to use
        | Courgette.
        
       ___________________________________________________________________
       (page generated 2023-01-08 23:00 UTC)