[HN Gopher] CDC File Transfer
___________________________________________________________________

CDC File Transfer

Author : cglong
Score  : 131 points
Date   : 2023-01-08 21:46 UTC (1 hour ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| teraflop wrote:
| TL;DR: they rediscovered the context-dependent variable block size technique that Tarsnap uses: https://www.tarsnap.com/download/EuroBSDCon13.pdf
|
| crazygringo wrote:
| They don't claim to have invented CDC or FastCDC; they just made and are sharing a useful implementation of it.
|
| And if that Tarsnap presentation is from 2013, and FastCDC was published in 2016 [1] according to Wikipedia [2], then presumably Tarsnap didn't invent FastCDC either.
|
| [1] https://www.usenix.org/system/files/conference/atc16/atc16-p...
|
| [2] https://en.wikipedia.org/wiki/Rolling_hash#Gear_fingerprint_...
|
| jyxent wrote:
| Another well-cited predecessor is "A low-bandwidth network file system" (https://dl.acm.org/doi/abs/10.1145/502034.502052), published in 2001. It uses Rabin fingerprinting to define chunk boundaries.
|
| kccqzy wrote:
| The Tarsnap technique is interesting, but that presentation is a bit hard to follow. What are example values for alpha and p?
|
| elromulous wrote:
| I took a quick glance at the repo; it wasn't totally clear to me whether the client is Windows-only. Anyone know more?
|
| Moissanite wrote:
| I thought the same; it seems to be perfectly fine as a Linux-to-Linux tool.
|
| m000 wrote:
| It would be interesting to also try this on more underpowered systems, e.g. aging armv5-based NAS boxes. Would we see similar performance improvements, or would any gains be hampered by poor I/O performance and architectural inefficiencies?
|
| zX41ZdbW wrote:
| Content-Defined Chunking is one of my favorite algorithms because it has some "magic" similar to HyperLogLog, Bloom filters, etc.
| This algorithm is good to explain to people, to get them inspired by computer science. I usually explain the simplest variant, with rolling hashes.
|
| It would be interesting to see what the result would be (the average savings from deduplication) if it were applied globally to a large-scale blob storage such as Amazon S3 or Google Drive (we would need metadata storage about chunks, and the chunks could be deduplicated).
|
| PS. I don't use this algorithm in ClickHouse, but it always remains tempting.
|
| piceas wrote:
| Implementing FastCDC (2016) is fun.
|
| Do you have a suggestion on what to read on the topic since then? I don't keep up with these things. A quick search came up with the following, but I haven't read it yet.
|
| Fan Ni and Song Jiang, "RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication Systems", in Proceedings of the 2019 ACM Symposium on Cloud Computing (ACM SoCC'19), Santa Cruz, CA, November 2019.
|
| https://ranger.uta.edu/~sjiang/index.html
|
| windowsworkstoo wrote:
| Microsoft have this for DFS-R, and the spec is open: https://learn.microsoft.com/en-us/openspecs/windows_protocol... It's pretty straightforward to implement.
|
| sigmonsays wrote:
| Can we gut rsync and implement this?
|
| rsync doesn't even have a great protocol specification... maybe it's a new tool...
|
| progbits wrote:
| > Overall, cdc_rsync syncs files about 3 times faster than Cygwin rsync.
|
| Sounds promising.
|
| mjevans wrote:
| I'd like additional info about the Content Defined Chunking part of the specification.
|
| How is the content defined? Where I'd try to begin is with a single pass that looks for runs of null ('\0') bytes, even one byte long, as potential boundary ends. During that pass, also look for "magic signatures" of known stream types, like already-compressed content streams (all the more reason not to bother compressing them anyway).
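The "how is the content defined" question above is usually answered with a rolling hash rather than hand-picked markers: a chunk boundary is declared wherever a hash of the last few bytes matches a fixed bit pattern, so boundaries depend only on nearby content and survive insertions and deletions elsewhere. A minimal sketch, assuming a Gear-style hash in the FastCDC family; the gear table, mask, and size limits here are illustrative choices, not the values used by FastCDC or cdc_rsync:

```python
# Content-defined chunking with a Gear-style rolling hash.
# Illustrative parameters only, not the FastCDC or cdc_rsync values.
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # per-byte random values
MASK = (1 << 13) - 1          # expect roughly 8 KiB between boundaries
MIN_SIZE, MAX_SIZE = 2048, 65536

def chunk(data: bytes):
    """Split `data` into content-defined chunks."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Gear hash: shift and add a random value chosen by the byte.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i + 1 - start
        # Boundary when the hash's low bits are all zero, subject to
        # the minimum/maximum chunk size limits.
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Because the boundary test looks only at the most recent bytes, an insertion or deletion perturbs the chunking only near the edit; boundaries further on line up with the old ones again, which is what makes the chunks cheap to deduplicate.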
| The CDC might also be aware of some file structures (zip, 7z, tar, etc.) and have a dedicated segment-creation algorithm for them. At a low level, the two ends should exchange a list of segment offsets, lengths, checksums (partial?), and maybe some short fragments of bytes to check (e.g. 4-byte chunks at various powers-of-2 offsets or at major chunk starts). Where the two ends differ in which chunks exist, they might also expend some minor additional effort to investigate whether the chunks identified on the other side exist locally, in case the two versions are using different filters or happened to reach different conclusions.
|
| 5440 wrote:
| I'm currently working on a required CDC (Centers for Disease Control) reporting function for a COVID test. For a second I thought this article was going to be extremely helpful.
|
| brendoncarroll wrote:
| FastCDC is the same chunking algorithm used in Got.
|
| https://github.com/gotvc/got
|
| wahern wrote:
| Well, that's confusing: there's another Git follow-on project, from some OpenBSD developers, also called Got: http://gameoftrees.org/
|
| They both seem like very cool projects, so may the best Got win!
|
| jws wrote:
| To elaborate: rsync chunks in fixed sizes, so inserting or deleting a few bytes makes all the chunks different from that point onward.
|
| If instead you chunk based on local content (conceptually like chunking text into sentences at periods, but as a binary criterion with upper and lower size limits; I couldn't find the algorithm specification), then within a small number of bytes after an insertion or deletion you start getting the same chunks as before.
|
| This drastically reduces the cost of identifying unmodified chunks.
|
| brendoncarroll wrote:
| I don't think rsync uses fixed-size chunks. The algorithm is described here; it's a rolling hash.
|
| https://rsync.samba.org/tech_report/node3.html
|
| Your description of content-defined chunking is exactly right, though. There are a number of techniques for doing it. FastCDC is one of them, although not the one used in rsync.
|
| https://en.wikipedia.org/wiki/Rolling_hash
|
| EDIT: Corrected in the comments below. rsync uses fixed-size chunks, searched for at any offset with a rolling hash. The rsync algorithm description is here:
|
| https://rsync.samba.org/tech_report/node2.html
|
| teraflop wrote:
| rsync does use fixed-size chunks, but the rolling hash allows them to be identified even at offsets that aren't multiples of the chunk size.
|
| So a change partway through the file doesn't force rsync to actually re-transfer all of the subsequent unmodified chunks, but it does incur a computational cost to find them, since it has to search through all possible offsets.
|
| Waterluvian wrote:
| Maybe I'm projecting (probably), but I detect some subtext in the README's language. Like a "developers who are irate that Stadia was shut down, wanting to free anything they can for the community" type of feeling.
|
| elromulous wrote:
| Google / abc has a solid track record of open-sourcing shut-down projects. See Makani [1] and, IIRC, also Loon.
|
| [1] https://news.ycombinator.com/item?id=24456613
|
| Waterluvian wrote:
| Aha! I wasn't at all aware of that. Okay, this reads more like that to me the second time around: just excitement to open-source a cool piece of tech that came from Stadia.
|
| [deleted]
|
| encryptluks2 wrote:
| This is great for syncing, but what about separating Courgette from Chromium so that we can finally have a decent delta-diff program? I am tired of the Windows community relying on SmartVersion to create ISO diff files just because you need to be a master at compiling Chromium to use Courgette.
___________________________________________________________________
(page generated 2023-01-08 23:00 UTC)