[HN Gopher] How does rsync work? ___________________________________________________________________ How does rsync work? Author : secure Score : 228 points Date : 2022-07-02 12:41 UTC (10 hours ago) (HTM) web link (michael.stapelberg.ch) (TXT) w3m dump (michael.stapelberg.ch) | lloeki wrote: | When trying to understand rsync and the rolling checksum I | stumbled upon a small Python implementation in some self-hosted | corner of the web[0], which I have archived on GH[1] (not the | author, but things can vanish quickly, as proved by the bzr repo | which went _poof_ [2]). | | [0]: https://blog.liw.fi/posts/rsync-in-python/ | | [1]: https://github.com/lloeki/rsync/blob/master/rsync.py | | [2]: https://code.liw.fi/obsync/bzr/trunk/ | mmlb wrote: | For bzr did you try archive.org? | https://web.archive.org/web/20150321194412/https://code.liw.... | generalizations wrote: | It's not actually there. | | https://web.archive.org/web/20150321212547/http://code.liw.f... | jwilk wrote: | I don't think you can clone that. | lazypenguin wrote: | Nice write-up. rsync is great as an application, but I found it | more cumbersome to use when I wanted to integrate it into my own | application. There's librsync, but the documentation is threadbare | and it requires an rsync server to run. I found bita/bitar | (https://github.com/oll3/bita), which is inspired by rsync & | family. It works more like zsync, which leverages HTTP Range | requests, so it doesn't require anything running on the server to | get chunks. Works like a treat using s3/b2 storage to serve files | and get incremental differential updates on the client side! | oll3 wrote: | Always happy to see my pet project (bita) mentioned and that it | is actually being used by others besides me :) | bigChris wrote: | rsync's worst issue is someone port-scanning and brute-forcing | their way into your system. Turn off the port. | hatware wrote: | Don't most consumer routers have all ports blocked? 
Who is | connecting a computer directly to the modem these days? | stavros wrote: | Most of them do NAT, which isn't semantically the same as | blocking all ports, but functionally it is. | seunosewa wrote: | Most people use rsync over ssh. | srvmshr wrote: | I encountered a strange situation 2 days ago. I rsync my PDF | files periodically between my hard drives. rsync showed no | differences between two folder trees, but if I did `diff -r` | between the two, 3 PDFs came out different. | | I checked the three individually but they showed no corruption or | changes on either side. How can this happen? | | Edit: the hard drive copy was previously rsynced from this copy & | both copies are mirrored with a Google Cloud bucket. | | The 3 files which showed as different have the same MD5 checksum. | cliZX81 wrote: | The same happened to me syncing to an SD card. The reason may be | different timestamp resolutions, which can make the same files look | different. I synced from HFS+ to FAT32 back then. | orangepurple wrote: | rsync uses a heuristic based on file times and sizes to compare | files. To compare file content, use the --checksum feature | (computationally expensive to run). | Dylan16807 wrote: | Rsync can checksum a lot of megabytes per second. In general | I'd say the disk IO is much more expensive than the | computation. | srvmshr wrote: | Yes, but it doesn't answer why rsync with checksums passes the set | of files as the same, while diff reports them different. | pronoiac wrote: | It isn't clear from the conversation so far: do you use | "--checksum"? | samoppy wrote: | boomskats wrote: | This was a great write-up. I've already sent it to a few people. | | On the question of what happens if a file's contents change after | the initial checksum, the man page for rsync[0] has an | interesting explanation of the *--checksum* option: | | > This changes the way rsync checks if the files have been | changed and are in need of a transfer. 
Without this option, rsync | uses a "quick check" that (by default) checks if each file's size | and time of last modification match between the sender and | receiver. This option changes this to compare a 128-bit checksum | for each file that has a matching size. Generating the checksums | means that both sides will expend a lot of disk I/O reading all | the data in the files in the transfer (and this is prior to any | reading that will be done to transfer changed files), so this can | slow things down significantly. | | > The sending side generates its checksums while it is doing the | file-system scan that builds the list of the available files. The | receiver generates its checksums when it is scanning for changed | files, and will checksum any file that has the same size as the | corresponding sender's file: files with either a changed size or | a changed checksum are selected for transfer. | | > Note that rsync always verifies that each transferred file was | correctly reconstructed on the receiving side by checking a | whole-file checksum that is generated as the file is transferred, | but that automatic after-the-transfer verification has nothing to | do with this option's before-the-transfer "Does this file need to | be updated?" check. For protocol 30 and beyond (first supported | in 3.0.0), the checksum used is MD5. For older protocols, the | checksum used is MD4. | | [0]: https://linux.die.net/man/1/rsync | londons_explore wrote: | Failure cases of the 'quick check': | | * Underlying disk device corruption - but modern disks do | internal error checking, and should emit an IO error. | | * Corruption in RAM/software bug in the kernel IO subsystem. | Should be detected by filesystem checksumming. | | * User has accidentally modified file and set mtime back. | _--checksum fixes this case_. | | * User has maliciously modified file and set mtime back. Since | it's MD5 (broken), the malicious user can make the checksum | match too. _checksumming doesn't help_. 
| | Given that, I see no users who really benefit from | checksumming. It isn't sufficient for anyone with really high | data integrity requirements, while also being overkill for | typical use cases. | naniwaduni wrote: | > * User has maliciously modified file and set mtime back. | Since it's MD5 (broken), the malicious user can make the | checksum match too. checksumming doesn't help. | | No, MD5 is not broken like that (yet). There is no known | second-preimage attack against MD5; the practical collision | vulns only affect cases where an attacker controls the file | content both before _and_ after the update. | tjoff wrote: | The very existence of filesystem checksumming is because your | first point isn't always true. | | Also, filesystem checksumming does not guard against | RAM/kernel bugs. On top of that, filesystem checksumming is | very rare. | jwilk wrote: | > https://linux.die.net/man/1/rsync | | linux.die.net is horribly outdated. This particular page is | from 2009. | | Up-to-date docs are here: | | https://download.samba.org/pub/rsync/rsync.1 | secure wrote: | Which version is the Samba one? Latest release? Git? | | If you want to see the man page of the version in Debian, | that would be | https://manpages.debian.org/testing/rsync/rsync.1.en.html | | Disclaimer: I wrote the software behind manpages.debian.org | :) | jwilk wrote: | "This manpage is current for version 3.2.5dev of rsync" - | so I guess it's from git. | kzrdude wrote: | I guess zfs send and similar are better solutions, but what if | we could query the filesystem for existing checksums of a file | and save IO that way, if filesystems on both sides already | stored usable checksums? | formerly_proven wrote: | Unless you are also doing FS-level deduplication using the | same checksums, it generally makes no sense for these to be | cryptographic hashes, so they're not necessarily suitable for | this purpose. | | IIRC neither ZFS nor btrfs use cryptographic hashes for | checksumming by default. 
| throw0101a wrote: | > _on is a short hand for fletcher4 for non-deduped | datasets and sha256 for deduped datasets_ | | * https://openzfs.github.io/openzfs-docs/Basic%20Concepts/Chec... | | * https://people.freebsd.org/~asomers/fletcher.pdf | | * https://en.wikipedia.org/wiki/Fletcher%27s_checksum | | Strangely enough, SHA-512 is actually ~50% faster than | SHA-256: | | > _ZFS actually uses a special version of SHA512 called | SHA512t256, it uses a different initial value, and | truncates the results to 256 bits (that is all the room | there is in the block pointer). The advantage is only that | it is faster on 64 bit CPUs._ | | * https://twitter.com/klarainc/status/1367199461546614788 | slavik81 wrote: | It seems that SHA512t256 is another name for SHA-512/256. | It's a shame that the initialization is different from | SHA-512, as it would have been very useful to be able to | convert a SHA-512 hash into a SHA-512/256 hash. | [deleted] | js2 wrote: | > I guess zfs send and similar are better solutions. | | It depends. I recently built a new zfs pool server and needed | to transfer a few TB of data from the old pool to the new | pool, but I built the new pool with a larger record size. If | I'd used zfs send the files would have retained their | existing record size. So rsync it was. | throw0101a wrote: | > _For protocol 30 and beyond (first supported in 3.0.0), the | checksum used is MD5. For older protocols, the checksum used is | MD4._ | | Newer versions (>=3.2?) support xxHash and xxHash3: | | * https://github.com/WayneD/rsync/blob/master/checksum.c | | * https://github.com/Cyan4973/xxHash | | * https://news.ycombinator.com/item?id=19402602 (2019 XXH3 | discussion) | aidenn0 wrote: | I thought xxHash was only used for the chunk hash, not the | whole file hash? 
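The chunk-level weak hash mentioned above — the rolling checksum from the original Tridgell/Mackerras algorithm — can be sketched in Python. This follows the two-part sum from the paper; the modulus and names here are illustrative, and rsync's real implementation differs in details:

```python
M = 1 << 16  # modulus for both halves of the weak checksum

def weak_checksum(block):
    # a: plain byte sum; b: position-weighted sum (earlier bytes weigh more).
    a = sum(block) % M
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, block_len):
    # Slide the window one byte in O(1): drop out_byte, append in_byte.
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b
```

The O(1) roll step is what lets one side scan its file at every byte offset cheaply; the strong per-chunk hash (MD4/MD5, or xxHash in newer versions) is only checked when the weak checksum matches.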
| throw0101a wrote: | This is also available as a video, "Why I wrote my own rsync": | | * https://media.ccc.de/v/gpn20-41-why-i-wrote-my-own-rsync | | * https://www.youtube.com/watch?v=wpwObdgemoE | throw0101a wrote: | See also the 1996 original paper by Tridgell (also of Samba fame) | and Mackerras: | | * https://rsync.samba.org/tech_report/ | | * https://www.andrew.cmu.edu/course/15-749/READINGS/required/c... | secure wrote: | Another great resource from Tridgell himself is this Ottawa | Linux Symposium talk: | http://olstrans.sourceforge.net/release/OLS2000-rsync/OLS200... | CamperBob2 wrote: | I don't see why any of this is needed. Just install Dropbox, | and... | LambdaComplex wrote: | I'm gonna interpret this in the best faith possible, assume | you're referencing the infamous Dropbox HN comment, upvote you | to counteract the downvotes from people who missed the joke, | and link to the aforementioned comment: | https://news.ycombinator.com/item?id=9224 | CamperBob2 wrote: | The irony is that BrandonM's classic HN comment makes _more_ | sense these days, as Dropbox continues its evolution towards | Microsofthood. Increasingly, using Dropbox means putting up | with an unceasing deluge of product promotions while you're | trying to get your work done. No such annoyances with rsync | and ftp and the like. | | Just last week, Dropbox unilaterally decided I didn't want | local copies of the shared files on my laptop, which made for | some awkwardness inside a secured facility with no Internet | access. | wardedVibe wrote: | Let them use rsync for you? ___________________________________________________________________ (page generated 2022-07-02 23:00 UTC)