[HN Gopher] How does rsync work?
       ___________________________________________________________________
        
       How does rsync work?
        
       Author : secure
       Score  : 228 points
       Date   : 2022-07-02 12:41 UTC (10 hours ago)
        
 (HTM) web link (michael.stapelberg.ch)
 (TXT) w3m dump (michael.stapelberg.ch)
        
       | lloeki wrote:
       | When trying to understand rsync and the rolling checksum I
       | stumbled upon a small python implementation in some self-hosted
       | corner of the web[0], which I have archived on GH[1] (not the
       | author, but things can vanish quickly, as proved by the bzr repo
       | which went _poof_ [2]).
       | 
       | [0]: https://blog.liw.fi/posts/rsync-in-python/
       | 
       | [1]: https://github.com/lloeki/rsync/blob/master/rsync.py
       | 
       | [2]: https://code.liw.fi/obsync/bzr/trunk/
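
For readers who want the rolling checksum mentioned above in concrete form, here is a minimal sketch of the weak checksum as described in the rsync technical report (two 16-bit sums, updated in constant time as the window slides by one byte). The function names are made up for illustration, and this is not code from the linked Python implementation or from rsync itself.

```python
from typing import Tuple

M = 1 << 16  # both running sums are kept modulo 2^16


def weak_checksum(block: bytes) -> Tuple[int, int]:
    """Compute the (a, b) pair for one block from scratch."""
    n = len(block)
    a = sum(block) % M
    # The first byte is weighted n, the last byte is weighted 1.
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, block_len: int) -> Tuple[int, int]:
    """Slide the window one byte: drop out_byte on the left, add in_byte on the right."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b


def combine(a: int, b: int) -> int:
    """Pack the two 16-bit halves into a single 32-bit value."""
    return a | (b << 16)
```

Rolling the pair across a buffer should give the same values as recomputing `weak_checksum` at each offset, which is what makes searching for matching blocks at every byte offset affordable.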
        
         | mmlb wrote:
         | For bzr did you try archive.org?
         | https://web.archive.org/web/20150321194412/https://code.liw....
        
           | generalizations wrote:
           | It's not actually there.
           | 
           | https://web.archive.org/web/20150321212547/http://code.liw.f...
        
           | jwilk wrote:
           | I don't think you can clone that.
        
       | lazypenguin wrote:
       | Nice write up. rsync is great as an application but I found it
       | more cumbersome to use when wanting to integrate it into my own
       | application. There's librsync but the documentation is threadbare
       | and it requires an rsync server to run. I found bita/bitar
       | (https://github.com/oll3/bita) which is inspired by rsync &
       | family. It works more like zsync which leverages HTTP Range
       | requests so it doesn't require anything running on the server to
       | get chunks. Works like a treat using s3/b2 storage to serve files
       | and get incremental differential updates on the client side!
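
As a rough illustration of the HTTP Range approach described above, the sketch below turns a list of missing chunks into a single `Range` header value, merging adjacent or overlapping chunks. It is not bita's or zsync's actual logic; the function name and the `(offset, length)` chunk representation are assumptions for the example, and lengths are assumed to be positive.

```python
from typing import Iterable, List, Tuple


def range_header(chunks: Iterable[Tuple[int, int]]) -> str:
    """Build a 'bytes=a-b,c-d' value from (offset, length) pairs.

    HTTP byte ranges use inclusive end positions, so a chunk at offset 0
    with length 100 becomes '0-99'.
    """
    merged: List[List[int]] = []
    for offset, length in sorted(chunks):
        end = offset + length - 1
        if merged and offset <= merged[-1][1] + 1:
            # Touches or overlaps the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return "bytes=" + ",".join(f"{start}-{end}" for start, end in merged)
```

The result can be sent as the `Range` request header with any HTTP client. Whether a given server or storage backend honors multi-range requests varies, so a real client may need to fall back to one request per range.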
        
         | oll3 wrote:
         | Always happy to see my pet project mentioned (bita) and that it
         | is actually being used by others than me :)
        
       | bigChris wrote:
       | Rsync's worst issue is someone port scanning and brute-forcing
       | their way into your system. Turn off your port.
        
         | hatware wrote:
         | Don't most consumer routers have all ports blocked? Who is
         | connecting a computer directly to the modem these days?
        
           | stavros wrote:
           | Most of them do NAT, which isn't semantically the same as
           | blocking all ports, but functionally it is.
        
         | seunosewa wrote:
         | Most people use rsync over ssh
        
       | srvmshr wrote:
       | I encountered a strange situation 2 days ago. I rsync my pdf
       | files periodically between my harddrives. rsync showed no
       | differences between two folder trees, but if I did `diff -r`
       | between the two, 3 pdfs came out different.
       | 
       | I checked the three individually but they showed no corruption or
       | changes either side. How can this happen?
       | 
       | Edit: the hard drive copy is previously rsynced from this copy &
       | both copies are mirrored with google cloud bucket.
       | 
       | The 3 files which showed different have the same MD5 checksum
        
         | cliZX81 wrote:
         | The same happened to me syncing to an SD card. The reason may be
         | different timestamp resolutions, which make same files look
         | different. I synced from HFS+ to FAT32 back then.
        
         | orangepurple wrote:
         | rsync uses a heuristic based on file times and sizes to compare
         | files. To compare file content, use the --checksum option
         | (computationally expensive to run)
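
A minimal sketch of a size-and-mtime heuristic of the kind described above, with an optional tolerance that mimics the idea behind rsync's `--modify-window` option for filesystems with coarse timestamps. The function name and parameters are hypothetical, and rsync's real comparison has more cases than this.

```python
import os


def quick_check_same(src: str, dst: str, modify_window: float = 0) -> bool:
    """Return True if dst looks up to date judging by size and mtime only.

    File contents are never read, so two files with different bytes but
    equal size and mtime are reported as the same.
    """
    try:
        s, d = os.stat(src), os.stat(dst)
    except FileNotFoundError:
        return False
    if s.st_size != d.st_size:
        return False
    return abs(s.st_mtime - d.st_mtime) <= modify_window
```

With `modify_window=0` a one-second timestamp difference counts as a change, while a window of 2 would hide it, which is one way to cope with a destination that stores timestamps at coarser resolution.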
        
           | Dylan16807 wrote:
           | Rsync can checksum a lot of megabytes per second. In general
           | I'd say the disk IO is much more expensive than the
           | computation.
        
           | srvmshr wrote:
           | Yes but it doesn't answer why rsync & checksum pass the set
           | of files as same, but diff reports them different.
        
             | pronoiac wrote:
             | It isn't clear from the conversation so far: do you use
             | "--checksum"?
        
       | boomskats wrote:
       | This was a great write up. I've already sent it to a few people.
       | 
       | On the question of what happens if a file's contents change after
       | the initial checksum, the man page for rsync[0] has an
       | interesting explanation of the *--checksum* option:
       | 
       | > This changes the way rsync checks if the files have been
       | changed and are in need of a transfer. Without this option, rsync
       | uses a "quick check" that (by default) checks if each file's size
       | and time of last modification match between the sender and
       | receiver. This option changes this to compare a 128-bit checksum
       | for each file that has a matching size. Generating the checksums
       | means that both sides will expend a lot of disk I/O reading all
       | the data in the files in the transfer (and this is prior to any
       | reading that will be done to transfer changed files), so this can
       | slow things down significantly.
       | 
       | > The sending side generates its checksums while it is doing the
       | file-system scan that builds the list of the available files. The
       | receiver generates its checksums when it is scanning for changed
       | files, and will checksum any file that has the same size as the
       | corresponding sender's file: files with either a changed size or
       | a changed checksum are selected for transfer.
       | 
       | > Note that rsync always verifies that each transferred file was
       | correctly reconstructed on the receiving side by checking a
       | whole-file checksum that is generated as the file is transferred,
       | but that automatic after-the-transfer verification has nothing to
       | do with this option's before-the-transfer "Does this file need to
       | be updated?" check. For protocol 30 and beyond (first supported
       | in 3.0.0), the checksum used is MD5. For older protocols, the
       | checksum used is MD4.
       | 
       | [0]: https://linux.die.net/man/1/rsync
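
The decision rule quoted from the man page (checksum only files whose sizes match, and transfer on a changed size or a changed checksum) can be sketched roughly as follows. This is an illustration of the rule, not rsync's code; the helper names are invented, and MD5 is used only because the quoted text names it.

```python
import hashlib
import os


def file_md5(path: str, bufsize: int = 1 << 20) -> bytes:
    """Hash a whole file in fixed-size reads so memory use stays bounded."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.digest()


def needs_transfer(src: str, dst: str) -> bool:
    """Return True if dst is missing, differs in size, or differs in checksum."""
    if not os.path.exists(dst):
        return True
    if os.path.getsize(src) != os.path.getsize(dst):
        return True  # size differs: no need to read either file
    return file_md5(src) != file_md5(dst)
```

Both files are read in full whenever the sizes match, which is the disk I/O cost the man page warns about.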
        
         | londons_explore wrote:
         | Failure cases of the 'quick check':
         | 
         | * Underlying disk device corruption - but modern disks do
         | internal error checking, and should emit an IO error.
         | 
         | * Corruption in RAM/software bug in the kernel IO subsystem.
         | Should be detected by filesystem checksumming.
         | 
         | * User has accidentally modified file and set mtime back.
         | _checksumming fixes this case_.
         | 
         | * User has maliciously modified file and set mtime back. Since
         | it's MD5 (broken), the malicious user can make the checksum
         | match too. _checksumming doesn't help_.
         | 
         | Given that, I see no users who really benefit from
         | checksumming. It isn't sufficient for anyone with really high
         | data integrity requirements, while also being overkill for
         | typical usecases.
        
           | naniwaduni wrote:
           | > * User has maliciously modified file and set mtime back.
           | Since it's MD5 (broken), the malicious user can make the
           | checksum match too. checksumming doesn't help.
           | 
           | No, md5 is not broken like that (yet). There is no known
           | second-preimage attack against md5; the practical collision
           | vulns only affect cases where an attacker controls the file
           | content both before _and_ after update.
        
           | tjoff wrote:
           | The very existence of filesystem checksumming is because your
           | first point isn't always true.
           | 
           | Also, filesystem checksumming does not guard against
           | ram/kernel-bugs. On top of that file system checksumming is
           | very rare.
        
         | jwilk wrote:
         | > https://linux.die.net/man/1/rsync
         | 
         | linux.die.net is horribly outdated. This particular page is
         | from 2009.
         | 
         | Up-to-date docs are here:
         | 
         | https://download.samba.org/pub/rsync/rsync.1
        
           | secure wrote:
           | Which version is the samba one? Latest release? Git?
           | 
           | If you want to see the man page of the version in Debian,
           | that would be
           | https://manpages.debian.org/testing/rsync/rsync.1.en.html
           | 
           | Disclaimer: I wrote the software behind manpages.debian.org
           | :)
        
             | jwilk wrote:
             | "This manpage is current for version 3.2.5dev of rsync" -
             | so I guess it's from git.
        
         | kzrdude wrote:
         | I guess zfs send and similar are better solutions, but what if
         | we could query the filesystem for existing checksums of a file
         | and save IO that way, if filesystems on both sides already
         | stored usable checksums?
        
           | formerly_proven wrote:
           | Unless you are also doing FS-level deduplication using the
           | same checksums, it generally makes no sense for these to be
           | cryptographic hashes, so they're not necessarily suitable for
           | this purpose.
           | 
           | IIRC neither ZFS nor btrfs use cryptographic hashes for
           | checksumming by default.
        
             | throw0101a wrote:
             | > _on is a short hand for fletcher4 for non-deduped
             | datasets and sha256 for deduped datasets_
             | 
             | * https://openzfs.github.io/openzfs-docs/Basic%20Concepts/Chec...
             | 
             | * https://people.freebsd.org/~asomers/fletcher.pdf
             | 
             | * https://en.wikipedia.org/wiki/Fletcher%27s_checksum
             | 
             | Strangely enough SHA-512 is actually (50%) faster than
             | -256:
             | 
             | > _ZFS actually uses a special version of SHA512 called
             | SHA512t256, it uses a different initial value, and
             | truncates the results to 256 bits (that is all the room
             | there is in the block pointer). The advantage is only that
             | it is faster on 64 bit CPUs._
             | 
             | * https://twitter.com/klarainc/status/1367199461546614788
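
To give a feel for why a Fletcher-style checksum is so much cheaper than a cryptographic hash, here is a small sketch of a four-accumulator Fletcher sum over 32-bit words with 64-bit wraparound, which is the general shape of fletcher4. Little-endian word order and the function name are assumptions of this example; it is not taken from the ZFS source and is not guaranteed to match on-disk values.

```python
import struct
from typing import Tuple

MASK64 = (1 << 64) - 1  # accumulators wrap at 64 bits


def fletcher4(data: bytes) -> Tuple[int, int, int, int]:
    """Return the four running sums for data, read as 32-bit little-endian words."""
    if len(data) % 4:
        raise ValueError("input length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d
```

Each word costs four additions, and the higher-order sums make the result sensitive to word order. There is no resistance to deliberate collisions, which is why deduplication needs a cryptographic hash instead.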
        
               | slavik81 wrote:
               | It seems that SHA512t256 is another name for SHA-512/256.
               | It's a shame that the initialization is different from
               | SHA-512, as it would have been very useful to be able to
               | convert a SHA-512 hash into a SHA-512/256 hash.
        
               | [deleted]
        
           | js2 wrote:
           | > I guess zfs send and similar are better solutions.
           | 
           | It depends. I recently built a new zfs pool server and needed
           | to transfer a few TB of data from the old pool to the new
           | pool, but I built the new pool with a larger record size. If
           | I'd used zfs send the files would have retained their
           | existing record size. So rsync it was.
        
         | throw0101a wrote:
         | > _For protocol 30 and beyond (first supported in 3.0.0), the
         | checksum used is MD5. For older protocols, the checksum used is
         | MD4._
         | 
         | Newer versions (>=3.2?) support xxHash and xxHash3:
         | 
         | * https://github.com/WayneD/rsync/blob/master/checksum.c
         | 
         | * https://github.com/Cyan4973/xxHash
         | 
         | * https://news.ycombinator.com/item?id=19402602 (2019 XXH3
         | discussion)
        
           | aidenn0 wrote:
           | I thought xxHash was only used for the chunk hash, not the
           | whole file hash?
        
       | throw0101a wrote:
       | This is also available as a video, "Why I wrote my own rsync":
       | 
       | * https://media.ccc.de/v/gpn20-41-why-i-wrote-my-own-rsync
       | 
       | * https://www.youtube.com/watch?v=wpwObdgemoE
        
       | throw0101a wrote:
       | See also the 1996 original paper by Tridgell (also of Samba fame)
       | and Mackerras:
       | 
       | * https://rsync.samba.org/tech_report/
       | 
       | * https://www.andrew.cmu.edu/course/15-749/READINGS/required/c...
        
         | secure wrote:
         | Another great resource from Tridgell himself is this Ottawa
         | Linux Symposium talk:
         | http://olstrans.sourceforge.net/release/OLS2000-rsync/OLS200...
        
       | CamperBob2 wrote:
       | I don't see why any of this is needed. Just install Dropbox,
       | and...
        
         | LambdaComplex wrote:
         | I'm gonna interpret this in the best faith possible, assume
         | you're referencing the infamous Dropbox HN comment, upvote you
         | to counteract the downvotes from people who missed the joke,
         | and link to the aforementioned comment:
         | https://news.ycombinator.com/item?id=9224
        
           | CamperBob2 wrote:
           | The irony is that BrandonM's classic HN comment makes _more_
           | sense these days, as Dropbox continues its evolution towards
           | Microsofthood. Increasingly, using Dropbox means putting up
           | with an unceasing deluge of product promotions while you're
           | trying to get your work done. No such annoyances with rsync
           | and ftp and the like.
           | 
           | Just last week, Dropbox unilaterally decided I didn't want
           | local copies of the shared files on my laptop, which made for
           | some awkwardness inside a secured facility with no Internet
           | access.
        
         | wardedVibe wrote:
         | Let them use rsync for you?
        
       ___________________________________________________________________
       (page generated 2022-07-02 23:00 UTC)