[HN Gopher] ZFS 2.2.0 (RC): Block Cloning merged
       ___________________________________________________________________
        
       ZFS 2.2.0 (RC): Block Cloning merged
        
       Author : turrini
       Score  : 176 points
       Date   : 2023-07-04 15:46 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | miohtama wrote:
       | What are applications that benefit from block cloning?
        
         | aardvark179 wrote:
         | It can be a really convenient way to snapshot something if you
         | can arrange some point at which everything is synced to disk.
         | Get to that point, make your new files that start sharing all
         | their blocks, and then let your main db process (or whatever)
         | continue on as normal.
        
         | ikiris wrote:
          | I think the big piece is native overlayfs, so k8s setups
          | get a bit simpler.
        
         | philsnow wrote:
         | It seems kind of like hard linking but with copy-on-write for
         | the underlying data, so you'll get near-instant file copies and
         | writing into the middle of them will also be near-instant.
         | 
         | All of this happens under the covers already if you have dedup
         | turned on, but this allows utilities (gnu cp might be taught to
         | opportunistically and transparently use the new clone zfs
         | syscalls, because there is no downside and only upside) and
         | applications to tell zfs that "these blocks are going to be the
         | same as those" without zfs needing to hash all the new blocks
         | and compare them.
         | 
          | Additionally, for finer control, ranges of blocks can be
          | cloned, not just entire files.
         | 
         | I can't tell from the github issue, can this manual dedup /
         | block cloning be turned on if you're not already using dedup on
         | a dataset? Last time I set up zfs, I was warned that dedup took
         | gobs of memory, so I didn't turn it on.
        
           | rincebrain wrote:
           | It's orthogonal to dedup being on or off, and as someone else
           | said, it's more or less the same underlying semantics you
           | would expect from cp --reflink anywhere.
           | 
           | Also, as mentioned, on Linux, it's not wired up with any
           | interface to be used at all right now.
        
           | nabla9 wrote:
           | Gnu cp --reflink.
           | 
           | >When --reflink[=always] is specified, perform a lightweight
           | copy, where the data blocks are copied only when modified. If
           | this is not possible the copy fails, or if --reflink=auto is
           | specified, fall back to a standard copy. Use --reflink=never
           | to ensure a standard copy is performed."
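            | 
            | As a concrete sketch of the three modes (file names are
            | just placeholders, and keep in mind the sibling comment:
            | the Linux ZFS port doesn't expose a cloning interface yet,
            | so this only works where reflinks are already wired up):
            | 
            |     # clone, or fail if the filesystem can't do it
            |     cp --reflink=always big.img big-clone.img
            |     # clone if possible, else fall back to a full copy
            |     cp --reflink=auto big.img big-copy.img
            |     # force an ordinary byte-for-byte copy
            |     cp --reflink=never big.img big-plain.img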
        
         | danudey wrote:
          | As others have said: block cloning (a copy-on-write
          | technique) allows you to 'copy' a file without reading all
          | of the data and re-writing it.
         | 
         | For example, if you have a 1 GB file and you want to make a
         | copy of it, you need to read the whole file (all at once or in
         | parts) and then write the whole new file (all at once or in
         | parts). This results in 1 GB of reads and 1 GB of writes.
         | Obviously the slower (or more overloaded) your storage media
         | is, the longer this takes.
         | 
         | With block cloning, you simply tell the OS "I want this file A
         | to be a copy of this file B" and it creates a new "file" that
         | references all the blocks in the old "file". Given that a
         | "file" on a filesystem is just a list of blocks that make up
         | the data in that file, you can create a new "file" which has
         | pointers to the same blocks as the old "file". This is a simple
         | system call (or a few system calls), and as such isn't much
         | more intensive than simply renaming a file instead of copying
         | it.
         | 
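          | As a rough sketch of what that looks like in practice
          | (assuming a dataset named tank/data and a platform where
          | the clone path is actually hooked up), you can watch both
          | the time taken and the pool usage:
          | 
          |     cd /tank/data
          |     dd if=/dev/urandom of=big.bin bs=1M count=1024
          |     zfs list -o name,used tank/data   # note USED
          |     time cp --reflink=always big.bin big-clone.bin
          |     zfs list -o name,used tank/data   # USED barely moves
          | 
          | The clone returns almost immediately and consumes only a
          | little metadata rather than another gigabyte of data blocks.
          | 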
         | At my previous job we did builds for our software. This
         | required building the BIOS, kernel, userspace, generating the
         | UI, and so on. These builds required pulling down 10+ GB of git
         | repositories (the git data itself, the checkout, the LFS binary
         | files, external vendor SDKs), and then a large amount of build
         | artifacts on top of that. We also needed to do this build for
         | 80-100 different product models, for both release and debug
         | versions. This meant 200+ copies of the source code alone (not
         | to mention build artifacts and intermediate products), and
         | because of disk space limitations this meant we had to
         | dramatically reduce the number of concurrent builds we could
         | run. The solution we came up with was something like:
         | 
         | 1. Check out the source code
         | 
         | 2. Create an overlayfs filesystem to mount into each build
         | space
         | 
         | 3. Do the build
         | 
         | 4. Tear down the overlayfs filesystem
         | 
         | This was problematic if we weren't able to mount the
         | filesystem, if we weren't able to unmount the filesystem
         | (because of hanging file descriptors or processes), and so on.
         | Lots of moving parts, lots of `sudo` commands in the scripts,
         | and so on.
         | 
         | Copy-on-write would have solved this for us by accomplishing
         | the same thing; we could simply do the following:
         | 
         | 1. Check out the source code
         | 
         | 2. Have each build process simply `cp -R --reflink=always
         | source/ build_root/`; this would be instantaneous and use no
         | new disk space.
         | 
         | 3. Do the build
         | 
         | 4. `rm -rf build_root`
         | 
         | Fewer moving parts, no root access required, generally simpler
         | all around.
        
         | thrill wrote:
         | FTFA: "Block Cloning allows to clone a file (or a subset of its
         | blocks) into another (or the same) file by just creating
         | additional references to the data blocks without copying the
         | data itself. Block Cloning can be described as a fast, manual
         | deduplication."
        
         | the8472 wrote:
         | Any copy command. On-demand deduplication managed by userspace.
         | 
         | https://man7.org/linux/man-pages/man2/ioctl_fideduperange.2....
         | https://man7.org/linux/man-pages/man2/copy_file_range.2.html
         | https://github.com/markfasheh/duperemove
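          | 
          | For example, duperemove (a sketch; exact flags vary between
          | versions) hashes file contents and then asks the kernel to
          | share the duplicate extents via FIDEDUPERANGE:
          | 
          |     duperemove -r -d /tank/data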
        
         | vovin wrote:
          | This is huge. One practical application is fast recovery of
          | a file from a past snapshot without using any additional
          | space. I use a ZFS dataset for my vCenter datastore (storing
          | my vmdk files). If I need to launch a clone from a past
          | state, I could use block cloning to bring back a past vmdk
          | file without actually copying it - saving both space and
          | time when making such a clone.
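          | 
          | A sketch of what that could look like, assuming snapshot
          | directories are visible and cloning out of snapshots is
          | supported (names made up):
          | 
          |     cp --reflink=always \
          |       /tank/vmstore/.zfs/snapshot/before-upgrade/vm01.vmdk \
          |       /tank/vmstore/vm01-restored.vmdk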
        
           | bithavoc wrote:
           | Can you elaborate a bit more on how you use ZFS with vCenter?
           | How do you mount it?
        
         | mustache_kimono wrote:
         | Excited, because in addition to ref copies/clones, httm will
         | use this feature, if available (I've already done some work to
         | implement), for its `--roll-forward` operation, and for faster
         | file recoveries from snapshots [0].
         | 
          | As I understand it, there will be no need to copy any data
          | from the same dataset, and _this includes all snapshots_.
          | Blocks written to the live dataset can just be references to
          | the underlying blocks, and no additional space will need to
          | be used.
         | 
         | Imagine being able to continuously switch a file or a dataset
         | back to a previous state extremely quickly without a heavy
         | weight clone, or a rollback, etc.
         | 
         | Right now, httm simply diff copies the blocks for file recovery
         | and roll-forward. For further details, see the man page entry
         | for `--roll-forward`, and the link to the httm GitHub below:
          | 
          |     --roll-forward="snap_name"
          |         Traditionally 'zfs rollback' is a destructive
          |         operation, whereas httm roll-forward is
          |         non-destructive. httm will copy only the blocks and
          |         file metadata that have changed since a specified
          |         snapshot, from that snapshot, to its live dataset.
          |         httm will also take two precautionary snapshots,
          |         one before and one after the copy. Should the roll
          |         forward fail for any reason, httm will roll back to
          |         the pre-execution state. Note: This is a ZFS only
          |         option which requires super user privileges.
         | 
         | [0]: https://github.com/kimono-koans/httm
        
       | rossmohax wrote:
       | Does ZFS or any other FS offer special operations which DB engine
       | like RocksDB, SQLite or PostgreSQL could benefit from if they
       | decided to target that FS specifically?
        
         | magicalhippo wrote:
         | Internally, ZFS is kinda like an object store[1], and there was
         | a project trying to expose the ZFS internals through an object
         | store API rather than through a filesystem API.
         | 
         | Sadly I can't seem to find the presentation or recall the name
         | of the project.
         | 
         | On the other hand, looking at for example RocksDB[2]:
         | 
         |  _File system operations are not atomic, and are susceptible to
         | inconsistencies in the event of system failure. Even with
         | journaling turned on, file systems do not guarantee consistency
         | on unclean restart. POSIX file system does not support atomic
         | batching of operations either. Hence, it is not possible to
         | rely on metadata embedded in RocksDB datastore files to
         | reconstruct the last consistent state of the RocksDB on
         | restart. RocksDB has a built-in mechanism to overcome these
         | limitations of POSIX file system [...]_
         | 
         | ZFS _does_ provide atomic operations internally[1], so if
         | exposed it seems something like RocksDB could take advantage of
         | that and forego all the complexity mentioned above.
         | 
         | How much that would help I don't know though, but seems
         | potentially interesting at first glance.
         | 
         | [1]: https://youtu.be/MsY-BafQgj4?t=442
         | 
         | [2]: https://github.com/facebook/rocksdb/wiki/MANIFEST
        
       | ludde wrote:
       | A M A Z I N G
       | 
       | Have been looking forward to this for years!
       | 
       | This is so much better than automatically doing dedup and the RAM
       | overhead that entails.
       | 
        | Doing offline/in-RAM dedup size optimizations seems like a
        | really good optimization path. It's in the spirit of paying
        | only for what you use and not the rest.
       | 
       | Edit: What's the RAM overhead of this? Is it ~64B per 128kB
       | deduped block or what's the magnitude of things?
        
         | mlyle wrote:
         | > Edit: What's the RAM overhead of this? Is it ~64B per 128kB
         | deduped block or what's the magnitude of things?
         | 
         | No real memory impact. There's a regions table that uses 128k
         | of memory per terabyte of total storage (and may be a bit more
         | in the future). So for your 10 petabyte pool using deduping,
         | you'd better have an extra gigabyte of RAM.
         | 
         | But erasing files can potentially be twice as expensive in
         | IOPS, even if not deduped. They try to prevent this.
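          | 
          | Taking those figures at face value, the arithmetic is
          | roughly:
          | 
          |     # 128 KiB of table per TiB, for a 10 PiB pool
          |     echo "$((128 * 1024 * 10)) KiB"   # 1310720 KiB ~ 1.25 GiB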
        
       | uvatbc wrote:
        | Technically, yes: through the use of TrueNAS, which gives us
        | API access to iSCSI on ZFS.
        
       | GauntletWizard wrote:
       | > Note: currently it is not possible to clone blocks between
       | encrypted datasets, even if those datasets use the same
       | encryption key (this includes snapshots of encrypted datasets).
       | Cloning blocks between datasets that use the same keys should be
       | possible and should be implemented in the future.
       | 
       | Once this is ready, I am going to subdivide my user homedir much
       | more than it already is. The biggest obstacle in the way of this
       | has been that it would waste a bunch of space until the snapshots
       | were done rolling over, which for me is a long time (I keep
       | weekly snapshots of my homedir for a year).
        
         | yjftsjthsd-h wrote:
         | Is there a benefit to breaking up your home directory?
        
           | GauntletWizard wrote:
           | Controlling the rate and location of snapshots, mostly. I've
           | broken out some kinds of datasets (video archives) but not
           | others historically (music). It doesn't matter that much, but
           | I want to split some more chunks out.
        
             | yjftsjthsd-h wrote:
             | Fair enough. I've personally slowly moved to a smaller
             | number of filesystems, but if you're actually handling
             | snapshots differently per-area then it makes sense (indeed,
             | one of the reasons I'm consolidating is the realization
              | that _personally_ I'm almost never going to
             | snapshot/restore things separately).
        
           | someplaceguy wrote:
           | Not sure why you'd want to do that to your home directory
           | usually, but it depends on what you store in it and how you
           | use it, really.
           | 
           | In general, breaking up a filesystem into multiple ones in
           | ZFS is mostly useful for making filesystem management more
           | fine-grained, as a filesystem/dataset in ZFS is the unit of
           | management for most properties and operations (snapshots,
           | clones, compression and checksum algorithms, quotas,
           | encryption, dedup, send/recv, ditto copies, etc) as well as
           | their inheritance and space accounting.
           | 
           | In terms of filesystem management, there aren't many
           | downsides to breaking up a filesystem (within reason), as
           | most properties and the most common operations can be shared
           | between all sub-filesystems if they are part of the same
           | inherited tree (which doesn't necessarily have to correspond
           | to the mountpoint tree!).
           | 
           | As far as I know, the major downsides by far were that 1) you
           | couldn't quickly move a file from one dataset to another,
           | i.e. `mv` would be forced to do a full copy of the file
           | contents rather than just do a cheap rename, and 2) in terms
           | of disk space, moving a file between filesystems would be
           | equivalent to copying the file and deleting the original,
           | which could be terrible if you use snapshots as it would lead
           | to an additional space consumption of a full new file's worth
           | of disk space.
           | 
           | In principle, both of these downsides should be fixed with
           | this new block cloning feature and AFAIU the only tradeoffs
           | would be some amount of increased overhead when freeing data
           | (which should be zero overhead if you don't have many of
           | these cloned blocks being shared anymore), and the low
           | maturity of this code (i.e. higher chance of running into
           | bugs) due to being so new.
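            | 
            | As a small sketch of that per-dataset granularity (names
            | are just examples), properties set on a parent are
            | inherited by children unless overridden:
            | 
            |     zfs create -o compression=zstd tank/home
            |     zfs create tank/home/projects   # inherits compression
            |     zfs set quota=50G tank/home/projects
            |     zfs snapshot -r tank/home@daily # snapshots the subtree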
        
       | dark-star wrote:
       | Wow, I was under the impression that this had long been
       | implemented already (as it's already in btrfs and other
       | commercial file systems)
       | 
       | Awesome!
        
         | mgerdts wrote:
         | It has been in the Solaris version of zfs for a long time as
         | well. This came a few years after the Oracle-imposed fork.
         | 
         | https://blogs.oracle.com/solaris/post/reflink3c-what-is-it-w...
        
       | Pxtl wrote:
       | Filesystem-level de-duplication is scary as hell as a concept,
       | but also sounds amazing, especially doing it at copy-time so you
       | don't have to opt-in to scanning to deduplicate. Is this common
       | in filesystems? Or is ZFS striking out new ground here? I'm not
       | really an under-the-OS-hood kinda guy.
        
         | yjftsjthsd-h wrote:
         | > Filesystem-level de-duplication is scary as hell as a concept
         | 
         | What's scary about it? You have to track references, but it
         | doesn't seem _that_ hard compared to everything else going on
         | in ZFS et al.
         | 
          | > Is this common in filesystems? Or is ZFS striking out new
          | ground here?
          | 
          | At least BTRFS does approximately the same.
        
           | cesarb wrote:
           | > > Filesystem-level de-duplication is scary as hell as a
           | concept
           | 
           | > What's scary about it?
           | 
           | It's scary because there's only one copy when you might have
           | expected two. A single bad block could lose both "copies" at
           | once.
        
             | phpisthebest wrote:
             | 3-2-1..
             | 
             | 3 Copies
             | 
             | 2 Media
             | 
             | 1 offsite.
             | 
              | If you follow that then you would have no fear of data
              | loss. If you are putting 2 copies on the same filesystem
              | you are already doing backups wrong.
        
             | magicalhippo wrote:
             | Besides what all the others have mentioned, you can force
             | ZFS to keep up to 3 copies of data blocks on a dataset. ZFS
             | uses this internally for important metadata and will try to
             | spread them around to maximize the chance of recovery,
             | though don't rely on this feature alone for redundancy.
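              | 
              | That's the per-dataset copies property (a sketch; note
              | it only guards against isolated bad blocks, not a whole
              | disk dying):
              | 
              |     zfs set copies=2 tank/important
              |     zfs get copies tank/important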
        
             | Filligree wrote:
             | Disks die all the time anyway. If you want to keep your
             | data, you should have at least two-disk redundancy. In
             | which case bad blocks won't kill anything.
        
             | ForkMeOnTinder wrote:
             | Copying a file isn't great protection against bad blocks.
             | Modern SSDs, when they fail, tend to fail catastrophically
             | (the whole device dies all at once), rather than dying one
             | block at a time. If you care about the data, back it up on
             | a separate piece of hardware.
        
             | grepfru_it wrote:
             | The file system metadata is redundant and on a correctly
             | configured ZFS system your error correction is isolated and
             | can be redundant as well
        
           | Pxtl wrote:
           | > What's scary about it?
           | 
           | Just that I'm trusting the OS to re-duplicate it at block
           | level on file write. The idea that block by block you've got
           | "okay, this block is shared by files XYZ, this next block is
           | unique to file Z, then the next block is back to XYZ... oh
           | we're editing that one? Then it's a new block that's now
           | unique to file Z too".
           | 
           | I guess I'm not used to trusting filesystems to do anything
           | but dumb write and read. I know they abstract away a crapload
           | of amazing complexity in reality, I'm just used to thinking
           | of them as dumb bags of bits.
        
             | Dylan16807 wrote:
             | If you're on ZFS you're probably using snapshots, so all
             | that work is already happening.
        
         | wrs wrote:
          | macOS and BTRFS have had it for several years. In fact I
          | believe it's the default behavior when copying a file in
          | macOS using the Finder (you have to specify `cp -c` in the
          | shell).
        
           | nijave wrote:
           | Windows Server has had dedupe for at least 10 years, too
        
       | lockhouse wrote:
        | Anyone here using ZFS in production these days? If so, what
        | OS and implementation? What have been your experiences, or
        | what gotchas have you run into?
        
         | Drybones wrote:
          | We use ZFS on every server we deploy.
          | 
          | We typically use Proxmox. It's a convenient node host setup,
          | usually has a very up-to-date ZFS, and it's stable.
          | 
          | I just wouldn't use the Proxmox web UI for ZFS configuration.
          | It doesn't have up-to-date options. Always configure ZFS on
          | the CLI.
        
         | shrubble wrote:
         | The gotcha on Proxmox is that you can't do swapfiles on ZFS, so
         | if your swap isn't made big enough when installing and you
         | format everything as ZFS you have to live with it or do odd
         | workarounds.
        
         | trws wrote:
          | I'm not a filesystem admin, but we at LLNL use OpenZFS as the
          | storage layer for all of our Lustre file systems in
          | production, including using raid-z for resilience in each
          | pool (on the order of 100 disks each), and have for most of a
          | decade. That, combined with improvements in Lustre, has taken
          | the rate of data loss or need to clear large-scale shared
          | file systems down to nearly zero. There's a reason we spend
          | as many engineer hours as we do maintaining it: it's worth
          | it.
         | 
         | LLNL openzfs project:
         | https://computing.llnl.gov/projects/openzfs Old presentation
         | from intel with info on what was one of our bigger deployments
         | in 2016 (~50pb):
         | https://www.intel.com/content/dam/www/public/us/en/documents...
        
           | muxator wrote:
            | If I'm not mistaken, the Linux port of ZFS that later
            | became OpenZFS started at LLNL and was a port from FreeBSD
            | (it may have been around release ~9).
            | 
            | I believe it was called ZFS On Linux or something like
            | that.
            | 
            | Nice how things have evolved: from FreeBSD to Linux and
            | back. In my mind this has always been a very inspiring
            | example of a public institution working for the public
            | good.
        
             | rincebrain wrote:
             | FreeBSD had its own ZFS port.
             | 
             | ZoL, if my ancient memory serves, was at LLNL, not based on
             | the FreeBSD port (if you go _very_ far back in the commit
             | history you can see Brian rebasing against OpenSolaris
             | revisions), but like 2 or 3 different orgs originally
             | announced Linux ports at the same time and then all pooled
             | together, since originally only one of the three was going
             | to have a POSIX layer (the other two didn't need a working
             | POSIX filesystem layer). (I'm not actually sure how much
             | came of this collaboration, I just remember being very
             | amused when within the span of a week or two, three
             | different orgs announced ports, looked at each other, and
             | went "...wait.")
             | 
             | Then for a while people developed on either the FreeBSD
             | port, the illumos fork called OpenZFS, or the Linux port,
             | but because (among other reasons) a bunch of development
             | kept happening on the Linux port, it became the defacto
             | upstream and got renamed "OpenZFS", and then FreeBSD more
             | or less got a fresh port from the OpenZFS codebase that is
             | now what it's based on.
             | 
             | The macOS port got a fresh sync against that codebase
             | recently and is slowly trying to merge in, and then from
             | there, ???
        
         | justinclift wrote:
          | TrueNAS (www.truenas.com) uses ZFS for the storage layer
          | across its product range (storage software).
          | 
          | They have both FreeBSD- and Linux-based offerings, targeting
          | different use cases.
        
         | quags wrote:
          | I have been using ZFS for years, since Ubuntu 18.04. Easy
          | snapshots, monitoring, a choice of RAID levels, and the
          | ability to very easily copy a dataset remotely with resume or
          | incremental support are awesome. I mainly use it for KVM
          | systems, each with their own dataset. Coming from mdadm + LVM
          | in my previous setup, it is night and day for doing snapshots
          | and backups. I do not use ZFS on root for Ubuntu; instead I
          | do a software RAID1 setup for the OS and then a ZFS setup on
          | other disks - ZFS on root was the only gotcha. For FreeBSD,
          | ZFS on root works fine.
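          | 
          | A sketch of that remote-copy workflow (pool and host names
          | are made up):
          | 
          |     zfs snapshot -r tank/vm@2023-07-04
          |     zfs send -R -i @2023-07-03 tank/vm@2023-07-04 | \
          |         ssh backup zfs receive -s -u backup/vm
          |     # if interrupted, fetch the resume token and continue:
          |     token=$(ssh backup zfs get -H -o value \
          |         receive_resume_token backup/vm)
          |     zfs send -t "$token" | ssh backup zfs receive -s backup/vm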
        
         | noaheverett wrote:
         | Running in production for about 3 years with Ubuntu 20.04 / zfs
         | 0.8.3. ZFS is being used as the datastore for a cluster of
         | LXD/LXC instances over multiple physical hosts. I have the OS
         | setup on its own dedicated drive and ZFS striped/cloned over 4
         | NVMe drives.
         | 
         | No gotchas / issues, works well, easy to setup.
         | 
         | I am looking forward to the Direct IO speed improvements for
         | NVMe drives with https://github.com/openzfs/zfs/pull/10018
         | 
         | edit: one thing I forgot to mention is, when creating your pool
         | make sure to import your drives by ID (zpool import -d
         | /dev/disk/by-id/ <poolname>) instead of name in case name
         | assignments change somehow [1]
         | 
         | [1] https://superuser.com/questions/1732532/zfs-disk-drive-
         | lette...
        
           | throw0101c wrote:
           | > _I am looking forward to the Direct IO speed improvements
            | for NVMe drives with
            | https://github.com/openzfs/zfs/pull/10018_
           | 
           | See also "Scaling ZFS for NVMe" by Allan Jude at EuroBSDcon
           | 2022:
           | 
           | * https://www.youtube.com/watch?v=v8sl8gj9UnA
        
             | noaheverett wrote:
             | Sweet, I appreciate the link!
        
         | gigatexal wrote:
          | Any of the enterprise customers of Klara Systems are likely
          | ZFS production folks.
         | 
         | https://klarasystems.com/?amp
        
         | drewg123 wrote:
         | We use ZFS in production for the non-content filesystems on a
         | large, ever increasing, percentage of our Netflix Open Connect
         | CDN nodes, replacing geom's gmirror. We had one gotcha (caught
         | on a very limited set of canaries) where a buggy bootloader
         | crashed part-way through boot, leaving the ZFS "bootonce" stuff
         | in a funky state requiring manual recovery (the nodes with
         | gmirror were fine, and fell back to the old image without
         | fuss). This has since been fixed.
         | 
         | Note that we _do not_ use ZFS for content, since it is
         | incompatible with efficient use of sendfile (both because there
         | is no async handler for ZFS, so no async sendfile, and because
         | the ARC is not integrated with the page cache, so content would
         | require an extra memory copy to be served).
        
           | lifty wrote:
           | what do you use for the content filesystem?
        
             | drewg123 wrote:
             | FreeBSD's UFS
        
           | throw0101c wrote:
           | Any use of boot environments for easy(er?) rollbacks of OS
           | updates?
        
             | drewg123 wrote:
             | Yes. That's the bootonce thing I was talking about. When we
             | update the OS, we set the "bootonce" flag via bectl
             | activate -t to ensure we fall back to the previous BE if
             | the current BE is borked and not bootable. This is the same
             | functionality we had by keeping a primary and secondary
              | root partition in geom and toggling the bootable
             | partition via the bootonce flag in gpart.
        
           | bakul wrote:
           | Is integrating ARC with the page cache a lost cause? If not,
           | may be Netflix can fund it!
        
             | postmodest wrote:
             | I would expect page cache to be one of those platform-
             | dependent things that prevent OpenZFS from doing that.
             | Especially on Linux, and especially because AFAIK the Linux
             | version has the most eyes on it.
        
               | rincebrain wrote:
               | My understanding, not having dug into it, is that it's
               | possible but just work nobody has done yet, though I'm
               | not sure what the relevant interfaces are in the Linux
               | kernel.
               | 
               | One thing that makes the interfaces in Linux much messier
               | than the FreeBSD ones is that a lot of the core
               | functionality you might like to leverage in Linux
               | (workqueues, basically anything more complicated than
               | just calling kmalloc, and of course any SIMD save/restore
               | state, to name three examples I've stumbled over
               | recently) are marked EXPORT_SYMBOL_GPL or just entirely
               | not exported in newer releases, so you get to reimplement
               | the wheel for those, whereas on FreeBSD it's trivial to
               | just use their implementations of such things and shim
               | them to the Solaris-ish interfaces the non-platform-
               | specific code expects.
               | 
               | So that makes the Linux-specific code a lot heavier,
               | because upstream is actively hostile.
        
               | Dylan16807 wrote:
               | > SIMD save/restore state
               | 
               | I wish someone would come in and convince the kernel devs
               | that "hey, if you want EXPORT_SYMBOL_GPL to have legal
               | weight in a copyleft sense then you can't just slap it
               | onto interfaces for political reasons"
        
               | rincebrain wrote:
               | I don't think they care about it having legal weight,
               | that ship sailed long ago when they started advocating
               | for just slapping SYMBOL_GPL on things out of spite; I
               | think they care about excluding people from using their
               | software.
               | 
               | IMO Linus should stop being half-and-half about it and
               | either mark everything SYMBOL_GPL and see how well that
               | goes or stop this nonsense.
        
               | Filligree wrote:
               | I just don't understand why they're so anti-ZFS. I want
               | my data to survive, please...
        
               | rincebrain wrote:
               | My impression is that some of the Linux kernel devs are
               | anti-anything that's not GPL-compatible, of any sort,
               | regardless of the particulars.
               | 
               | Linus himself also made remarks about ZFS at one point
               | that were pretty...hostile. [1] [2]
               | 
               | > The fact is, the whole point of the GPL is that you're
               | being "paid" in terms of tit-for-tat: we give source code
               | to you for free, but we want source code improvements
               | back. If you don't do that but instead say "I think this
               | is _legal_, but I'm not going to help you" you certainly
               | don't get any help from us.
               | 
               | > So things that are outside the kernel tree simply do
               | not matter to us. They get absolutely zero attention. We
               | simply don't care. It's that simple.
               | 
               | > And things that don't do that "give back" have no
               | business talking about us being assholes when we don't
               | care about them.
               | 
               | > See?
               | 
               | Note that there's at least one unfixed Linux kernel bug
               | that was found by OpenZFS users, reproducible without
               | using OpenZFS in any way, reported with a patch, and
               | ignored. [3]
               | 
               | So "not giving back" is a dubious claim.
               | 
               | [1] - https://arstechnica.com/gadgets/2020/01/linus-
               | torvalds-zfs-s...
               | 
               | [2] - https://www.realworldtech.com/forum/?threadid=18971
               | 1&curpost...
               | 
               | [3] - https://bugzilla.kernel.org/show_bug.cgi?id=212295
        
               | colonwqbang wrote:
               | Why don't you think it has legal weight? Or did you mean
               | something else?
               | 
                | As far as I know, the point of EXPORT_SYMBOL_GPL was to
               | back on companies like Nvidia who wanted to exploit
               | loopholes in the GPL. That seems to me like a reasonable
               | objective.
               | 
               | Relevant Torvalds quote:
               | https://yarchive.net/comp/linux/export_symbol_gpl.html
        
               | rincebrain wrote:
               | Sure, and that alone isn't an unreasonable premise - as
               | he says, intent matters.
               | 
               | But if you're marking interfaces as GPL-only, or
               | implementing taint detection that means if you use a non-
               | SYMBOL_GPL kernel symbol which calls a GPL-only function
               | it treats the non-SYMBOL_GPL symbol as GPL-only and
               | blocks your linking, it gets a bit out of hand.
               | 
               | Building the kernel with certain kernel options makes
               | modules like OpenZFS or OpenAFS not link because of that
               | taint propagation - because things like the lockdep
               | checker turn uninfringing calls into infringing ones.
               | 
               | Or a little while ago, there was a change which broke
               | building on PPC because a change made a non-SYMBOL_GPL
               | call on POWER into a SYMBOL_GPL one indirectly, and when
               | the original author was contacted, he sent a patch
               | reverting the changed symbol, and GregKH refused to pull
               | it into stable, suggesting distros could carry it if they
               | wanted to. (Of course, he had happily merged a change
               | into -stable earlier that just implemented more
               | aggressive GPL tainting and thereby broke things like the
               | aforementioned...)
        
               | PlutoIsAPlanet wrote:
                | The Linux kernel has never supported out-of-tree
                | modules in the way ZFS is maintained out of tree.
                | 
                | All ZFS needs is for one of Oracle's many lawyers to
                | say "CDDL is compatible with GPL". Yet Oracle doesn't.
        
               | rincebrain wrote:
               | "All."
               | 
               | It's explicitly not compatible with GPL, though. It has
               | clauses that are more restrictive than GPL, and IIRC some
               | people who contributed to the OpenZFS project did so
               | explicitly without allowing later CDDL license revisions,
               | which removes Oracle's ability to say CDDL-2 or whatever
               | is GPL-compatible.
               | 
               | So even if someone rolled up dumptrucks of cash and
               | convinced Oracle that everything was great, they don't
               | have all the control needed to do that.
        
               | Dylan16807 wrote:
               | To have legal weight, it has to be a signal that you're
               | implementing something that is derivative of kernel code.
               | That's the directly stated intent of EXPORT_SYMBOL_GPL.
               | 
               | But "call an opaque function that saves SIMD state" is
               | obviously not derivative of the kernel code in any way.
               | The more exports that get badly marked this way, the more
               | EXPORT_SYMBOL_GPL becomes indistinguishable from
               | EXPORT_SYMBOL.
        
               | colonwqbang wrote:
               | I see it as just a kind of "warranty void if seal
               | broken". Don't do this or you _may_ be in violation of
                | the GPL. Maybe a court in $country would find in
                | your favour (I'm not convinced it's as clear cut as you
               | imply). Maybe they would find that you willfully
               | infringed, despite the kernel devs clearly warning you
               | not to do it.
               | 
               | The main "legal effect" I see is that you are not willing
               | to take that risk, just like Oracle isn't.
        
               | bakul wrote:
               | I suspect the underlying issues for not unifying the two
               | have more to do with the ZFS design than anything to do
               | with Linux. It may be the codebase is far too large at
               | this stage to make such a fundamental change.
        
               | rincebrain wrote:
               | I don't think so. The memory management stuff is pretty
               | well abstracted; on FBSD it just glues into UMA pretty
               | transparently, it's just on Linux there's a lot of
                | machinery for implementing our own little cache
                | allocator, because Linux's kernel cache allocator is very
               | limited in what sizes it will give you, and sometimes ZFS
               | wants 16M (not necessarily contiguous) regions because
               | someone said they wanted 16M records.
               | 
               | The ZoL project lead said at one point there were a
               | variety of reasons this wasn't initially done for the
               | Linux integration [1], but that it was worth taking
               | another look at since that was a decade ago now. Having
               | looked at the Linux memory subsystems recently for
               | various reasons, I would suspect the limiting factor is
               | that almost all the Linux memory management functions
               | that involve details beyond "give me X pages" are
               | SYMBOL_GPL, so I suspect we couldn't access whatever
               | functionality would be needed to do this.
               | 
               | I could be wrong, though, as I wasn't looking at the code
               | for that specific purpose, so I might have missed
               | functionality that would provide this.
               | 
               | [1] - https://github.com/openzfs/zfs/issues/10255#issueco
               | mment-620...
        
               | bakul wrote:
                | Behlendorf's comment in that thread seems to be talking
                | about Linux integration. My point was that this is an
                | older issue, going back to the Sun days. See for
                | instance this thread where McVoy complains about the
                | same issue:
                | https://www.tuhs.org/pipermail/tuhs/2021-February/023013.htm...
        
               | rincebrain wrote:
               | That seems more like it's complaining about it not being
               | the actual page cache, not it not being counted as
               | "cache", which is a larger set in at least Linux than
               | just the page cache itself.
               | 
               | But sure, it's certainly an older issue, and given that
               | the ABD rework happened, I wouldn't put anything past
               | being "feasible" if the benefits were great enough.
               | 
               | (Look at the O_DIRECT zvol rework stuff that's pending (I
               | believe not merged) for how a more cut-through memory
               | model could be done, though that has all the tradeoffs
               | you might expect of skipping the abstractions ZFS uses to
               | minimize the ability of applications to poke holes in the
               | abstraction model and violate consistency, I believe...)
        
               | the8472 wrote:
               | Could the linux integration use dax[0] to bypass the page
               | cache and go straight to ARC?
               | 
               | [0] https://www.kernel.org/doc/Documentation/filesystems/
               | dax.txt
        
           | gigatexal wrote:
           | This is amazing. A detail of Netflix that, I a plebe,
           | wouldn't know if not for this site.
        
             | ComputerGuru wrote:
             | Actually, Drew's presentations about Netflix, FreeBSD, ZFS,
             | saturating high-bandwidth network adapters, etc. are
             | legendary and have been posted far and wide. But having him
             | available to answer questions on HN just takes it to a
             | whole 'nother level.
        
               | drewg123 wrote:
               | You're making me blush.. But, to set the record straight:
               | I actually know very little about ZFS, beyond basic
               | user/admin knowledge (from having run it for ~15 years).
               | I've never spoken about it, and other members of the team
               | I work for at Netflix are far more knowledgeable about
               | ZFS, and are the ones who have managed the conversion of
               | our fleet to ZFS for non-content partitions.
        
               | gigatexal wrote:
                | Have they ever blogged or spoken at conferences about
                | it? I soak up all that content -- at least I try to.
        
               | ComputerGuru wrote:
               | I've devoured your FreeBSD networking presentations but I
               | guess I must have confused a post about tracking down a
               | ZFS bug in production written by someone else with all
               | the other content you've produced.
               | 
                | Back to the topic at hand, it's actually scary how little
                | software exposes control over whether or not sendfile is
               | used, assuming support is only a matter of OS and kernel
               | version but not taking into account filesystem
               | limitations. I ran into a terrible Samba on FreeBSD bug
               | (shares remotely disconnected and connections reset with
               | moderate levels of concurrent ro access from even a
               | single client) that I ultimately tracked down to sendfile
               | being enabled in the (default?) config - so it wasn't
               | just the expected "performance requirements not being
               | met" with sendfile on ZFS but even other reliability
               | issues (almost certainly exposing a different underlying
                | bug, tbh). Imagine if Samba didn't have a tunable to
               | set/override sendfile support, though.
        
           | xmodem wrote:
           | If you can share, what type of non-content data do the nodes
           | store? Is this just OS+application+logs?
        
         | nightfly wrote:
         | Yes, Ubuntu 20.04 and 22.04. But we've been running ZFS in some
         | form or other for 10+ years. ACL support not as good/easy to
         | use as Solaris/FreeBSD. Not having weird pathological
         | performance issues with kernel memory allocation like we had
         | with FreeBSD though. Sometimes we have issues with automatic
         | pool import on boot, so that's something to be careful with.
         | The tooling is great though, and we've never had catastrophic
         | failure that was due to ZFS, only due to failing hardware.
        
         | DvdGiessen wrote:
         | In production on SmartOS (illumos) servers running applications
          | and VMs, on TrueNAS and plain FreeBSD for various storage and
         | backups, and on a few Linux-based workstations. Using mirrors
         | and raidz2 depending on the needs of the machines.
         | 
         | We've successfully survived numerous disk failures (a broken
          | batch of HDDs giving all kinds of small read errors, an SSD
         | that completely failed and disappeared, etc), and were in most
         | cases able to replace them without a second of downtime (would
         | have been all cases if not for disks placed in hard-to-reach
         | places, now only a few minutes downtime to physically swap the
         | disk).
         | 
         | Snapshots work perfectly as well. Systems are set up to
         | automatically make snapshots using [1], on boot, on a timer,
         | and right before potentially dangerous operations such as
         | package manager commands as well. I've rolled back after
         | botched OS updates without problems; after a reboot the machine
          | was back in its old state. Also rolled back a live system a
         | few times after a broken package update, restoring the
         | filesystem state without any issues. Easily accessing old
         | versions of a file is an added bonus which has been helpful a
         | few times.
         | 
         | Send/receive is ideal for backups. We are able to send
         | snapshots between machines, even across different OSes, without
         | issues. We've also moved entire pools from one OS to another
         | without problems.
         | 
         | Knowing we have automatic snapshots and external backups
         | configured also allows me to be very liberal with giving root
         | access to inexperienced people to various (non-critical)
         | machines, knowing that if anything breaks it will always be
         | easy to roll back, and encouraging them to learn by
         | experimenting a bit, to the point where we can even diff
         | between snapshots to inspect what changed and learn from that.
         | 
         | Biggest gotchas so far have been on my personal Arch Linux
         | setup, where the out-of-tree nature of ZFS has caused some
          | issues like an incompatible kernel being installed, the ZFS
         | module failing to compile, and my workstation subsequently
         | being unable to boot. But even that was solved by my entire
         | system running on ZFS: a single rollback from my bootloader [2]
         | and all was back the way it was before.
         | 
         | Having good tooling set up definitely helped a lot. My monkey
         | brain has the tendency to think "surely I got it right this
         | time, so no need to make a snapshot before trying out X!",
         | especially when experimenting on my own workstation. Automating
         | snapshots using a systemd timer and hooks added to my package
         | manager saved me a number of times.
         | 
         | [1]: https://github.com/psy0rz/zfs_autobackup [2]:
         | https://zfsbootmenu.org/
        
         | enneff wrote:
         | I use ZFS on Debian for my home file server. The setup is just
         | a tiny NUC with a couple of large USB hard drives, mirrored
         | with ZFS. I've had drives fail and painlessly replaced and
         | resilvered them. This is easily the most hassle free file
         | storage setup I've owned; been going strong over 10 years now
         | with little to no maintenance.
        
         | crest wrote:
         | I use it as my default file system on FreeBSD. It was rough in
         | FreeBSD 7.x (around 2009), but starting with FreeBSD 8.x it has
         | been rock solid to this day. The only gotcha (which the
         | documentation warns about) has been that automatic block level
         | deduplication is only useful in a few special applications and
         | has a large main memory overhead unless you can accept terrible
         | performance for normal operations (e.g. a bandwidth limited
         | offsite backup).
        
         | yjftsjthsd-h wrote:
         | Sure; we get good mileage out of compression and snapshots
         | (well, mostly send-recv for moving data around rather than
         | snapshots in-place). I think the only problems have been very
         | specific to our install process (non-standard kernel in the
         | live environment; if we used the normal distro install process
         | it would be fine).
        
         | mattjaynes wrote:
         | ZFS on Linux has improved a lot in the last few years. We
         | (prematurely) moved to using it in production for our MySQL
         | data about 5 years ago and initially it was a nightmare due to
         | unexplained stalling which would hang MySQL for 15-30 minutes
         | at random times. I'm sure it shortened my life a few years
         | trying to figure out what was wrong when everything was on
         | fire. Fortunately, they have resolved those issues in the
         | subsequent releases and it's been much more pleasant after
         | that.
        
         | SkyMarshal wrote:
         | Not in production, but using ZoL on my personal workstations.
         | https://zfsonlinux.org/
         | 
         | Some discussion:
         | https://www.reddit.com/r/NixOS/comments/ops0n0/big_shoutout_...
        
         | szundi wrote:
         | Yes and it is awesome, no issues.
        
         | unixhero wrote:
          | Yes. Using the latest ZFS On Linux distrib on Debian. Using
          | Proxmox. Never had any problems ever, ever.
        
         | benlivengood wrote:
         | "production" at home on Debian 11, previously on FreeBSD 10-13.
         | The weirdest gotcha has been related to sending encrypted raw
          | snapshots to remote machines[0],[1]. These have been the first
          | instabilities I've had with ZFS in roughly 15 years around the
          | filesystem, and they only appeared after switching to native
          | encryption this year.
         | Native encryption seems to be barely stable for production use;
         | no actual data corruption but automatic synchronization (I use
         | znapzend) was breaking frequently. Recent kernel updates fixed
         | my problem although some of the bug reports are still open. I
         | only moved on from FreeBSD because of more familiarity with
         | Linux.
         | 
         | A slightly annoying property of snapshots and clones is the
         | inability to fully re-root a tree of snapshots, e.g.
         | permanently split a clone from its original source and allow
         | first-class send/receive from that clone. The snapshot which
         | originated the clone needs to stick around forever[2]. This
          | prevents a typical virtual machine imaging process of keeping a
         | base image up to date over time that VMs can be cloned from
         | when desired and eventually removing the storage used by the
         | original base image after e.g. several OS upgrades.
         | 
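          | Concretely (a sketch with made-up names), even `zfs promote`
          | only swaps which dataset owns the origin snapshot; it does
          | not let you delete the shared history:
          | 
          |     zfs snapshot tank/base@golden
          |     zfs clone tank/base@golden tank/vm1
          |     zfs promote tank/vm1
          |     # the @golden snapshot now lives on tank/vm1, but it
          |     # (and the blocks it pins) still can't be destroyed
          |     # while tank/base depends on it
          | 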
         | I don't have any big performance requirements and most file
         | storage is throughput based on spinning disks which can easily
         | saturate the gigabit network.
         | 
         | I also use ZFS on my laptop's SSD under Ubuntu with about 1GB/s
         | performance and no shortage of IOPS and the ability to send
         | snapshots off to the backup system which is pretty nice. Ubuntu
         | is going backwards on support for ZFS and native encryption
         | uses a hacky intermediate key under LUKS, but it works.
         | 
         | [0] https://github.com/openzfs/zfs/issues/12014 [1]
         | https://github.com/openzfs/zfs/issues/12594
          | [2] https://serverfault.com/questions/265779/split-a-zfs-clone
        
       | albertzeyer wrote:
       | Also see this issue: https://github.com/openzfs/zfs/issues/405
       | 
       | > It is in FreeBSD main branch now, but disabled by default just
       | to be safe till after 14.0 released, where it will be included.
       | Can be enabled with loader tunable there.
       | 
       | > more code is needed on the ZFS side for Linux integration. A
       | few people are looking at it AFAIK.
        
       | vlovich123 wrote:
       | Do Btrfs or ext4 offer this?
        
         | thrtythreeforty wrote:
         | Btrfs yes, ext4 no (but I believe xfs does).
         | 
         | This should end up being exposed through cp --reflink=always,
         | so you could look up filesystem support for that.
        
           | danudey wrote:
           | XFS does, I've used it for specifically this feature before.
        
         | wtallis wrote:
         | This feature is basically the same as what underpins the
         | reflink feature that btrfs has supported approximately forever
         | and xfs has supported for at least several years.
        
           | mustache_kimono wrote:
           | Does anyone know whether btrfs or XFS support reflinks from
           | snapshot datasets?
        
             | Dylan16807 wrote:
             | I can confirm BTRFS yes, but note that source and
             | destination need to be on the same mount point before
             | kernel 5.18
        
             | ComputerGuru wrote:
             | XFS doesn't have native snapshot support, though?
        
             | danudey wrote:
             | XFS doesn't have snapshot support, so the short answer
             | there is no.
        
               | mustache_kimono wrote:
               | Shows what I know about XFS. Thanks!
        
               | PlutoIsAPlanet wrote:
                | You can get pseudo-snapshots on XFS with a tool like
                | https://github.com/aravindavk/reflink-snapshot
                | 
                | But it still has to duplicate metadata, which depending
                | on the number of files may cause inconsistency in the
                | snapshot.
        
               | plq wrote:
               | This is only a tangent given we are talking about
               | snapshots and reflink, but just wanted to mention that
               | LVM has snapshots, so if you need XFS snapshots, create
               | the XFS filesystem on top of an LVM logical volume.
        
         | dsr_ wrote:
         | You can get a similar effect on top of any file system that
         | supports hard links with rdfind ( https://rdfind.pauldreik.se/
         | ) -- but it's pretty slow.
         | 
         | The Arch wiki says:
         | 
         | "Tools dedicated to deduplicate a Btrfs formatted partition
         | include duperemove, bees, bedup and btrfs-dedup. One may also
         | want to merely deduplicate data on a file based level instead
         | using e.g. rmlint, jdupes or dduper-git. For an overview of
         | available features of those programs and additional
         | information, have a look at the upstream Wiki entry.
         | 
         | Furthermore, Btrfs developers are working on inband (also known
         | as synchronous or inline) deduplication, meaning deduplication
         | done when writing new data to the filesystem. Currently, it is
         | still an experiment which is developed out-of-tree. Users
         | willing to test the new feature should read the appropriate
         | kernel wiki page."
        
           | someplaceguy wrote:
           | > You can get a similar effect on top of any file system that
           | supports hard links with rdfind (
           | https://rdfind.pauldreik.se/ ) -- but it's pretty slow.
           | 
           | It's a similar effect only if you don't modify the files, I
           | think.
           | 
           | If you "clone" a file with a hard link and you modify the
           | contents of one copy, the other copy would also be equally
           | modified.
           | 
           | As far as I understand this wouldn't happen with this type of
           | block cloning: each copy of the file would be completely
           | separate, except that they may (transparently) share data
           | blocks on disk.
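            | 
            | A quick way to see the difference (a sketch; it needs a
            | filesystem where cp --reflink actually works):
            | 
            |     echo hello > a
            |     ln a a.hard           # hard link: same inode
            |     cp --reflink=always a a.clone
            |     echo changed >> a
            |     cat a.hard            # shows the appended change too
            |     cat a.clone           # still just "hello"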
        
       ___________________________________________________________________
       (page generated 2023-07-04 23:00 UTC)