[HN Gopher] ZFS 2.2.0 (RC): Block Cloning merged ___________________________________________________________________ ZFS 2.2.0 (RC): Block Cloning merged Author : turrini Score : 176 points Date : 2023-07-04 15:46 UTC (7 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | miohtama wrote: | What are applications that benefit from block cloning? | aardvark179 wrote: | It can be a really convenient way to snapshot something if you | can arrange some point at which everything is synced to disk. | Get to that point, make your new files that start sharing all | their blocks, and then let your main db process (or whatever) | continue on as normal. | ikiris wrote: | I think the big piece is native overlayfs so k8 setups get a | bit simpler. | philsnow wrote: | It seems kind of like hard linking but with copy-on-write for | the underlying data, so you'll get near-instant file copies and | writing into the middle of them will also be near-instant. | | All of this happens under the covers already if you have dedup | turned on, but this allows utilities (gnu cp might be taught to | opportunistically and transparently use the new clone zfs | syscalls, because there is no downside and only upside) and | applications to tell zfs that "these blocks are going to be the | same as those" without zfs needing to hash all the new blocks | and compare them. | | Aditionally, for finer control, ranges of blocks can be cloned, | not just entire files. | | I can't tell from the github issue, can this manual dedup / | block cloning be turned on if you're not already using dedup on | a dataset? Last time I set up zfs, I was warned that dedup took | gobs of memory, so I didn't turn it on. | rincebrain wrote: | It's orthogonal to dedup being on or off, and as someone else | said, it's more or less the same underlying semantics you | would expect from cp --reflink anywhere. | | Also, as mentioned, on Linux, it's not wired up with any | interface to be used at all right now. | nabla9 wrote: | Gnu cp --reflink. | | >When --reflink[=always] is specified, perform a lightweight | copy, where the data blocks are copied only when modified. If | this is not possible the copy fails, or if --reflink=auto is | specified, fall back to a standard copy. Use --reflink=never | to ensure a standard copy is performed." | danudey wrote: | As others have said: block cloning (the underlying technology | that enables copy-on-write) allows you to 'copy' a file without | reading all of the data and re-writing it. | | For example, if you have a 1 GB file and you want to make a | copy of it, you need to read the whole file (all at once or in | parts) and then write the whole new file (all at once or in | parts). This results in 1 GB of reads and 1 GB of writes. | Obviously the slower (or more overloaded) your storage media | is, the longer this takes. | | With block cloning, you simply tell the OS "I want this file A | to be a copy of this file B" and it creates a new "file" that | references all the blocks in the old "file". Given that a | "file" on a filesystem is just a list of blocks that make up | the data in that file, you can create a new "file" which has | pointers to the same blocks as the old "file". This is a simple | system call (or a few system calls), and as such isn't much | more intensive than simply renaming a file instead of copying | it. | | At my previous job we did builds for our software. This | required building the BIOS, kernel, userspace, generating the | UI, and so on. 
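|
|       (As a concrete illustration of the cp --reflink behaviour described
|       above -- this assumes a filesystem that already supports reflinks,
|       e.g. btrfs or XFS today, or ZFS once 2.2 plus the Linux plumbing
|       lands; the /tank path and file names are made up:)
|
|           $ dd if=/dev/urandom of=/tank/big.img bs=1M count=1024
|           $ df -h /tank                    # note the free space
|           $ time cp --reflink=always /tank/big.img /tank/copy.img
|                                            # returns near-instantly
|           $ df -h /tank                    # free space essentially unchanged
|           $ echo tweak >> /tank/copy.img   # copy-on-write: only the changed
|                                            # blocks get new storage
|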
These builds required pulling down 10+ GB of git | repositories (the git data itself, the checkout, the LFS binary | files, external vendor SDKs), and then a large amount of build | artifacts on top of that. We also needed to do this build for | 80-100 different product models, for both release and debug | versions. This meant 200+ copies of the source code alone (not | to mention build artifacts and intermediate products), and | because of disk space limitations this meant we had to | dramatically reduce the number of concurrent builds we could | run. The solution we came up with was something like: | | 1. Check out the source code | | 2. Create an overlayfs filesystem to mount into each build | space | | 3. Do the build | | 4. Tear down the overlayfs filesystem | | This was problematic if we weren't able to mount the | filesystem, if we weren't able to unmount the filesystem | (because of hanging file descriptors or processes), and so on. | Lots of moving parts, lots of `sudo` commands in the scripts, | and so on. | | Copy-on-write would have solved this for us by accomplishing | the same thing; we could simply do the following: | | 1. Check out the source code | | 2. Have each build process simply `cp -R --reflink=always | source/ build_root/`; this would be instantaneous and use no | new disk space. | | 3. Do the build | | 4. `rm -rf build_root` | | Fewer moving parts, no root access required, generally simpler | all around. | thrill wrote: | FTFA: "Block Cloning allows to clone a file (or a subset of its | blocks) into another (or the same) file by just creating | additional references to the data blocks without copying the | data itself. Block Cloning can be described as a fast, manual | deduplication." | the8472 wrote: | Any copy command. On-demand deduplication managed by userspace. | | https://man7.org/linux/man-pages/man2/ioctl_fideduperange.2.... | https://man7.org/linux/man-pages/man2/copy_file_range.2.html | https://github.com/markfasheh/duperemove | vovin wrote: | This is huge. One practical application is fast recovery of a | file from a past snapshot without using any additional space. I | use a ZFS dataset for my vCenter datastore (storing my vmdk | files). If I need to launch a clone from a past state, I can | use block cloning to bring back the past vmdk file without | actually copying it - saving both space and time when making | such a clone. | bithavoc wrote: | Can you elaborate a bit more on how you use ZFS with vCenter? | How do you mount it? | mustache_kimono wrote: | Excited, because in addition to ref copies/clones, httm will | use this feature, if available (I've already done some work to | implement it), for its `--roll-forward` operation, and for | faster file recoveries from snapshots [0]. | | As I understand it, there will be no need to copy any data from | the same dataset, and _this includes all snapshots_. Blocks | written to the live dataset can just be references to the | underlying blocks, and no additional space will need to be | used. | | Imagine being able to continuously switch a file or a dataset | back to a previous state extremely quickly without a | heavyweight clone, or a rollback, etc. | | Right now, httm simply diff copies the blocks for file recovery | and roll-forward. For further details, see the man page entry | for `--roll-forward`, and the link to the httm GitHub below: | --roll-forward="snap_name" traditionally 'zfs | rollback' is a destructive operation, whereas httm roll-forward | is non-destructive.
httm will copy only the blocks and file | metadata that have changed since a specified snapshot, from | that snapshot, to its live dataset. httm will also take two | precautionary snapshots, one before and one after the copy. | Should the roll forward fail for any reason, httm will roll | back to the pre-execution state. Note: This is a ZFS only | option which requires super user privileges. | | [0]: https://github.com/kimono-koans/httm | rossmohax wrote: | Does ZFS or any other FS offer special operations which DB engine | like RocksDB, SQLite or PostgreSQL could benefit from if they | decided to target that FS specifically? | magicalhippo wrote: | Internally, ZFS is kinda like an object store[1], and there was | a project trying to expose the ZFS internals through an object | store API rather than through a filesystem API. | | Sadly I can't seem to find the presentation or recall the name | of the project. | | On the other hand, looking at for example RocksDB[2]: | | _File system operations are not atomic, and are susceptible to | inconsistencies in the event of system failure. Even with | journaling turned on, file systems do not guarantee consistency | on unclean restart. POSIX file system does not support atomic | batching of operations either. Hence, it is not possible to | rely on metadata embedded in RocksDB datastore files to | reconstruct the last consistent state of the RocksDB on | restart. RocksDB has a built-in mechanism to overcome these | limitations of POSIX file system [...]_ | | ZFS _does_ provide atomic operations internally[1], so if | exposed it seems something like RocksDB could take advantage of | that and forego all the complexity mentioned above. | | How much that would help I don't know though, but seems | potentially interesting at first glance. | | [1]: https://youtu.be/MsY-BafQgj4?t=442 | | [2]: https://github.com/facebook/rocksdb/wiki/MANIFEST | ludde wrote: | A M A Z I N G | | Have been looking forward to this for years! | | This is so much better than automatically doing dedup and the RAM | overhead that entails. | | Doing offline/RAM+in memory dedup size optimizations seem like a | really good optimization path. In the spirit of also paying only | what you use and not the rest. | | Edit: What's the RAM overhead of this? Is it ~64B per 128kB | deduped block or what's the magnitude of things? | mlyle wrote: | > Edit: What's the RAM overhead of this? Is it ~64B per 128kB | deduped block or what's the magnitude of things? | | No real memory impact. There's a regions table that uses 128k | of memory per terabyte of total storage (and may be a bit more | in the future). So for your 10 petabyte pool using deduping, | you'd better have an extra gigabyte of RAM. | | But erasing files can potentially be twice as expensive in | IOPS, even if not deduped. They try to prevent this. | uvatbc wrote: | Technically, yes: through the use of Truenas that gives us API | access to iscsi on ZFS. | GauntletWizard wrote: | > Note: currently it is not possible to clone blocks between | encrypted datasets, even if those datasets use the same | encryption key (this includes snapshots of encrypted datasets). | Cloning blocks between datasets that use the same keys should be | possible and should be implemented in the future. | | Once this is ready, I am going to subdivide my user homedir much | more than it already is. 
The biggest obstacle in the way of this | has been that it would waste a bunch of space until the snapshots | were done rolling over, which for me is a long time (I keep | weekly snapshots of my homedir for a year). | yjftsjthsd-h wrote: | Is there a benefit to breaking up your home directory? | GauntletWizard wrote: | Controlling the rate and location of snapshots, mostly. I've | broken out some kinds of datasets (video archives) but not | others historically (music). It doesn't matter that much, but | I want to split some more chunks out. | yjftsjthsd-h wrote: | Fair enough. I've personally slowly moved to a smaller | number of filesystems, but if you're actually handling | snapshots differently per-area then it makes sense (indeed, | one of the reasons I'm consolidating is the realization | that _personally_ I 'm almost never going to | snapshot/restore things separately). | someplaceguy wrote: | Not sure why you'd want to do that to your home directory | usually, but it depends on what you store in it and how you | use it, really. | | In general, breaking up a filesystem into multiple ones in | ZFS is mostly useful for making filesystem management more | fine-grained, as a filesystem/dataset in ZFS is the unit of | management for most properties and operations (snapshots, | clones, compression and checksum algorithms, quotas, | encryption, dedup, send/recv, ditto copies, etc) as well as | their inheritance and space accounting. | | In terms of filesystem management, there aren't many | downsides to breaking up a filesystem (within reason), as | most properties and the most common operations can be shared | between all sub-filesystems if they are part of the same | inherited tree (which doesn't necessarily have to correspond | to the mountpoint tree!). | | As far as I know, the major downsides by far were that 1) you | couldn't quickly move a file from one dataset to another, | i.e. `mv` would be forced to do a full copy of the file | contents rather than just do a cheap rename, and 2) in terms | of disk space, moving a file between filesystems would be | equivalent to copying the file and deleting the original, | which could be terrible if you use snapshots as it would lead | to an additional space consumption of a full new file's worth | of disk space. | | In principle, both of these downsides should be fixed with | this new block cloning feature and AFAIU the only tradeoffs | would be some amount of increased overhead when freeing data | (which should be zero overhead if you don't have many of | these cloned blocks being shared anymore), and the low | maturity of this code (i.e. higher chance of running into | bugs) due to being so new. | dark-star wrote: | Wow, I was under the impression that this had long been | implemented already (as it's already in btrfs and other | commercial file systems) | | Awesome! | mgerdts wrote: | It has been in the Solaris version of zfs for a long time as | well. This came a few years after the Oracle-imposed fork. | | https://blogs.oracle.com/solaris/post/reflink3c-what-is-it-w... | Pxtl wrote: | Filesystem-level de-duplication is scary as hell as a concept, | but also sounds amazing, especially doing it at copy-time so you | don't have to opt-in to scanning to deduplicate. Is this common | in filesystems? Or is ZFS striking out new ground here? I'm not | really an under-the-OS-hood kinda guy. | yjftsjthsd-h wrote: | > Filesystem-level de-duplication is scary as hell as a concept | | What's scary about it? 
You have to track references, but it | doesn't seem _that_ hard compared to everything else going on | in ZFS et al. | | > Is this common in filesystems? Or is ZFS striking out new | ground here? At least BTRFS does approximately the same. | cesarb wrote: | > > Filesystem-level de-duplication is scary as hell as a | concept | | > What's scary about it? | | It's scary because there's only one copy when you might have | expected two. A single bad block could lose both "copies" at | once. | phpisthebest wrote: | 3-2-1.. | | 3 Copies | | 2 Media | | 1 offsite. | | If you follow that then you would have no fear of data | loss. if you are putting 2 copies on the same filesystem | you are already doing backups wrongs | magicalhippo wrote: | Besides what all the others have mentioned, you can force | ZFS to keep up to 3 copies of data blocks on a dataset. ZFS | uses this internally for important metadata and will try to | spread them around to maximize the chance of recovery, | though don't rely on this feature alone for redundancy. | Filligree wrote: | Disks die all the time anyway. If you want to keep your | data, you should have at least two-disk redundancy. In | which case bad blocks won't kill anything. | ForkMeOnTinder wrote: | Copying a file isn't great protection against bad blocks. | Modern SSDs, when they fail, tend to fail catastrophically | (the whole device dies all at once), rather than dying one | block at a time. If you care about the data, back it up on | a separate piece of hardware. | grepfru_it wrote: | The file system metadata is redundant and on a correctly | configured ZFS system your error correction is isolated and | can be redundant as well | Pxtl wrote: | > What's scary about it? | | Just that I'm trusting the OS to re-duplicate it at block | level on file write. The idea that block by block you've got | "okay, this block is shared by files XYZ, this next block is | unique to file Z, then the next block is back to XYZ... oh | we're editing that one? Then it's a new block that's now | unique to file Z too". | | I guess I'm not used to trusting filesystems to do anything | but dumb write and read. I know they abstract away a crapload | of amazing complexity in reality, I'm just used to thinking | of them as dumb bags of bits. | Dylan16807 wrote: | If you're on ZFS you're probably using snapshots, so all | that work is already happening. | wrs wrote: | MacOS and BTRFS have had it for several years. In fact I | believe it's the default behavior when copying a file in MacOS | using the Finder (you have to specify `cp -c` in shell). | nijave wrote: | Windows Server has had dedupe for at least 10 years, too | lockhouse wrote: | Anyone here using ZFS in production these days? If so what OS and | implementation? What have been your experiences or gotchas you | experienced? | Drybones wrote: | We use ZFS on every server we deploy | | We typically use Proxmox. It's a convenient node host setup and | usually has a very up to date zfs and it's stable | | I just wouldn't use the Proxmox web ui for zfs configuration. | It doesn't have up to date options. Always configure zfs on the | cli | shrubble wrote: | The gotcha on Proxmox is that you can't do swapfiles on ZFS, so | if your swap isn't made big enough when installing and you | format everything as ZFS you have to live with it or do odd | workarounds. 
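|
| (For reference, the workaround people usually reach for is a swap
| zvol, roughly along the lines of the old ZFS-on-Linux FAQ -- the pool
| name below is made up, and note that swap on a zvol has a history of
| deadlocking under heavy memory pressure, so treat it as a last
| resort:)
|
|     zfs create -V 8G -b $(getconf PAGESIZE) \
|         -o compression=zle -o logbias=throughput -o sync=always \
|         -o primarycache=metadata -o secondarycache=none rpool/swap
|     mkswap -f /dev/zvol/rpool/swap
|     swapon /dev/zvol/rpool/swap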
| trws wrote: | I'm not a filesystem admin, but we at LLNL use OpenZFS as the | storage layer for all of our Lustre file systems in production, | including using raid-z for resilience in each pool (order 100 | disks each), and have for most of a decade. That combined with | improvements in Lustre have taken the rate of data loss or need | to clear large scale shared file systems down to nearly zero. | There's a reason we spend as many engineer hours as we do | maintaining it, it's worth it. | | LLNL openzfs project: | https://computing.llnl.gov/projects/openzfs Old presentation | from intel with info on what was one of our bigger deployments | in 2016 (~50pb): | https://www.intel.com/content/dam/www/public/us/en/documents... | muxator wrote: | If I'm not mistaken the linux port of ZFS that later became | OpenZFS started at LLNL and was a port from FreeBSD (it may | have been release ~9). | | I believe it was called ZFS On Linux or something like that. | | Nice how things have evolved: from FreeBSD to linux and back. | In my mind this has always been a very inspiring example of a | public institution working for the public good. | rincebrain wrote: | FreeBSD had its own ZFS port. | | ZoL, if my ancient memory serves, was at LLNL, not based on | the FreeBSD port (if you go _very_ far back in the commit | history you can see Brian rebasing against OpenSolaris | revisions), but like 2 or 3 different orgs originally | announced Linux ports at the same time and then all pooled | together, since originally only one of the three was going | to have a POSIX layer (the other two didn't need a working | POSIX filesystem layer). (I'm not actually sure how much | came of this collaboration, I just remember being very | amused when within the span of a week or two, three | different orgs announced ports, looked at each other, and | went "...wait.") | | Then for a while people developed on either the FreeBSD | port, the illumos fork called OpenZFS, or the Linux port, | but because (among other reasons) a bunch of development | kept happening on the Linux port, it became the defacto | upstream and got renamed "OpenZFS", and then FreeBSD more | or less got a fresh port from the OpenZFS codebase that is | now what it's based on. | | The macOS port got a fresh sync against that codebase | recently and is slowly trying to merge in, and then from | there, ??? | justinclift wrote: | TrueNAS (www.truenas.com) uses ZFS for the storage layer across | it's product range (storage software). | | They have both FreeBSD and Linux based stuff, targeting | different use cases. | quags wrote: | I have been using zfs for years from ubuntu 18. Easy snapshots, | monitoring, choices for raid levels, and ability to very easily | copy a dataset remotely with resume or incremental support is | awesome. I mainly use it for kvm systems each with their own | dataset. Coming from mdadm + lvm from my previous set up is | night and day for doing snapshots and backups. I do not use zfs | on root for ubuntu instead I do a raid1 software set up for the | os and then a zfs set up on other disks - zfs on root was the | only gotcha. For FreeBSD zfs on root works fine. | noaheverett wrote: | Running in production for about 3 years with Ubuntu 20.04 / zfs | 0.8.3. ZFS is being used as the datastore for a cluster of | LXD/LXC instances over multiple physical hosts. I have the OS | setup on its own dedicated drive and ZFS striped/cloned over 4 | NVMe drives. | | No gotchas / issues, works well, easy to setup. 
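|
| (For reference, assuming "striped/cloned" means striped mirrors, a
| layout like that is typically created along these lines, with stable
| /dev/disk/by-id paths for the same reason as the import-by-id note in
| the edit below; device IDs here are made up:)
|
|     zpool create -o ashift=12 tank \
|         mirror /dev/disk/by-id/nvme-MODEL_SERIAL1 /dev/disk/by-id/nvme-MODEL_SERIAL2 \
|         mirror /dev/disk/by-id/nvme-MODEL_SERIAL3 /dev/disk/by-id/nvme-MODEL_SERIAL4
|     zpool status tank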
| | I am looking forward to the Direct IO speed improvements for | NVMe drives with https://github.com/openzfs/zfs/pull/10018 | | edit: one thing I forgot to mention is, when creating your pool | make sure to import your drives by ID (zpool import -d | /dev/disk/by-id/ <poolname>) instead of name in case name | assignments change somehow [1] | | [1] https://superuser.com/questions/1732532/zfs-disk-drive- | lette... | throw0101c wrote: | > _I am looking forward to the Direct IO speed improvements | for NVMe drives | withhttps://github.com/openzfs/zfs/pull/10018_ | | See also "Scaling ZFS for NVMe" by Allan Jude at EuroBSDcon | 2022: | | * https://www.youtube.com/watch?v=v8sl8gj9UnA | noaheverett wrote: | Sweet, I appreciate the link! | gigatexal wrote: | Any of the enterprise customers of klara Systems are likely ZFS | production folks. | | https://klarasystems.com/?amp | drewg123 wrote: | We use ZFS in production for the non-content filesystems on a | large, ever increasing, percentage of our Netflix Open Connect | CDN nodes, replacing geom's gmirror. We had one gotcha (caught | on a very limited set of canaries) where a buggy bootloader | crashed part-way through boot, leaving the ZFS "bootonce" stuff | in a funky state requiring manual recovery (the nodes with | gmirror were fine, and fell back to the old image without | fuss). This has since been fixed. | | Note that we _do not_ use ZFS for content, since it is | incompatible with efficient use of sendfile (both because there | is no async handler for ZFS, so no async sendfile, and because | the ARC is not integrated with the page cache, so content would | require an extra memory copy to be served). | lifty wrote: | what do you use for the content filesystem? | drewg123 wrote: | FreeBSD's UFS | throw0101c wrote: | Any use of boot environments for easy(er?) rollbacks of OS | updates? | drewg123 wrote: | Yes. That's the bootonce thing I was talking about. When we | update the OS, we set the "bootonce" flag via bectl | activate -t to ensure we fall back to the previous BE if | the current BE is borked and not bootable. This is the same | functionality we had by keeping a primary and secondary | root partition in geom a and toggling the bootable | partition via the bootonce flag in gpart. | bakul wrote: | Is integrating ARC with the page cache a lost cause? If not, | may be Netflix can fund it! | postmodest wrote: | I would expect page cache to be one of those platform- | dependent things that prevent OpenZFS from doing that. | Especially on Linux, and especially because AFAIK the Linux | version has the most eyes on it. | rincebrain wrote: | My understanding, not having dug into it, is that it's | possible but just work nobody has done yet, though I'm | not sure what the relevant interfaces are in the Linux | kernel. | | One thing that makes the interfaces in Linux much messier | than the FreeBSD ones is that a lot of the core | functionality you might like to leverage in Linux | (workqueues, basically anything more complicated than | just calling kmalloc, and of course any SIMD save/restore | state, to name three examples I've stumbled over | recently) are marked EXPORT_SYMBOL_GPL or just entirely | not exported in newer releases, so you get to reimplement | the wheel for those, whereas on FreeBSD it's trivial to | just use their implementations of such things and shim | them to the Solaris-ish interfaces the non-platform- | specific code expects. 
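|
| (The SIMD case is easy to check in a kernel source tree -- the symbol
| prefix and path below are from memory, so verify against your kernel
| version:)
|
|     $ grep -rn 'EXPORT_SYMBOL_GPL(kernel_fpu_' arch/x86/kernel/fpu/
|     # a non-GPL-licensed module that uses these exports typically fails
|     # to load with "Unknown symbol" errors, which is why OpenZFS carries
|     # its own FPU save/restore handling on Linux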
| | So that makes the Linux-specific code a lot heavier, | because upstream is actively hostile. | Dylan16807 wrote: | > SIMD save/restore state | | I wish someone would come in and convince the kernel devs | that "hey, if you want EXPORT_SYMBOL_GPL to have legal | weight in a copyleft sense then you can't just slap it | onto interfaces for political reasons" | rincebrain wrote: | I don't think they care about it having legal weight, | that ship sailed long ago when they started advocating | for just slapping SYMBOL_GPL on things out of spite; I | think they care about excluding people from using their | software. | | IMO Linus should stop being half-and-half about it and | either mark everything SYMBOL_GPL and see how well that | goes or stop this nonsense. | Filligree wrote: | I just don't understand why they're so anti-ZFS. I want | my data to survive, please... | rincebrain wrote: | My impression is that some of the Linux kernel devs are | anti-anything that's not GPL-compatible, of any sort, | regardless of the particulars. | | Linus himself also made remarks about ZFS at one point | that were pretty...hostile. [1] [2] | | > The fact is, the whole point of the GPL is that you're | being "paid" in terms of tit-for-tat: we give source code | to you for free, but we want source code improvements | back. If you don't do that but instead say "I think this | is _legal_, but I'm not going to help you" you certainly | don't get any help from us. | | > So things that are outside the kernel tree simply do | not matter to us. They get absolutely zero attention. We | simply don't care. It's that simple. | | > And things that don't do that "give back" have no | business talking about us being assholes when we don't | care about them. | | > See? | | Note that there's at least one unfixed Linux kernel bug | that was found by OpenZFS users, reproducible without | using OpenZFS in any way, reported with a patch, and | ignored. [3] | | So "not giving back" is a dubious claim. | | [1] - https://arstechnica.com/gadgets/2020/01/linus- | torvalds-zfs-s... | | [2] - https://www.realworldtech.com/forum/?threadid=18971 | 1&curpost... | | [3] - https://bugzilla.kernel.org/show_bug.cgi?id=212295 | colonwqbang wrote: | Why don't you think it has legal weight? Or did you mean | something else? | | As far as know the point of EXPORT_SYMBOL_GPL was to push | back on companies like Nvidia who wanted to exploit | loopholes in the GPL. That seems to me like a reasonable | objective. | | Relevant Torvalds quote: | https://yarchive.net/comp/linux/export_symbol_gpl.html | rincebrain wrote: | Sure, and that alone isn't an unreasonable premise - as | he says, intent matters. | | But if you're marking interfaces as GPL-only, or | implementing taint detection that means if you use a non- | SYMBOL_GPL kernel symbol which calls a GPL-only function | it treats the non-SYMBOL_GPL symbol as GPL-only and | blocks your linking, it gets a bit out of hand. | | Building the kernel with certain kernel options makes | modules like OpenZFS or OpenAFS not link because of that | taint propagation - because things like the lockdep | checker turn uninfringing calls into infringing ones. | | Or a little while ago, there was a change which broke | building on PPC because a change made a non-SYMBOL_GPL | call on POWER into a SYMBOL_GPL one indirectly, and when | the original author was contacted, he sent a patch | reverting the changed symbol, and GregKH refused to pull | it into stable, suggesting distros could carry it if they | wanted to. 
(Of course, he had happily merged a change | into -stable earlier that just implemented more | aggressive GPL tainting and thereby broke things like the | aforementioned...) | PlutoIsAPlanet wrote: | The Linux kernel has never supported out of tree modules | like how ZFS works out of tree. | | All ZFS needs to do is just have one of Oracles many | lawyers say "CDDL is compatible with GPL". Yet, they | Oracle don't. | rincebrain wrote: | "All." | | It's explicitly not compatible with GPL, though. It has | clauses that are more restrictive than GPL, and IIRC some | people who contributed to the OpenZFS project did so | explicitly without allowing later CDDL license revisions, | which removes Oracle's ability to say CDDL-2 or whatever | is GPL-compatible. | | So even if someone rolled up dumptrucks of cash and | convinced Oracle that everything was great, they don't | have all the control needed to do that. | Dylan16807 wrote: | To have legal weight, it has to be a signal that you're | implementing something that is derivative of kernel code. | That's the directly stated intent of EXPORT_SYMBOL_GPL. | | But "call an opaque function that saves SIMD state" is | obviously not derivative of the kernel code in any way. | The more exports that get badly marked this way, the more | EXPORT_SYMBOL_GPL becomes indistinguishable from | EXPORT_SYMBOL. | colonwqbang wrote: | I see it as just a kind of "warranty void if seal | broken". Don't do this or you _may_ be in violation of | the GPL. Maybe a legal court in $country would find in | your favour (I 'm not convinced it's as clear cut as you | imply). Maybe they would find that you willfully | infringed, despite the kernel devs clearly warning you | not to do it. | | The main "legal effect" I see is that you are not willing | to take that risk, just like Oracle isn't. | bakul wrote: | I suspect the underlying issues for not unifying the two | have more to do with the ZFS design than anything to do | with Linux. It may be the codebase is far too large at | this stage to make such a fundamental change. | rincebrain wrote: | I don't think so. The memory management stuff is pretty | well abstracted; on FBSD it just glues into UMA pretty | transparently, it's just on Linux there's a lot of | machinery for implementing our own little cache | allocating because Linux's kernel cache allocator is very | limited in what sizes it will give you, and sometimes ZFS | wants 16M (not necessarily contiguous) regions because | someone said they wanted 16M records. | | The ZoL project lead said at one point there were a | variety of reasons this wasn't initially done for the | Linux integration [1], but that it was worth taking | another look at since that was a decade ago now. Having | looked at the Linux memory subsystems recently for | various reasons, I would suspect the limiting factor is | that almost all the Linux memory management functions | that involve details beyond "give me X pages" are | SYMBOL_GPL, so I suspect we couldn't access whatever | functionality would be needed to do this. | | I could be wrong, though, as I wasn't looking at the code | for that specific purpose, so I might have missed | functionality that would provide this. | | [1] - https://github.com/openzfs/zfs/issues/10255#issueco | mment-620... | bakul wrote: | Behlendorf's comment in that thread seems to be talking | about linux integration. My point was this is an older | issue, going back to the Sun days. See for instance this | thread in where McVoy complains about the same issue! 
htt | ps://www.tuhs.org/pipermail/tuhs/2021-February/023013.htm | ... | rincebrain wrote: | That seems more like it's complaining about it not being | the actual page cache, not it not being counted as | "cache", which is a larger set in at least Linux than | just the page cache itself. | | But sure, it's certainly an older issue, and given that | the ABD rework happened, I wouldn't put anything past | being "feasible" if the benefits were great enough. | | (Look at the O_DIRECT zvol rework stuff that's pending (I | believe not merged) for how a more cut-through memory | model could be done, though that has all the tradeoffs | you might expect of skipping the abstractions ZFS uses to | minimize the ability of applications to poke holes in the | abstraction model and violate consistency, I believe...) | the8472 wrote: | Could the linux integration use dax[0] to bypass the page | cache and go straight to ARC? | | [0] https://www.kernel.org/doc/Documentation/filesystems/ | dax.txt | gigatexal wrote: | This is amazing. A detail of Netflix that, I a plebe, | wouldn't know if not for this site. | ComputerGuru wrote: | Actually, Drew's presentations about Netflix, FreeBSD, ZFS, | saturating high-bandwidth network adapters, etc. are | legendary and have been posted far and wide. But having him | available to answer questions on HN just takes it to a | whole 'nother level. | drewg123 wrote: | You're making me blush.. But, to set the record straight: | I actually know very little about ZFS, beyond basic | user/admin knowledge (from having run it for ~15 years). | I've never spoken about it, and other members of the team | I work for at Netflix are far more knowledgeable about | ZFS, and are the ones who have managed the conversion of | our fleet to ZFS for non-content partitions. | gigatexal wrote: | Have they ever blogged or spoke at conferences about it? | I soak up all that content -- least I try to. | ComputerGuru wrote: | I've devoured your FreeBSD networking presentations but I | guess I must have confused a post about tracking down a | ZFS bug in production written by someone else with all | the other content you've produced. | | Back to the topic at hand, it's actually scary how few | software expose control over whether or not sendfile is | used, assuming support is only a matter of OS and kernel | version but not taking into account filesystem | limitations. I ran into a terrible Samba on FreeBSD bug | (shares remotely disconnected and connections reset with | moderate levels of concurrent ro access from even a | single client) that I ultimately tracked down to sendfile | being enabled in the (default?) config - so it wasn't | just the expected "performance requirements not being | met" with sendfile on ZFS but even other reliability | issues (almost certainly exposing a different underlying | bug, tbh). Imagine if Samba didn't have a tubeable to | set/override sendfile support, though. | xmodem wrote: | If you can share, what type of non-content data do the nodes | store? Is this just OS+application+logs? | nightfly wrote: | Yes, Ubuntu 20.04 and 22.04. But we've been running ZFS in some | form or other for 10+ years. ACL support not as good/easy to | use as Solaris/FreeBSD. Not having weird pathological | performance issues with kernel memory allocation like we had | with FreeBSD though. Sometimes we have issues with automatic | pool import on boot, so that's something to be careful with. 
| The tooling is great though, and we've never had catastrophic | failure that was due to ZFS, only due to failing hardware. | DvdGiessen wrote: | In production on SmartOS (illumos) servers running applications | and VMs, on TrueNAS and plain FreeBSD for various storage and | backups, and on a few Linux-based workstations. Using mirrors | and raidz2 depending on the needs of the machines. | | We've successfully survived numerous disk failures (a broken | batch of HDDs giving all kinds of small read errors, an SSD | that completely failed and disappeared, etc), and were in most | cases able to replace them without a second of downtime (it | would have been all cases if not for disks placed in | hard-to-reach places; now it's only a few minutes of downtime | to physically swap the disk). | | Snapshots work perfectly as well. Systems are set up to | automatically make snapshots using [1], on boot, on a timer, | and right before potentially dangerous operations such as | package manager commands. I've rolled back after | botched OS updates without problems; after a reboot the machine | was back in its old state. I've also rolled back a live system | a few times after a broken package update, restoring the | filesystem state without any issues. Easily accessing old | versions of a file is an added bonus which has been helpful a | few times. | | Send/receive is ideal for backups. We are able to send | snapshots between machines, even across different OSes, without | issues. We've also moved entire pools from one OS to another | without problems. | | Knowing we have automatic snapshots and external backups | configured also allows me to be very liberal with giving root | access to inexperienced people on various (non-critical) | machines, knowing that if anything breaks it will always be | easy to roll back, and encouraging them to learn by | experimenting a bit, to the point where we can even diff | between snapshots to inspect what changed and learn from that. | | The biggest gotchas so far have been on my personal Arch Linux | setup, where the out-of-tree nature of ZFS has caused some | issues like an incompatible kernel being installed, the ZFS | module failing to compile, and my workstation subsequently | being unable to boot. But even that was solved by my entire | system running on ZFS: a single rollback from my bootloader [2] | and all was back the way it was before. | | Having good tooling set up definitely helped a lot. My monkey | brain has the tendency to think "surely I got it right this | time, so no need to make a snapshot before trying out X!", | especially when experimenting on my own workstation. Automating | snapshots using a systemd timer and hooks added to my package | manager has saved me a number of times. | | [1]: https://github.com/psy0rz/zfs_autobackup | [2]: https://zfsbootmenu.org/ | enneff wrote: | I use ZFS on Debian for my home file server. The setup is just | a tiny NUC with a couple of large USB hard drives, mirrored | with ZFS. I've had drives fail and painlessly replaced and | resilvered them. This is easily the most hassle-free file | storage setup I've owned; it's been going strong for over 10 | years now with little to no maintenance. | crest wrote: | I use it as my default file system on FreeBSD. It was rough in | FreeBSD 7.x (around 2009), but starting with FreeBSD 8.x it has | been rock solid to this day.
The only gotcha (which the | documentation warns about) has been that automatic block level | deduplication is only useful in a few special applications and | has a large main memory overhead, unless you can accept | terrible performance for normal operations (e.g. a bandwidth | limited offsite backup). | yjftsjthsd-h wrote: | Sure; we get good mileage out of compression and snapshots | (well, mostly send-recv for moving data around rather than | snapshots in-place). I think the only problems have been very | specific to our install process (non-standard kernel in the | live environment; if we used the normal distro install process | it would be fine). | mattjaynes wrote: | ZFS on Linux has improved a lot in the last few years. We | (prematurely) moved to using it in production for our MySQL | data about 5 years ago and initially it was a nightmare due to | unexplained stalling which would hang MySQL for 15-30 minutes | at random times. I'm sure it shortened my life a few years | trying to figure out what was wrong when everything was on | fire. Fortunately, they have resolved those issues in the | subsequent releases and it's been much more pleasant after | that. | SkyMarshal wrote: | Not in production, but using ZoL on my personal workstations. | https://zfsonlinux.org/ | | Some discussion: | https://www.reddit.com/r/NixOS/comments/ops0n0/big_shoutout_... | szundi wrote: | Yes and it is awesome, no issues. | unixhero wrote: | Yes. Using the latest ZFS On Linux release on Debian. Using | Proxmox. Never had any problems, ever. | benlivengood wrote: | "production" at home on Debian 11, previously on FreeBSD 10-13. | The weirdest gotcha has been related to sending encrypted raw | snapshots to remote machines[0],[1]. These have been the first | instabilities I've had in roughly 15 years around the | filesystem, and they appeared after switching to native | encryption this year. Native encryption seems to be barely | stable for production use; no actual data corruption, but | automatic synchronization (I use znapzend) was breaking | frequently. Recent kernel updates fixed my problem although | some of the bug reports are still open. I only moved on from | FreeBSD because of more familiarity with Linux. | | A slightly annoying property of snapshots and clones is the | inability to fully re-root a tree of snapshots, e.g. | permanently split a clone from its original source and allow | first-class send/receive from that clone. The snapshot which | originated the clone needs to stick around forever[2]. This | prevents a typical virtual machine imaging process of keeping a | base image up to date over time that VMs can be cloned from | when desired, and eventually removing the storage used by the | original base image after e.g. several OS upgrades. | | I don't have any big performance requirements, and most file | storage is throughput-oriented, on spinning disks which can | easily saturate the gigabit network. | | I also use ZFS on my laptop's SSD under Ubuntu with about 1GB/s | performance and no shortage of IOPS, plus the ability to send | snapshots off to the backup system, which is pretty nice. | Ubuntu is going backwards on support for ZFS and native | encryption uses a hacky intermediate key under LUKS, but it | works.
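|
| (For what it's worth, `zfs promote` is the closest existing tool for
| that re-rooting, and it shows exactly the limitation described: the
| originating snapshot survives, it just changes owners. Dataset names
| below are made up:)
|
|     zfs snapshot rpool/base@v1
|     zfs clone rpool/base@v1 rpool/vm1
|     zfs promote rpool/vm1      # rpool/base@v1 is reparented as rpool/vm1@v1,
|                                # and rpool/base becomes a clone of it
|     zfs destroy rpool/vm1@v1   # still refused while rpool/base (or any
|                                # other clone) originates from that snapshot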
| | [0] https://github.com/openzfs/zfs/issues/12014 [1] | https://github.com/openzfs/zfs/issues/12594 | [2]https://serverfault.com/questions/265779/split-a-zfs-clone | albertzeyer wrote: | Also see this issue: https://github.com/openzfs/zfs/issues/405 | | > It is in FreeBSD main branch now, but disabled by default just | to be safe till after 14.0 released, where it will be included. | Can be enabled with loader tunable there. | | > more code is needed on the ZFS side for Linux integration. A | few people are looking at it AFAIK. | vlovich123 wrote: | Do Btrfs or ext4 offer this? | thrtythreeforty wrote: | Btrfs yes, ext4 no (but I believe xfs does). | | This should end up being exposed through cp --reflink=always, | so you could look up filesystem support for that. | danudey wrote: | XFS does, I've used it for specifically this feature before. | wtallis wrote: | This feature is basically the same as what underpins the | reflink feature that btrfs has supported approximately forever | and xfs has supported for at least several years. | mustache_kimono wrote: | Does anyone know whether btrfs or XFS support reflinks from | snapshot datasets? | Dylan16807 wrote: | I can confirm BTRFS yes, but note that source and | destination need to be on the same mount point before | kernel 5.18 | ComputerGuru wrote: | XFS doesn't have native snapshot support, though? | danudey wrote: | XFS doesn't have snapshot support, so the short answer | there is no. | mustache_kimono wrote: | Shows what I know about XFS. Thanks! | PlutoIsAPlanet wrote: | You can get psuedo-snapshots on XFS with a tool like | https://github.com/aravindavk/reflink-snapshot | | But, it still has to duplicate metadata which depending | on the amount of files may cause inconsistency in the | snapshot. | plq wrote: | This is only a tangent given we are talking about | snapshots and reflink, but just wanted to mention that | LVM has snapshots, so if you need XFS snapshots, create | the XFS filesystem on top of an LVM logical volume. | dsr_ wrote: | You can get a similar effect on top of any file system that | supports hard links with rdfind ( https://rdfind.pauldreik.se/ | ) -- but it's pretty slow. | | The Arch wiki says: | | "Tools dedicated to deduplicate a Btrfs formatted partition | include duperemove, bees, bedup and btrfs-dedup. One may also | want to merely deduplicate data on a file based level instead | using e.g. rmlint, jdupes or dduper-git. For an overview of | available features of those programs and additional | information, have a look at the upstream Wiki entry. | | Furthermore, Btrfs developers are working on inband (also known | as synchronous or inline) deduplication, meaning deduplication | done when writing new data to the filesystem. Currently, it is | still an experiment which is developed out-of-tree. Users | willing to test the new feature should read the appropriate | kernel wiki page." | someplaceguy wrote: | > You can get a similar effect on top of any file system that | supports hard links with rdfind ( | https://rdfind.pauldreik.se/ ) -- but it's pretty slow. | | It's a similar effect only if you don't modify the files, I | think. | | If you "clone" a file with a hard link and you modify the | contents of one copy, the other copy would also be equally | modified. | | As far as I understand this wouldn't happen with this type of | block cloning: each copy of the file would be completely | separate, except that they may (transparently) share data | blocks on disk. 
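|
| (A quick way to see the difference described above, on a filesystem
| with reflink support -- the /tank path and file names are made up:)
|
|     $ md5sum /tank/a.bin                  # note the original checksum
|     $ ln /tank/a.bin /tank/a.hardlink     # hard link: same inode
|     $ echo tweak >> /tank/a.hardlink
|     $ md5sum /tank/a.bin                  # changed: the "original" was
|                                           # modified through the hard link
|     $ cp --reflink=always /tank/a.bin /tank/a.reflink
|     $ echo tweak >> /tank/a.reflink
|     $ md5sum /tank/a.bin                  # unchanged: reflinked copies
|                                           # diverge independently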
___________________________________________________________________ (page generated 2023-07-04 23:00 UTC)