[HN Gopher] OpenZFS - add disks to existing RAIDZ
       ___________________________________________________________________
        
       OpenZFS - add disks to existing RAIDZ
        
       Author : shrubble
       Score  : 158 points
       Date   : 2023-08-19 16:35 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | atmosx wrote:
       | Finally :-)
        
       | lyjia wrote:
       | Wow, I've been hearing this has been in the works for a while. I
       | am glad to see it released! My RAIDZ array awaits new disks!
        
         | cassianoleal wrote:
         | What do you mean by released? It hasn't even been merged yet.
         | :)
        
       | znpy wrote:
        | It bothers me so much that ZFS is not in mainline Linux. I know
        | it's due to the license incompatibility... :(
        
         | Nextgrid wrote:
         | Has the whole license incompatibility thing actually been
         | tested/litigated in court? I heard Canonical has (at least at
         | some point) shipped prebuilt ZFS. I understand that in-tree
         | inclusion brings its own set of problems, but I'm just asking
         | about redistribution of binaries - the same binaries you are
         | allowed to build locally.
         | 
         | It would be nice to have a precedent deciding on this bullshit
         | argument once and for all so distros can freely ship prebuilt
         | binary modules and bring Linux to the modern times when it
         | comes to filesystems.
         | 
         | The whole situation is ridiculous. I'd understand if this was
         | about money, but who exactly gets hurt by users getting a
         | prebuilt module from somewhere, vs building exactly the same
         | thing locally from freely-available source?
        
           | 2OEH8eoCRo0 wrote:
           | Because it will be hella fun backing ZFS out of the kernel
           | source once it's been there for a few years.
        
           | rincebrain wrote:
           | I don't think Linux would like to ship a mass of code that
           | size that's not GPLed, even if court cases say CDDL is GPL-
           | compatible.
        
             | jonhohle wrote:
             | It isn't compatible because it adds additional restrictions
             | regarding patented code. The CDDL was made to be used in
             | mixed license distributions and only affects individual
             | files, but the GPL taints anything linked (which is why
             | LGPL exists). Since the terms of the CDDL can't be
             | respected in a GPL'd distribution, I can't see a way for it
             | to ever be included in the kernel repo.
             | 
              | I don't think there's any issue with Canonical shipping a
              | kmod, but similar to 3rd-party binary drivers, it would need
              | to be treated as a different "work".
        
           | askiiart wrote:
           | I can confirm that Ubuntu ships prebuilt ZFS, I used a live
           | Ubuntu USB to copy some data off a ZFS pool just a couple
           | weeks ago.
        
           | chungy wrote:
            | The only entity that could start a litigation cycle is Oracle,
            | and they've either been uninterested or know they can't win.
           | Canonical is the only entity they have any chance of going
           | after; Canonical's lawyers already decided there was no
           | license conflict.
           | 
           | Linus Torvalds doesn't feel like being the guinea pig by
           | risking ZFS in the mainline kernel. A totally reasonable
           | position while the CDDL+GPL resolution is still ultimately
           | unknown. (And honestly, with OpenZFS supporting a very wide
           | range of Linux versions and FreeBSD at the same time, I have
           | the feeling that mainline inclusion in Linux might not be the
           | best outcome anyway.)
        
             | Nextgrid wrote:
             | I wonder what Oracle would litigate over though? My
             | understanding is that licenses are generally used by
             | copyright holders to restrict what others can do with a
             | work so the holder can profit off the work and/or keep a
             | competitive advantage.
             | 
             | Here I do not see this argument applying since the source
             | is freely available to use and extend; the license
             | explicitly allows someone to compile it and use it. In this
             | case providing prebuilt binaries is more akin to providing
             | a "cache" for something you can (and are allowed to) build
             | locally (using ZFS-DKMS for example) using source you are
             | once again allowed to acquire and use.
             | 
             | What prejudice does it cause to Oracle that the "make"
              | command is run on Ubuntu's build servers as opposed to
             | users' individual machines? Have similar cases been
             | litigated before where the argument was about who runs the
             | make command, with source that either party has otherwise a
             | right to download & use?
        
               | Dylan16807 wrote:
               | Copying and distributing those binaries is covered by
               | copyright. You have to follow the license if you want to
               | do it. It doesn't matter that end users could legally get
               | the same files in some other manner. Distribution itself
               | is part of the legal framework.
        
             | bubblethink wrote:
             | Not to mention, bcachefs is making progress towards
             | mainline.
        
         | gigatexal wrote:
         | Same. Maybe one day. Though I don't know what will happen
         | first: it gets mainlined into the kernel or we get HL3
        
           | Filligree wrote:
           | So long as the kernel developers are actively hostile to
           | ZFS...
           | 
           | You will take your Btrfs and you will like it.
        
             | gigatexal wrote:
             | I'm holding out for bcachefs and still building ZFS via
             | dkms on current kernels like a madman.
        
               | Filligree wrote:
               | I'm running bcachefs on my desktop right now.
               | 
               | It's promising, but there's... bugs. Right now only
               | performance oriented ones, that I've noticed, but I'd
               | wait a bit longer.
        
         | tux3 wrote:
         | We may get a successor filesystem before that particular
          | situation is sorted out...
         | 
          | By all accounts mainline is at best not interested in, if not
          | actively against, ZFS on Linux. The last few kerfuffles around
         | symbols used by the out-of-tree module laid out the position
         | rather unambiguously.
        
           | Nextgrid wrote:
           | > The last few kerfuffles around symbols used by the out-of-
           | tree module laid out the position rather unambiguously.
           | 
           | Source, for someone who isn't following kernel mailing lists?
        
             | tux3 wrote:
             | I was thinking of this thread from 5.0 in particular (2019,
             | time flies!)
             | 
              | https://lore.kernel.org/all/20190110182413.GA6932@kroah.com/
        
               | Nextgrid wrote:
               | It's sad to see that free software under a license (and
               | movement) that was born out of someone's frustration with
               | closed-source printer drivers (acting as DRM, albeit
               | inadvertently) appears to include similar DRM whose sole
               | purpose is to restrict usage of a (seemingly arbitrary)
               | selection of symbols.
        
               | tux3 wrote:
               | It is. That said, I can't fault people too much for being
               | afraid of the lawnmower. People have been mowed for much
               | less.
        
           | betaby wrote:
           | > successor filesystem
           | 
           | Which one?
        
             | pa7ch wrote:
             | bcachefs presumably
        
               | j16sdiz wrote:
               | we have heard the same with btrfs.
        
               | chungy wrote:
               | bcachefs has had time to actually mature instead of being
               | kneecapped early on by an angry Linus Torvalds when
               | btrfs's on disk format changed and broke his Fedora
               | install.
        
       | matheusmoreira wrote:
       | So happy to see this. Incremental expansion is extremely
       | important for consumers, homelabs. Now we can gradually expand
       | capacity.
        
       | shrubble wrote:
       | "This feature allows disks to be added one at a time to a RAID-Z
       | group, expanding its capacity incrementally. This feature is
       | especially useful for small pools (typically with only one RAID-Z
       | group), where there isn't sufficient hardware to add capacity by
       | adding a whole new RAID-Z group (typically doubling the number of
       | disks)."
       | 
       | A feature I am excited to see is being added!
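        | 
        | If I'm reading the PR right, the expansion is driven through
        | the existing zpool attach command, pointed at a raidz vdev
        | instead of a mirror (pool, vdev and device names below are
        | placeholders):
        | 
        |     # attach a sixth disk to an existing 5-wide raidz2 vdev
        |     zpool attach tank raidz2-0 /dev/sdf
        | 
        |     # the reflow runs in the background; progress shows up here
        |     zpool status tank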
        
       | Dwedit wrote:
       | So is it safe to use btrfs for a basic Raid-1 yet?
        
         | wtallis wrote:
         | Yeah, the non-parity RAID modes have been safe for a pretty
         | long time, as long as you RTFM when something goes wrong
         | instead of assuming the recovery procedures match what you'd
         | expect coming from a background of traditional RAID or how ZFS
         | does it. I've been using RAID-1 (with RAID1c3 for metadata
         | since that feature became available) on a NAS for over a decade
          | now without losing data, despite losing more drives over the
          | years than the array started out with.
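          | 
          | For anyone wanting to try it, the setup is a one-liner; device
          | names here are placeholders and raid1c3 needs a reasonably
          | recent kernel:
          | 
          |     # three devices: data mirrored twice, metadata three ways
          |     mkfs.btrfs -d raid1 -m raid1c3 /dev/sda /dev/sdb /dev/sdc
          | 
          |     # or convert the metadata profile of an existing filesystem
          |     btrfs balance start -mconvert=raid1c3 /mnt/nas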
        
       | scheme271 wrote:
        | This has been floating around for 2 years at this point, so it
        | might be a long while until it gets in. Interestingly, QNAP
        | somehow added this feature into the code that their QuTS Hero
        | NASes use. I'm
       | not sure how solid or tested the QNAP code is but it's solid
       | enough that they're shipping it in production.
        
         | jtriangle wrote:
          | QNAP does recommend a full backup before doing so, which tells
          | me it's not exactly production-ready as you and I would think
         | of it.
        
           | rincebrain wrote:
           | QNAP's source drops are also kind of wild, in that they
           | branched a looooooooong time ago and have been implementing
           | their own versions of features they wanted since, AFAICT.
        
           | phpisthebest wrote:
           | I would expect any storage array to recommend a full backup
            | any time you are messing with the physical disks. Even with
            | "production ready" features, one would not add, remove, or do
            | anything with the array without a full backup.
            | 
            | "Production" systems should not even be considered production
            | unless you have a backup of them.
        
       | crote wrote:
       | > After the expansion completes, old blocks remain with their old
       | data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2
       | parity), but distributed among the larger set of disks. New
       | blocks will be written with the new data-to-parity ratio (e.g. a
       | 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data
       | to 2 parity).
       | 
       | Does anyone know why this is the case? When expanding an array
       | which is getting full this will result in a far smaller capacity
       | gain than desired.
       | 
       | Let's assume we are using 5x 10TB disks which are 90% full.
       | Before the process, each disk will contain 5.4TB of data, 3.6TB
       | of parity, and 1TB of free space. After the process and
       | converting it to 6x 10TB, each disk will contain 4.5TB of data,
       | 3TB of parity, and 2.5TB of free space. We can fill this free
       | space with 1.66TB of data and 0.83TB of parity per disk - after
       | which our entire array will contain 36.96TB of data.
       | 
       | If we made a new 6-wide Z2 array, it would be able to contain
       | 40TB of data - so adding a disk this way made us lose over 3TB in
       | capacity! Considering the process is already reading and
       | rewriting basically the entire array, why not recalculate the
       | parity as well?
        
         | louwrentius wrote:
         | The key issue is that you basically have to rewrite all
         | existing data to regain that lost 3TB. This takes a huge amount
         | of time and the ZFS developers have decided not to automate
         | this as part of this feature.
         | 
         | You can do this yourself though when convenient to get those
         | lost TB back.
         | 
          | The RAIDZ vdev expansion feature had actually gone quite stale
          | and, afaik, wasn't being worked on until this sponsorship.
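          | 
          | If you do want to do it by hand, one rough sketch (dataset
          | names are placeholders, and any snapshots keep pinning the old
          | blocks at the old ratio until they are destroyed):
          | 
          |     zfs snapshot tank/data@pre-rewrite
          |     zfs send tank/data@pre-rewrite | zfs recv tank/data-new
          |     zfs destroy -r tank/data
          |     zfs rename tank/data-new tank/data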
        
         | allanjude wrote:
         | That is not how this will work.
         | 
         | The reason the parity ratio stays the same, is that all of the
         | references to the data are by DVA (Data Virtual Address,
         | effectively the LBA within the RAID-Z vdev).
         | 
         | So the data will occupy the same amount of space and parity as
         | it did before.
         | 
         | All stripes in RAID-Z are dynamic, so if your stripe is 5 wide
         | and your array is 6 wide, the 2nd stripe will start on the last
         | disk and wrap around.
         | 
         | So if your 5x10 TB disks are 90% full, after the expansion they
         | will contain the same 5.4 TB of data and 3.6 TB of parity, and
         | the pool will now be 10 TB bigger.
         | 
          | New writes will be 4+2 instead, but the old data won't change
          | (that is how this feature is able to work without needing
          | block-pointer rewrite).
         | 
         | See this presentation:
         | https://www.youtube.com/watch?v=yF2KgQGmUic
        
         | magicalhippo wrote:
         | > Considering the process is already reading and rewriting
         | basically the entire array, why not recalculate the parity as
         | well?
         | 
         | Because snapshots might refer to the old blocks. Sure you could
         | recompute, but then any snapshots would mean those old blocks
         | would have to stay around so now you've taken up ~twice the
         | space.
        
         | mustache_kimono wrote:
          | > Does anyone know why this is the case?
          | 
          | > Considering the process is already reading and rewriting
          | basically the entire array, why not recalculate the parity as
          | well?
         | 
         | IANA expert but my guess is -- because, here, you don't have to
         | modify block pointers, etc.
         | 
         | ZFS RAIDZ is not like traditional RAID, as it's not just a
         | sequence of arbitrary bits, data plus parity. RAIDZ stripe
         | width is variable/dynamic, written in blocks (imagine a 128K
         | block, compressed to ~88K), and there is no way to quickly tell
         | where the parity data is within a written block, where the end
         | of any written block is, etc.
         | 
          | If you instead had to modify the block pointers, I'd assume
          | you would also have to change each block in the live tree and
          | all dependent (including snapshot) blocks at the same time?
         | sounds extraordinarily complicated (and this is the data
         | integrity FS!), and much slower, than just blasting through the
         | data, in order.
         | 
          | To do what you want, you can do what one could always do -- zfs
          | send/recv from the old filesystem to a new one.
        
       | jedberg wrote:
       | I wish Apple and Oracle would have just sorted things out and
       | made ZFS the main filesystem for the Mac. Way back in the day
       | when they first designed Time Machine, it was supposed to just be
       | a GUI for ZFS snapshots.
       | 
       | How cool would it be if we had a great GUI for ZFS (snapshots,
       | volume management, etc.). I could buy a new external disk, add it
       | to a pool, have seamless storage expansion.
       | 
       | It would be great. Ah, what could have been.
        
         | xoa wrote:
         | Yes, this will always depress me, and to me personally be one
          | of the ultimate evils of the court-invented idea of "software
          | patents". I've used ZFS with Macs since 2011, but without Apple
         | onboard it's never been as smooth as it should have been and
         | has gotten more difficult in some respects. There was a small
         | window where we might have had a really universal, really solid
         | FS with great data guarantees and features. All sorts of things
          | with the OS could be so much better, trivial (and even automated)
         | update rollbacks for example. Sigh :(. I hope zvols finally
         | someday get some performance attention so that if nothing else
         | running APFS (or any other OS of course) on top of one is a
         | better experience.
        
         | phs318u wrote:
         | I had the same thoughts about Apple and (at the time) Sun, 15
         | years ago.
         | 
         | https://macoverdrive.blogspot.com/2008/10/using-zfs-to-manag...
        
         | risho wrote:
          | yeah imagine a world where you could use time machine to back up
         | 500gb parallels volumes where only the diff was stored between
         | snapshots rather than needing to back up the whole 500gb volume
         | every single time.
        
           | jedberg wrote:
           | Right, that would be nice wouldn't it?
           | 
            | As a workaround, you can create a sparse bundle to store your
            | Parallels volume. Sparse bundles are stored in bands, and only
            | bands that change get backed up. It might be slightly more
            | efficient.
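            | 
            | Something along these lines (size, filesystem and names are
            | just examples):
            | 
            |     hdiutil create -size 500g -type SPARSEBUNDLE -fs APFS \
            |         -volname Parallels ~/Parallels.sparsebundle
            |     hdiutil attach ~/Parallels.sparsebundle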
        
             | risho wrote:
             | wow that sounds like an interesting solution!
        
         | globular-toast wrote:
         | TrueNAS? You can manage the volumes with a web UI. It has a
            | thing you can enable for SMB shares that lets you see the
         | previous versions of files on Windows. Or perhaps I don't
         | understand what you're after.
        
           | jedberg wrote:
           | I'm looking for that kind of featureset for my root
            | filesystem on macOS. :/
        
         | Terretta wrote:
         | > _How cool would it be if we had a great GUI for ZFS
         | (snapshots, volume management, etc.). I could buy a new
         | external disk, add it to a pool, have seamless storage
         | expansion._
         | 
         | See QNAP HERO 5:
         | 
         | https://www.qnap.com/static/landing/2021/quts-hero-5.0/en/in...
         | 
         | NAS Options:
         | 
         | https://www.qnap.com/en-us/product/?conditions=4-3
        
           | jedberg wrote:
           | As far as I can tell I can't use that as my root filesystem,
           | right?
        
         | mustache_kimono wrote:
         | > How cool would it be if we had a great GUI for ZFS
         | (snapshots, volume management, etc.).
         | 
         | How cool would it be if we had a great _TUI_ for ZFS...
         | 
         | Live in the now: https://github.com/kimono-koans/httm
        
       | ErneX wrote:
       | I'm not an expert whatsoever but what I've been doing for my NAS
       | is using mirrored VDEVs. Started with one and later on added a
       | couple more drives for a second mirror.
       | 
        | Coincidentally, one of the drives in my 1st mirror died a few days
       | ago after rebooting the host machine for updates and I replaced
       | it today, it's been resilvering for a while.
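        | 
        | For anyone following along, the commands involved are short
        | (pool and device names below are placeholders):
        | 
        |     # grow the pool by adding a second mirror vdev
        |     zpool add tank mirror /dev/ada2 /dev/ada3
        | 
        |     # swap out the dead disk; the resilver starts automatically
        |     zpool replace tank /dev/ada0 /dev/ada4
        |     zpool status tank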
        
         | edmundsauto wrote:
         | I've read this is suboptimal because you are now stressing the
         | drive that has the only copy of your data to rebuild. what are
         | your thoughts?
        
           | deadbunny wrote:
           | Mirrored vdevs resilver a lot faster than zX vdevs. Much less
           | chance of the remaining drive dying during a resilver if it
           | takes hours rather than days.
        
       | tambourine_man wrote:
        | If you don't need real-time redundancy, you may be better served
        | by something like SnapRAID. It's more flexible, can handle
        | mismatched disk sizes, has much lower performance requirements,
        | etc.
        
       | hinkley wrote:
       | I'm frustrated because this feature was mentioned by Schwartz
       | when it was still in beta. I thought a new era of home computing
       | was about to start. It didn't, and instead we got The Cloud,
       | which feels like decentralization but is in fact massive
       | centralization (organizational, rather than geographical).
       | 
       | Some of us think people should be hosting stuff from home,
       | accessible from their mobile devices. But the first and to me one
       | of the biggest hurdles is managing storage. And that requires a
       | storage appliance that is simpler than using a laptop, not
       | requiring the skills of an IT professional.
       | 
       | Drobo tried to make a storage appliance, but once you got to the
       | fine print it had the same set of problems that ZFS still does.
       | 
       | All professional storage solutions are built on an assumption of
       | symmetry of hardware. I have n identical (except not the same
       | batch?) drives which I will smear files out across.
       | 
       | Consumers will _never_ have drive symmetry. That's a huge
        | expenditure that few can justify, let alone afford. My Synology
       | didn't like most of my old drives so by the time I had a working
       | array I'd spent practically a laptop on it. For a weirdly shaped
       | computer I couldn't actually use directly. I'm a developer, I can
       | afford it. None of my friends can. Mom definitely can't.
       | 
        | A consumer solution needs to assume drive asymmetry. The day it
       | is first plugged in, it will contain a couple new drives, and
       | every hard drive the consumer can scrounge up from junk drawers -
       | save two: their current backup drive and an extra copy. Once the
       | array (with one open slot) is built and verified, then one of the
       | backups can go into the array for additional space and speed.
       | 
       | From then on, the owner will likely buy one or two new drives
       | every year, at whatever price point they're willing to pay, and
       | swap out the smallest or slowest drive in the array. Meaning the
       | array will always contain 2-3 different generation of hard
       | drives. Never the same speed and never the same capacity. And
       | they expect that if a rebuild fails, some of their data will
       | still be retrievable. Without a professional data recovery
       | company.
       | 
        | That rules out all RAID levels except 0, which is nuts. An
       | algorithm that can handle this scenario is consistent hashing.
       | Weighted consistent hashing can handle disparate resources, by
       | assigning more buckets to faster or larger machines. And it can
       | grow and shrink (in a drive array, the two are sequential or
       | simultaneous).
       | 
       | Small and old businesses begin to resemble consumer purchasing
       | patterns. They can't afford a shiny new array all at once. It's
       | scrounging and piecemeal. So this isn't strictly about chasing
       | consumers.
       | 
       | I thought ZFS was on a similar path, but the delays in sprouting
       | these features make me wonder.
        
         | Osiris wrote:
         | I completely agree. To build my array I had to buy several
         | drives at the same time. To expand I had to buy a new drive,
         | move the data onto the array, and then I'm left with the extra
          | drive I had to buy to temporarily store the data because I
         | can't add it to the array.
         | 
         | I would love to have more options for expandable redundancy.
        
         | gregmac wrote:
         | I can afford it, but have a hard time justifying the costs, not
         | to mention scrapped (working) hardware and inconvenience (of
         | swapping to a whole new array).
         | 
         | I started using snapraid [1] several years ago, after finding
         | zfs couldn't expand. Often when I went to add space the "sweet
         | spot" disk size (best $/TB) was 2-3x the size of the previous
         | biggest disk I ran. This was very economical compared to
         | replacing the whole array every couple years.
         | 
         | It works by having "data" and "parity" drives. Data drives are
         | totally normal filesystems, and joined with unionfs. In fact
         | you can mount them independently and access whatever files are
          | on them. Parity drives just hold a big file that snapraid updates
         | nightly.
         | 
         | The big downside is it's not realtime redundant: you can lose a
         | day's worth of data from a (data) drive failure. For my use
         | case this is acceptable.
         | 
         | A huge upside is rebuilds are fairly painless. Rebuilding a
         | parity drive has zero downtime, just degraded performance.
         | Rebuilding a data drive leaves it offline, but the rest work
         | fine (I think the individual files are actually accessible as
         | they're restored though). In the worst case you can mount each
         | data drive independently on any system and recover its
         | contents.
         | 
         | I've been running the "same" array for a decade, but at this
         | point every disk has been swapped out at least once (for a
         | larger one), and it's been in at least two different host
         | systems.
         | 
         | [1] https://www.snapraid.it/
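          | 
          | For a sense of scale, the whole thing is roughly a small config
          | file plus a nightly job (paths and disk names below are just
          | illustrative):
          | 
          |     # /etc/snapraid.conf
          |     parity  /mnt/parity1/snapraid.parity
          |     content /mnt/disk1/snapraid.content
          |     content /mnt/disk2/snapraid.content
          |     data d1 /mnt/disk1
          |     data d2 /mnt/disk2
          | 
          |     # run nightly from cron
          |     snapraid sync    # update parity to match current data
          |     snapraid scrub   # spot-check a portion of the array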
        
         | mrighele wrote:
         | > I'm a developer, I can afford it. None of my friends can. Mom
         | definitely can't.
         | 
          | The only thing that can work for your mom and your friends is,
          | in my opinion, a pair of disks in a mirror. When the space
          | runs out, buy another box with two more disks in a mirror.
         | Anything more than this is not only too complex for the average
         | user but also too expensive.
        
           | hinkley wrote:
            | Keeping track of a heterogeneous drive array is just as big an
           | imposition.
        
           | bartvk wrote:
           | Even that sounds complex.
           | 
           | If they want network attached storage, I'd just use a single
           | disk NAS, and remotely back it up.
        
           | Quekid5 wrote:
            | I think I would advise against direct mirroring -- instead,
           | I'd do sync-every-24-hours or something similar.
           | 
           | Both schemes are vulnerable to the (admittedly rarer) errors
           | where both drives fail simultaneously (e.g. mobo fried them)
           | or are just ... destroyed by a fire or whatever.
           | 
           | A periodic sync (while harder to set up) _will_ occasionally
            | save you from deleting the wrong files, which mirroring
            | doesn't.
           | 
           | Either way: Any truly important data (family photos/videos,
           | etc.) needs to be saved periodically to remote storage.
           | There's no getting around that if you _really_ care about the
           | data.
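            | 
            | A minimal version of that periodic sync (paths and schedule
            | are just examples) is a nightly rsync from cron, without
            | --delete so that an accidental deletion isn't propagated to
            | the copy:
            | 
            |     # /etc/cron.d/nightly-sync
            |     0 3 * * * root rsync -a /mnt/primary/ /mnt/backup/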
        
       | canvascritic wrote:
       | this is a really neat addition to raid-z. i recall setting up my
       | zfs pool in the early 2000s and grappling with disk counts
       | because of how rigid expansion was. Good times. this would've
       | made things so much simpler. small nitpick: in the "during
       | expansion" bit, I thought he could have elaborated a touch on
       | restoring the "health of the raidz vdev" part, didn't really
       | follow his reasoning there. but overall, looking forward to this
       | update. nice work.
        
       | eminence32 wrote:
       | The big news here seems to be that iXsystems (the company behind
       | FreeNAS/TrueNAS) is sponsoring this work now. This PR supersedes
        | one that was opened back in 2021:
       | 
       | https://github.com/openzfs/zfs/pull/12225#issuecomment-16101...
        
         | seltzered_ wrote:
         | Yep, see also https://freebsdfoundation.org/blog/raid-z-
         | expansion-feature-... (2022)
         | 
         | (Via
         | https://lobste.rs/s/5ahxj1/raid_z_expansion_feature_for_zfs_...
         | )
        
       ___________________________________________________________________
       (page generated 2023-08-19 23:00 UTC)