[HN Gopher] OpenZFS - add disks to existing RAIDZ
___________________________________________________________________
OpenZFS - add disks to existing RAIDZ
Author : shrubble
Score : 158 points
Date : 2023-08-19 16:35 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| atmosx wrote:
| Finally :-)
|
| lyjia wrote:
| Wow, I've been hearing this has been in the works for a while. I am glad to see it released! My RAIDZ array awaits new disks!
|
| cassianoleal wrote:
| What do you mean by released? It hasn't even been merged yet. :)
|
| znpy wrote:
| It bothers me so much that ZFS is not in mainline Linux. I know it's due to the license incompatibility... :(
|
| Nextgrid wrote:
| Has the whole license incompatibility thing actually been tested/litigated in court? I heard Canonical has (at least at some point) shipped prebuilt ZFS. I understand that in-tree inclusion brings its own set of problems, but I'm just asking about redistribution of binaries - the same binaries you are allowed to build locally.
|
| It would be nice to have a precedent deciding on this bullshit argument once and for all so distros can freely ship prebuilt binary modules and bring Linux into the modern era when it comes to filesystems.
|
| The whole situation is ridiculous. I'd understand if this was about money, but who exactly gets hurt by users getting a prebuilt module from somewhere, vs building exactly the same thing locally from freely-available source?
|
| 2OEH8eoCRo0 wrote:
| Because it will be hella fun backing ZFS out of the kernel source once it's been there for a few years.
|
| rincebrain wrote:
| I don't think Linux would like to ship a mass of code that size that's not GPLed, even if court cases say CDDL is GPL-compatible.
|
| jonhohle wrote:
| It isn't compatible because it adds additional restrictions regarding patented code. The CDDL was made to be used in mixed-license distributions and only affects individual files, but the GPL taints anything linked (which is why the LGPL exists). Since the terms of the CDDL can't be respected in a GPL'd distribution, I can't see a way for it to ever be included in the kernel repo.
|
| I don't think there's any issue with Canonical shipping a kmod, but similar to 3rd-party binary drivers, it would need to be treated as a different "work".
|
| askiiart wrote:
| I can confirm that Ubuntu ships prebuilt ZFS; I used a live Ubuntu USB to copy some data off a ZFS pool just a couple of weeks ago.
|
| chungy wrote:
| The only entity that would start a litigation cycle is Oracle, and they've either been uninterested or know they can't win. Canonical is the only entity they have any chance of going after; Canonical's lawyers already decided there was no license conflict.
|
| Linus Torvalds doesn't feel like being the guinea pig by risking ZFS in the mainline kernel. A totally reasonable position while the CDDL+GPL resolution is still ultimately unknown. (And honestly, with OpenZFS supporting a very wide range of Linux versions and FreeBSD at the same time, I have the feeling that mainline inclusion in Linux might not be the best outcome anyway.)
|
| Nextgrid wrote:
| I wonder what Oracle would litigate over, though? My understanding is that licenses are generally used by copyright holders to restrict what others can do with a work so the holder can profit off the work and/or keep a competitive advantage.
|
| Here I do not see this argument applying, since the source is freely available to use and extend; the license explicitly allows someone to compile it and use it. In this case, providing prebuilt binaries is more akin to providing a "cache" for something you can (and are allowed to) build locally (using ZFS-DKMS, for example) using source you are once again allowed to acquire and use.
|
| What prejudice does it cause to Oracle that the "make" command is run on Ubuntu's build servers as opposed to users' individual machines? Have similar cases been litigated before where the argument was about who runs the make command, with source that either party otherwise has a right to download & use?
|
| Dylan16807 wrote:
| Copying and distributing those binaries is covered by copyright. You have to follow the license if you want to do it. It doesn't matter that end users could legally get the same files in some other manner. Distribution itself is part of the legal framework.
|
| bubblethink wrote:
| Not to mention, bcachefs is making progress towards mainline.
|
| gigatexal wrote:
| Same. Maybe one day. Though I don't know what will happen first: it gets mainlined into the kernel or we get HL3.
|
| Filligree wrote:
| So long as the kernel developers are actively hostile to ZFS...
|
| You will take your Btrfs and you will like it.
|
| gigatexal wrote:
| I'm holding out for bcachefs and still building ZFS via DKMS on current kernels like a madman.
|
| Filligree wrote:
| I'm running bcachefs on my desktop right now.
|
| It's promising, but there's... bugs. Right now only performance-oriented ones that I've noticed, but I'd wait a bit longer.
|
| tux3 wrote:
| We may get a successor filesystem before that particular situation is sorted out...
|
| By all accounts mainline is at best not interested in, if not actively against, ZFS on Linux. The last few kerfuffles around symbols used by the out-of-tree module laid out the position rather unambiguously.
|
| Nextgrid wrote:
| > The last few kerfuffles around symbols used by the out-of-tree module laid out the position rather unambiguously.
|
| Source, for someone who isn't following kernel mailing lists?
|
| tux3 wrote:
| I was thinking of this thread from 5.0 in particular (2019, time flies!)
|
| https://lore.kernel.org/all/20190110182413.GA6932@kroah.com/
|
| Nextgrid wrote:
| It's sad to see that free software under a license (and movement) that was born out of someone's frustration with closed-source printer drivers (acting as DRM, albeit inadvertently) appears to include similar DRM whose sole purpose is to restrict usage of a (seemingly arbitrary) selection of symbols.
|
| tux3 wrote:
| It is. That said, I can't fault people too much for being afraid of the lawnmower. People have been mowed for much less.
|
| betaby wrote:
| > successor filesystem
|
| Which one?
|
| pa7ch wrote:
| bcachefs presumably
|
| j16sdiz wrote:
| We have heard the same about btrfs.
|
| chungy wrote:
| bcachefs has had time to actually mature instead of being kneecapped early on by an angry Linus Torvalds when btrfs's on-disk format changed and broke his Fedora install.
|
| matheusmoreira wrote:
| So happy to see this. Incremental expansion is extremely important for consumers and homelabs. Now we can gradually expand capacity.
|
| shrubble wrote:
| "This feature allows disks to be added one at a time to a RAID-Z group, expanding its capacity incrementally.
This feature is especially useful for small pools (typically with only one RAID-Z group), where there isn't sufficient hardware to add capacity by adding a whole new RAID-Z group (typically doubling the number of disks)."
|
| A feature I am excited to see being added!
|
| Dwedit wrote:
| So is it safe to use btrfs for a basic RAID-1 yet?
|
| wtallis wrote:
| Yeah, the non-parity RAID modes have been safe for a pretty long time, as long as you RTFM when something goes wrong instead of assuming the recovery procedures match what you'd expect coming from a background of traditional RAID or how ZFS does it. I've been using RAID-1 (with RAID1c3 for metadata since that feature became available) on a NAS for over a decade now without losing data, despite losing more drives over the years than the array started out with.
|
| scheme271 wrote:
| This has been floating around for 2 years at this point, so it might be a long while until it gets in. Interestingly, QNAP somehow added this feature to the code that their QuTS Hero NASes use. I'm not sure how solid or tested the QNAP code is, but it's solid enough that they're shipping it in production.
|
| jtriangle wrote:
| QNAP does recommend a full backup before doing so, which tells me it's not exactly production-ready as you and I would think of it.
|
| rincebrain wrote:
| QNAP's source drops are also kind of wild, in that they branched a looooooooong time ago and have been implementing their own versions of features they wanted since, AFAICT.
|
| phpisthebest wrote:
| I would expect any storage array to recommend a full backup any time you are messing with the physical disks. Even with "production ready" features, one would not add, remove, or do anything with the array without a full backup.
|
| "Production" systems should not even be considered production unless you have a backup of them.
|
| crote wrote:
| > After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity).
|
| Does anyone know why this is the case? When expanding an array which is getting full, this will result in a far smaller capacity gain than desired.
|
| Let's assume we are using 5x 10TB disks which are 90% full. Before the process, each disk will contain 5.4TB of data, 3.6TB of parity, and 1TB of free space. After the process and converting it to 6x 10TB, each disk will contain 4.5TB of data, 3TB of parity, and 2.5TB of free space. We can fill this free space with 1.66TB of data and 0.83TB of parity per disk - after which our entire array will contain 36.96TB of data.
|
| If we made a new 6-wide Z2 array, it would be able to contain 40TB of data - so adding a disk this way made us lose over 3TB in capacity! Considering the process is already reading and rewriting basically the entire array, why not recalculate the parity as well?
|
| louwrentius wrote:
| The key issue is that you basically have to rewrite all existing data to regain that lost 3TB. This takes a huge amount of time, and the ZFS developers have decided not to automate this as part of this feature.
|
| You can do this yourself though, when convenient, to get those lost TB back.
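|
| A minimal sketch of what such a manual rewrite could look like (an illustration only, not a shipped tool: it assumes a dataset with no snapshots you need to keep, enough free space for a second copy of the largest file, and a hypothetical mountpoint):
|
|     #!/usr/bin/env python3
|     # Sketch: rewrite every file in place so its blocks are written again
|     # at the pool's current (post-expansion) data:parity ratio. Blocks
|     # still referenced by snapshots are not freed, so this only helps on
|     # datasets without snapshots you want to keep.
|     import os
|     import shutil
|
|     DATASET = "/tank/media"  # hypothetical mountpoint of the dataset
|
|     for root, _dirs, files in os.walk(DATASET):
|         for name in files:
|             src = os.path.join(root, name)
|             tmp = src + ".rebalance.tmp"
|             shutil.copy2(src, tmp)  # the copy is written with the new layout
|             os.replace(tmp, src)    # atomically swap the rewritten copy in
|
| A zfs send/recv of the whole dataset into a new dataset (mentioned further down in the thread) is another way to get the old data rewritten.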
|
| The RAID-Z vdev expansion feature actually was quite stale and wasn't being worked on, afaik, until this sponsorship.
|
| allanjude wrote:
| That is not how this will work.
|
| The reason the parity ratio stays the same is that all of the references to the data are by DVA (Data Virtual Address, effectively the LBA within the RAID-Z vdev).
|
| So the data will occupy the same amount of space and parity as it did before.
|
| All stripes in RAID-Z are dynamic, so if your stripe is 5 wide and your array is 6 wide, the 2nd stripe will start on the last disk and wrap around.
|
| So if your 5x10 TB disks are 90% full, after the expansion they will contain the same 5.4 TB of data and 3.6 TB of parity, and the pool will now be 10 TB bigger.
|
| New writes will be 4+2 instead, but the old data won't change (that is how this feature is able to work without needing block-pointer rewrite).
|
| See this presentation: https://www.youtube.com/watch?v=yF2KgQGmUic
|
| magicalhippo wrote:
| > Considering the process is already reading and rewriting basically the entire array, why not recalculate the parity as well?
|
| Because snapshots might refer to the old blocks. Sure, you could recompute, but then any snapshots would mean those old blocks would have to stay around, so now you've taken up ~twice the space.
|
| mustache_kimono wrote:
| > Does anyone know why this is the case?
| > Considering the process is already reading and rewriting basically the entire array, why not recalculate the parity as well?
|
| IANA expert, but my guess is -- because, here, you don't have to modify block pointers, etc.
|
| ZFS RAIDZ is not like traditional RAID, as it's not just a sequence of arbitrary bits, data plus parity. RAIDZ stripe width is variable/dynamic, written in blocks (imagine a 128K block, compressed to ~88K), and there is no way to quickly tell where the parity data is within a written block, where the end of any written block is, etc.
|
| If you instead had to modify the block pointers, I'd assume you would also have to change each block in the live tree and all dependent (including snapshot) blocks at the same time? That sounds extraordinarily complicated (and this is the data integrity FS!), and much slower than just blasting through the data in order.
|
| To do what you want, you can do what one could always do -- zfs send/recv between an old and a new filesystem.
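|
| A toy illustration of the wrap-around layout allanjude describes (a sketch of the row-major idea only, not OpenZFS's actual on-disk logic: the expansion re-places existing sectors across more columns while keeping their logical order, so an old 3 data + 2 parity block still occupies 5 sectors and keeps its old ratio):
|
|     def location(sector, width):
|         """Map a logical sector index to (row, disk) in a row-major layout."""
|         return sector // width, sector % width
|
|     # One old block written as 3 data + 2 parity = 5 sectors,
|     # starting at logical sector 7.
|     block_sectors = range(7, 12)
|
|     for width in (5, 6):  # before and after adding one disk
|         print(f"width={width}:", [location(s, width) for s in block_sectors])
|
| With width=5 the block's last two sectors wrap onto the next row; with width=6 the same five sectors simply land on different disks. Either way the block still spans five sectors, which is why old data keeps its 3+2 ratio and only new writes use 4+2.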
|
| jedberg wrote:
| I wish Apple and Oracle had just sorted things out and made ZFS the main filesystem for the Mac. Way back in the day when they first designed Time Machine, it was supposed to just be a GUI for ZFS snapshots.
|
| How cool would it be if we had a great GUI for ZFS (snapshots, volume management, etc.). I could buy a new external disk, add it to a pool, have seamless storage expansion.
|
| It would be great. Ah, what could have been.
|
| xoa wrote:
| Yes, this will always depress me, and to me personally be one of the ultimate evils of the court-invented idea of "software patents". I've used ZFS with Macs since 2011, but without Apple onboard it's never been as smooth as it should have been and has gotten more difficult in some respects. There was a small window where we might have had a really universal, really solid FS with great data guarantees and features. All sorts of things with the OS could be so much better, trivial (and even automated) update rollbacks for example. Sigh :(. I hope zvols finally someday get some performance attention so that, if nothing else, running APFS (or any other OS, of course) on top of one is a better experience.
|
| phs318u wrote:
| I had the same thoughts about Apple and (at the time) Sun, 15 years ago.
|
| https://macoverdrive.blogspot.com/2008/10/using-zfs-to-manag...
|
| risho wrote:
| Yeah, imagine a world where you could use Time Machine to back up 500GB Parallels volumes where only the diff was stored between snapshots, rather than needing to back up the whole 500GB volume every single time.
|
| jedberg wrote:
| Right, that would be nice, wouldn't it?
|
| As a workaround, you can create a sparsevolume to store your Parallels volume. Sparsevolumes are stored in bands, and only bands that change get backed up. It might be slightly more efficient.
|
| risho wrote:
| Wow, that sounds like an interesting solution!
|
| globular-toast wrote:
| TrueNAS? You can manage the volumes with a web UI. It has a thing you can enable for SMB shares that lets you see the previous versions of files on Windows. Or perhaps I don't understand what you're after.
|
| jedberg wrote:
| I'm looking for that kind of featureset for my root filesystem on macOS. :/
|
| Terretta wrote:
| > _How cool would it be if we had a great GUI for ZFS (snapshots, volume management, etc.). I could buy a new external disk, add it to a pool, have seamless storage expansion._
|
| See QNAP HERO 5:
|
| https://www.qnap.com/static/landing/2021/quts-hero-5.0/en/in...
|
| NAS Options:
|
| https://www.qnap.com/en-us/product/?conditions=4-3
|
| jedberg wrote:
| As far as I can tell I can't use that as my root filesystem, right?
|
| mustache_kimono wrote:
| > How cool would it be if we had a great GUI for ZFS (snapshots, volume management, etc.).
|
| How cool would it be if we had a great _TUI_ for ZFS...
|
| Live in the now: https://github.com/kimono-koans/httm
|
| ErneX wrote:
| I'm not an expert whatsoever, but what I've been doing for my NAS is using mirrored VDEVs. Started with one and later on added a couple more drives for a second mirror.
|
| Coincidentally, one of the drives of my 1st mirror died a few days ago after rebooting the host machine for updates, and I replaced it today; it's been resilvering for a while.
|
| edmundsauto wrote:
| I've read this is suboptimal because you are now stressing the drive that has the only copy of your data to rebuild. What are your thoughts?
|
| deadbunny wrote:
| Mirrored vdevs resilver a lot faster than RAID-Zx vdevs. Much less chance of the remaining drive dying during a resilver if it takes hours rather than days.
|
| tambourine_man wrote:
| If you don't need real-time redundancy, you may be better served by something like SnapRAID. It's more flexible, can mix mismatched disk sizes, performance requirements are much lower, etc.
|
| hinkley wrote:
| I'm frustrated because this feature was mentioned by Schwartz when it was still in beta. I thought a new era of home computing was about to start. It didn't, and instead we got The Cloud, which feels like decentralization but is in fact massive centralization (organizational, rather than geographical).
|
| Some of us think people should be hosting stuff from home, accessible from their mobile devices. But the first and, to me, one of the biggest hurdles is managing storage. And that requires a storage appliance that is simpler than using a laptop, not requiring the skills of an IT professional.
|
| Drobo tried to make a storage appliance, but once you got to the fine print it had the same set of problems that ZFS still does.
|
| All professional storage solutions are built on an assumption of symmetry of hardware: I have n identical (except not the same batch?) drives which I will smear files out across.
|
| Consumers will _never_ have drive symmetry. That's a huge expenditure that few can justify, let alone afford. My Synology didn't like most of my old drives, so by the time I had a working array I'd spent practically a laptop on it. For a weirdly shaped computer I couldn't actually use directly. I'm a developer, I can afford it. None of my friends can. Mom definitely can't.
|
| A consumer solution needs to assume drive asymmetry. The day it is first plugged in, it will contain a couple of new drives, and every hard drive the consumer can scrounge up from junk drawers - save two: their current backup drive and an extra copy. Once the array (with one open slot) is built and verified, then one of the backups can go into the array for additional space and speed.
|
| From then on, the owner will likely buy one or two new drives every year, at whatever price point they're willing to pay, and swap out the smallest or slowest drive in the array. Meaning the array will always contain 2-3 different generations of hard drives. Never the same speed and never the same capacity. And they expect that if a rebuild fails, some of their data will still be retrievable. Without a professional data recovery company.
|
| Which rules out all RAID levels except 0, which is nuts. An algorithm that can handle this scenario is consistent hashing. Weighted consistent hashing can handle disparate resources by assigning more buckets to faster or larger machines. And it can grow and shrink (in a drive array, the two are sequential or simultaneous).
|
| Small and old businesses come to resemble consumers in their purchasing patterns. They can't afford a shiny new array all at once. It's scrounging and piecemeal. So this isn't strictly about chasing consumers.
|
| I thought ZFS was on a similar path, but the delays in sprouting these features make me wonder.
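|
| A minimal sketch of the weighted consistent hashing idea mentioned above (an illustration of the general technique, not anything ZFS does: each drive gets hash-ring points in proportion to its capacity, so mismatched drives fill roughly in proportion to their size, and adding or removing a drive only reshuffles a bounded share of the data):
|
|     import bisect
|     import hashlib
|
|     def _h(s: str) -> int:
|         # Stable 64-bit value for placing points on the ring.
|         return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")
|
|     class WeightedRing:
|         def __init__(self, drives: dict[str, int], points_per_tb: int = 100):
|             # drives maps a name to its capacity in TB; larger drives get
|             # proportionally more virtual points on the ring.
|             self._ring = sorted(
|                 (_h(f"{name}#{i}"), name)
|                 for name, tb in drives.items()
|                 for i in range(tb * points_per_tb)
|             )
|             self._keys = [k for k, _ in self._ring]
|
|         def drive_for(self, block_id: str) -> str:
|             # Walk clockwise from the block's hash to the next drive point.
|             i = bisect.bisect(self._keys, _h(block_id)) % len(self._ring)
|             return self._ring[i][1]
|
|     # A scrounged-together, asymmetric set of drives.
|     ring = WeightedRing({"old-2tb": 2, "spare-4tb": 4, "new-10tb": 10})
|     print(ring.drive_for("block-12345"))
|
| Redundancy would still have to be layered on top (for example, placing each block on the next few distinct drives around the ring), which is roughly how distributed object stores use the same trick.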
|
| Osiris wrote:
| I completely agree. To build my array I had to buy several drives at the same time. To expand I had to buy a new drive, move the data onto the array, and then I'm left with the extra drive I had to buy to temporarily store the data, because I can't add it to the array.
|
| I would love to have more options for expandable redundancy.
|
| gregmac wrote:
| I can afford it, but have a hard time justifying the costs, not to mention the scrapped (working) hardware and the inconvenience (of swapping to a whole new array).
|
| I started using snapraid [1] several years ago, after finding zfs couldn't expand. Often when I went to add space, the "sweet spot" disk size (best $/TB) was 2-3x the size of the previous biggest disk I ran. This was very economical compared to replacing the whole array every couple of years.
|
| It works by having "data" and "parity" drives. Data drives are totally normal filesystems, joined with unionfs. In fact you can mount them independently and access whatever files are on them. Parity drives just hold a big file that snapraid updates nightly.
|
| The big downside is it's not realtime redundant: you can lose a day's worth of data from a (data) drive failure. For my use case this is acceptable.
|
| A huge upside is rebuilds are fairly painless. Rebuilding a parity drive has zero downtime, just degraded performance. Rebuilding a data drive leaves it offline, but the rest work fine (I think the individual files are actually accessible as they're restored, though). In the worst case you can mount each data drive independently on any system and recover its contents.
|
| I've been running the "same" array for a decade, but at this point every disk has been swapped out at least once (for a larger one), and it's been in at least two different host systems.
|
| [1] https://www.snapraid.it/
|
| mrighele wrote:
| > I'm a developer, I can afford it. None of my friends can. Mom definitely can't.
|
| The only thing that can work for your mom and your friends is, in my opinion, a pair of disks in a mirror. When the space runs out, buy another box with two more disks in a mirror. Anything more than this is not only too complex for the average user but also too expensive.
|
| hinkley wrote:
| Keeping track of a heterogeneous drive array is just as big an imposition.
|
| bartvk wrote:
| Even that sounds complex.
|
| If they want network-attached storage, I'd just use a single-disk NAS and remotely back it up.
|
| Quekid5 wrote:
| I think I would advise against direct mirroring -- instead, I'd do sync-every-24-hours or something similar.
|
| Both schemes are vulnerable to the (admittedly rarer) errors where both drives fail simultaneously (e.g. mobo fried them) or are just... destroyed by a fire or whatever.
|
| A periodic sync (while harder to set up) _will_ occasionally save you from deleting the wrong files, which mirroring doesn't.
|
| Either way: any truly important data (family photos/videos, etc.) needs to be saved periodically to remote storage. There's no getting around that if you _really_ care about the data.
|
| canvascritic wrote:
| This is a really neat addition to RAID-Z. I recall setting up my ZFS pool in the early 2000s and grappling with disk counts because of how rigid expansion was. Good times. This would've made things so much simpler. Small nitpick: in the "during expansion" bit, I thought he could have elaborated a touch on restoring the "health of the raidz vdev" part; I didn't really follow his reasoning there. But overall, looking forward to this update. Nice work.
|
| eminence32 wrote:
| The big news here seems to be that iXsystems (the company behind FreeNAS/TrueNAS) is sponsoring this work now. This PR supersedes one that was opened back in 2021.
|
| https://github.com/openzfs/zfs/pull/12225#issuecomment-16101...
|
| seltzered_ wrote:
| Yep, see also https://freebsdfoundation.org/blog/raid-z-expansion-feature-... (2022)
|
| (Via https://lobste.rs/s/5ahxj1/raid_z_expansion_feature_for_zfs_...)
___________________________________________________________________
(page generated 2023-08-19 23:00 UTC)