[HN Gopher] Bcachefs - A New COW Filesystem
___________________________________________________________________
 
Bcachefs - A New COW Filesystem
 
Author : jlpcsl
Score  : 263 points
Date   : 2023-05-11 08:50 UTC (14 hours ago)
 
(HTM) web link (lore.kernel.org)
(TXT) w3m dump (lore.kernel.org)
 
| graderjs wrote:
| Is there an optimal filesystem, or is it all just trade-offs? And
| how far have we come since we were first creating filesystems
| (Plan 9 or whatever) to now? Has there been any sort of
| technological leap, like a killer algorithm, that really improved
| things?
| dsr_ wrote:
| An optimal filesystem... for what?
| 
| There is no single filesystem which is optimized for
| everything, so you need to specify things like
| 
| cross-platform transportability, network transparency, hardware
| interfaces, hardware capability, reliability requirements,
| cost-effectiveness, required features, expected workload,
| licensing
| 
| and what the track record is in the real world.
| the8472 wrote:
| It's all tradeoffs.
| jacknews wrote:
| "The COW filesystem for Linux that won't eat your data"
| 
| LOL, they know what the problem is at least. I will try it out on
| some old hard disks. The others (esp. looking at you btrfs) are
| not good at not losing your entire volumes when disks start to go
| bad.
| eis wrote:
| I really hope Linux can get a modern FS into common usage (as in
| the default FS for most distros). After more than a decade, ZFS
| and BTRFS haven't gone anywhere. Something that's just there as a
| default, is stable, performs decently (at least at ext4's level)
| and brings modern features like snapshots. Bcachefs seems to have
| a decent shot.
| 
| What I'd like to see even more, though, would be a switch from
| the existing POSIX-based filesystem APIs to a transaction-based
| system. It is way too complicated to do filesystem operations
| that are not prone to data corruption should there be any issues.
| viraptor wrote:
| Btrfs is the default on a few systems already, like Fedora,
| SUSE, Garuda, easynas, rockstor, and some others. It's not the
| default in Ubuntu and Debian, but I wouldn't say it didn't go
| anywhere either.
| jadbox wrote:
| I'm using Btrfs on Fedora (by default install) and it's been
| great over the last year.
| 
| The only thing to be aware of is to disable CoW/hashing on
| database stores or streaming download folders. Otherwise
| it'll rehash each file update, which isn't needed.
| giantrobot wrote:
| Why is it hashing files and not blocks? If a block is
| hashed and written there's no need to touch it again.
| dnzm wrote:
| I've been running it on my NAS-slash-homeserver for... 5 or
| 6 years now, I think. Root on a single SSD, data on a few
| HDDs in RAID1. It's been great so far. My desktops are all
| btrfs too, and the integration between OpenSUSE's package
| manager and btrfs snapshots has been useful more than once.
| curt15 wrote:
| It looks like Fedora's adoption of btrfs unearthed another
| data corruption bug recently:
| https://bugzilla.redhat.com/show_bug.cgi?id=2169947
| hackernudes wrote:
| Wow, that's funny - almost looks like bcachefs explaining a
| similar issue here
| https://lore.kernel.org/lkml/20230509165657.1735798-7-kent.o...
| johnisgood wrote:
| HAMMER2 supports snapshots. I do not have any experience with
| it though.
| aidenn0 wrote:
| Is HAMMER2 supported on Linux? I thought it was Dragonfly
| only.
| joshbaptiste wrote:
| yup, Dragonfly and soon NetBSD
| https://www.phoronix.com/news/NetBSD-HAMMER2-Port
| renewiltord wrote:
| Is there a high-performance in-kernel FS that acts as a
| hierarchical cache that I can export over NFS? Presently I use
| `catfs` over `goofys` and then I export the `catfs` mount.
| cdavid wrote:
| Not sure I understand your use case, but if you have to use
| nfs, cachefilesd is very effective for read-heavy workloads:
| https://access.redhat.com/documentation/en-us/red_hat_enterp...
| throw0101b wrote:
| > _These are RW btrfs-style snapshots_
| 
| There's a word for 'RW snapshots': clones. E.g.
| 
| * https://docs.netapp.com/us-en/ontap/task_admin_clone_data.ht...
| 
| * http://doc.isilon.com/onefs/9.4.0/help/en-us/ifs_t_clone_a_f...
| 
| * https://openzfs.github.io/openzfs-docs/man/8/zfs-clone.8.htm...
| 
| * http://www.voleg.info/lvm2-clone-logical-volume.html
| 
| In every other implementation I've come across, the word
| "snapshot" is about read-only copies. I'm not sure why btrfs (and
| now bcachefs?) thinks it needs to muddy the nomenclature waters.
| webstrand wrote:
| Cloning can also mean simple duplication. I think calling it a
| RW snapshot is clearer because a snapshot generally doesn't
| mean simple duplication.
| throw0101a wrote:
| > _I think calling it a RW snapshot_ [...]
| 
| So what do you call a RO snapshot? Or do you now need to
| write the prefix "RO" or "RW" _everywhere_ when referring to
| a "snapshot"?
| 
| How do CLI commands work? Will you have "btrfs snapshot" and
| then have to always define whether you want RO or RW on every
| invocation? This smells like git's bad front-end CLI
| porcelain all over again (regardless of how nice the back-end
| plumbing may be).
| 
| This is a solved problem with an established nomenclature
| IMHO: just use the already-existing nouns/CLI-verbs of
| "snapshot" and "clone".
| 
| > [...] _is clearer because a snapshot generally doesn't
| mean simple duplication._
| 
| A snapshot generally means a static copy of the data; with
| bcachefs (and ZFS and btrfs) being CoW, new copies are not
| needed unless/until the source is altered.
| 
| If you want deduplication, use "dedupe" in your CLI.
| nextaccountic wrote:
| > So what do you call a RO snapshot
| 
| There should be no read-only snapshot: it's just a writable
| snapshot where you don't happen to perform a write
| throw0101a wrote:
| > _There should be no read-only snapshot: it's just a
| writable snapshot where you don't happen to perform a
| write_
| 
| So when malware comes along and goes after the live copy,
| and happens to find the 'snapshot', but is able to hose
| that snapshot data as well, the solution is to go to
| tape?
| 
| As opposed to any other file system that implements read-
| only snapshots: if the live copy is hosed, one can simply
| clone/revert to the read-only copy. (This is not a
| hypothetical: I've done this personally.)
| 
| (Certainly one should have off-device/site backups, but
| being able to do a quick revert is great for MTTR.)
| londons_explore wrote:
| I would like to see filesystems benchmarked for robustness.
| 
| Specifically, robustness to everything around them not performing
| as required. For example, imagine an evil SSD which had a 1%
| chance of rolling a sector back to a previous version, a 1%
| chance of saying a write failed when it didn't, a 1% chance of
| writing data to the wrong sector number, a 1% chance of flipping
| some bits in the written data, and a 1% chance of disconnecting
| and reconnecting a few seconds later.
| 
| Real SSDs have bugs that make them do all of these things.
| 
| Given this evil SSD, I want to know how long the filesystem can
| keep going serving the user's use case.
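| 
| (For concreteness: part of this evil SSD can be approximated
| today with dm-flakey, which can silently drop writes during a
| periodic "down" window. A minimal sketch, assuming a scratch
| disk at /dev/sdb -- the device name and the 55s-up/5s-down
| cycle are made up:
| 
|     SIZE=$(blockdev --getsz /dev/sdb)
|     dmsetup create evil --table \
|         "0 $SIZE flakey /dev/sdb 0 55 5 1 drop_writes"
|     mkfs.ext4 /dev/mapper/evil
| 
| As far as I know there's no off-the-shelf injector for the
| sector-rollback or wrong-sector failure modes, though.)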
| crest wrote:
| A 1% error rate for corrupting other blocks is prohibitive. A
| file system would have to do extensive forward error correction
| in addition to checksumming to have a chance of working with
| this. It would also have to perform a lot of background
| scrubbing to stay ahead of the rot. While interesting to model,
| and maybe even relevant as a research problem given the steadily
| worsening bandwidth-to-capacity ratio of affordable bulk
| storage, I don't expect there are many users willing to accept
| the overhead required to come even close to a usable file
| system on a device as bad as the one you described.
| comex wrote:
| Well, it's sort of redundant. According to [1], the raw bit
| error rate of the flash memory inside today's SSDs is already
| in the 0.1%-1% range. And so the controllers inside the SSDs
| already do forward error correction, more efficiently than
| the host CPU could do it since they have dedicated hardware
| for it. Adding another layer of error correction at the
| filesystem level could help with some of the remaining
| failure modes, but you would still have to worry about RAM
| bitflips after the data has already been read into RAM and
| validated.
| 
| [1] https://ieeexplore.ieee.org/document/9251942
| simcop2387 wrote:
| ZFS will do this. Give it a RAIDz-{1..3} setup and you've got
| the FEC/parity calculations that happen. Every read has its
| checksum checked, and when reading, if it finds issues it'll
| start resilvering them asap. You are of course right in that
| it will eventually get worse and worse performance, as it has
| to do much more rewriting and full-on scrubbing if errors
| keep happening, but it can generally handle things pretty
| well.
| Dylan16807 wrote:
| I don't know...
| 
| Let's say you have 6 drives in raidz2. If you have a 1%
| silent failure chance per block, then writing a set of 6
| blocks has a 0.002% silent failure rate. And ZFS doesn't
| immediately verify writes, so it won't try again.
| 
| If that's applied to 4KB blocks, then we have a 0.002%
| failure rate per 16KB of data. It will take about 36
| thousand sets of blocks to reach 50% odds of losing data,
| which is only half a gigabyte. If we look at the larger
| block ZFS uses internally then it's a handful of gigabytes.
| 
| And that's without even adding the feature where writing
| one block will corrupt other blocks.
| toxik wrote:
| With this level of adversarial problems, you'd better formulate
| your whole IO stack as a two-player minimax game.
| throw0101a wrote:
| > _For example, imagine an evil SSD which had a 1% chance of
| rolling a sector back to a previous version, a 1% chance of
| saying a write failed when it didn't, a 1% chance of writing
| data to the wrong sector number, a 1% chance of flipping some
| bits in the written data, and a 1% chance of disconnecting and
| reconnecting a few seconds later._
| 
| There are stories from the ZFS folks of dealing with these
| issues and things ran just fine.
| 
| While not directly involved with ZFS development (IIRC), Bryan
| Cantrill was very 'ZFS-adjacent' since he used Solaris-based
| systems for a lot of his career, and he has several rants about
| firmware that you can find online.
| 
| A video that went viral many years ago, with Cantrill and
| Brendan Gregg, is "Shouting in the Datacenter":
| 
| * https://www.youtube.com/watch?v=tDacjrSCeq4
| 
| * https://www.youtube.com/watch?v=lMPozJFC8g0 (making of)
| amluto wrote:
| ISTM one could design a filesystem as a Byzantine-fault-
| tolerant distributed system that happens to have many nodes
| (disks) partially sharing hardware (CPU, memory, etc). The
| result would not look that much like RAID, but would look quite
| a bit like Ceph and its relatives.
| 
| Bonus points for making the result efficiently support multiple
| nodes, each with multiple disks.
| PhilipRoman wrote:
| I have basically the opposite problem. I've been looking for a
| filesystem that maximizes performance (and minimizes actual
| disk writes) at the cost of reliability. As long as it loses
| all my data less than once a week, I can live with it.
| viraptor wrote:
| Have you tried allowing ext4 to ignore all safety?
| data=writeback, barrier=0, bump up dirty_ratio, tune
| ^has_journal, maybe disable flushes with
| https://github.com/stewartsmith/libeatmydata
| PhilipRoman wrote:
| Thanks, this looks promising
| the8472 wrote:
| You can also add journal_async_commit,noauto_da_alloc
| 
| > maybe disable flushes with
| https://github.com/stewartsmith/libeatmydata
| 
| overlayfs has a volatile mount option that has that effect.
| So stacking a volatile overlayfs with the upper and lower
| on the same ext4 could provide that behavior even for
| applications that can't be intercepted with LD_PRELOAD
| seunosewa wrote:
| How would you cope with losing all your data once a week?
| PhilipRoman wrote:
| "Once a week" was maybe too extreme an example. For my case
| specifically: lost data can be recomputed. Basically a
| bunch of compiler outputs, indexes and analysis results on
| the input files, typically an order of magnitude larger
| than the original files themselves.
| 
| Any files that are important would go to a separate, more
| reliable filesystem (or be uploaded elsewhere).
| kadoban wrote:
| On top of other suggestions I've seen you get already,
| raid0 might be worth looking at. That has some good speed
| vs reliability tradeoffs (in the direction you want).
| dur-randir wrote:
| Some video production workflows are run on 4xraid0 just for
| the speed - it fails rarely enough and intermediate output
| is just re-created.
| desro wrote:
| Can confirm. When I can't work off my internal MacBook
| storage, my working drive is a RAID0 NVME array over
| Thunderbolt. Jobs set up in Carbon Copy Cloner make
| incremental hourly backups to a NAS on site as well as a
| locally-attached RAID6 HDD array.
| 
| If the stripe dies, worst case is I lose up to one hour
| of work, plus let's say another hour copying assets back
| to a rebuilt stripe.
| 
| There are _so many_ MASSIVE files created in intermediate
| stages of audiovisual [post] production.
| sph wrote:
| It all depends on how much reliability you are willing to
| give up for performance.
| 
| Because I have the best storage performance you'll ever find
| anywhere, 100% money-back guaranteed: write to /dev/null. It
| comes with the downside of 0% reliability.
| 
| You can write to a disk without a file-system, sequentially,
| until space ends. Quite fast actually, and reliable, until
| you reach the end, then reliability drops dramatically.
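| 
| (A minimal sketch of that approach, assuming a dedicated
| scratch device at /dev/sdX and offsets tracked by the
| application -- every name here is made up:
| 
|     dd if=chunk.bin of=/dev/sdX bs=1M seek="$NEXT_MB" oflag=direct
| 
| seek= is counted in 1M output blocks, so the application just
| keeps bumping NEXT_MB until the device runs out.)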
| [deleted]
| PhilipRoman wrote:
| Yeah, I've had good experience with bypassing the fs layer in
| the past; especially on an HDD the gains can be insane. But
| it won't help, as I still need a more-or-less posixy
| read/write API.
| 
| P.S. I'm fairly certain that /dev/null would lose my data a
| bit more often than once a week.
| jasomill wrote:
| Trouble is you can't use /dev/null as a filesystem, even
| for testing.
| 
| On a related note, though, I've considered the idea of
| creating a "minimally POSIX-compliant" filesystem that
| randomly reorders and delays I/O operations whenever
| standards permit it to do so, along with any other odd
| behavior I can find that remains within the _letter_ of
| published standards (unusual path limitations, support for
| exactly two hard links per file, sparse files that require
| holes to be aligned on 4,099-byte boundaries in spite of
| the filesystem's reported 509-byte block size, etc., all
| properly reported by applicable APIs).
| dralley wrote:
| Cue MongoDB memes
| ilyt wrote:
| Probably just using it for cache of some kind
| magicalhippo wrote:
| Tongue-in-cheek solution: use a ramdisk[1] for dm-writecache
| in writeback mode[2]?
| 
| [1]: https://www.kernel.org/doc/Documentation/blockdev/ramdisk.tx...
| 
| [2]: https://blog.delouw.ch/2020/01/29/using-lvm-cache-for-storag...
| bionade24 wrote:
| Not sure if this is feasible, but have you considered dumping
| binary data on the raw disk like it's done with tapes?
| crabbone wrote:
| You could even use partitions as files. You could only have
| 128 files, but maybe that's enough for OP?
| rwmj wrote:
| Don't you want a RAM disk for this? It'll lose all your data
| (reliably!) when you reboot.
| 
| You could also look at this:
| https://rwmj.wordpress.com/2020/03/21/new-nbdkit-remote-tmpf...
| We use it for Koji builds, where we actually don't care about
| keeping the build tree around (we persist only the built
| objects and artifacts elsewhere). This plugin is pretty fast
| for this use case because it ignores FUA requests from the
| filesystem. Obviously don't use it where you care about your
| data.
| ilyt wrote:
| > Don't you want a RAM disk for this? It'll lose all your
| data (reliably!) when you reboot.
| 
| Uhh hello, pricing?
| fwip wrote:
| Depends on how much space you need.
| mastax wrote:
| You should be able to do this with basically any file system
| by using the mount options `async` (default) and `noatime`,
| disabling journalling, and massively increasing
| vm.dirty_background_ratio, vm.dirty_ratio, and
| vm.dirty_expire_centisecs.
| rwmj wrote:
|     nbdkit memory 10G --filter=error error-rate=1%
| 
| ... and then nbd-loop-mount that as a block device and create
| your filesystem on top.
| 
| Notes:
| 
| We're working on making a ublk interface to nbdkit plugins so
| the loop mount wouldn't be needed.
| 
| There are actually better ways to use the error filter, such as
| triggering it from a file; see the manual:
| https://www.libguestfs.org/nbdkit-error-filter.1.html
| 
| It's an interesting idea to have an "evil" filter that flips
| bits at random. I might write that!
| antongribok wrote:
| How does this compare with dm-flakey [0]?
| 
| [0]: https://www.kernel.org/doc/html/latest/admin-guide/device-ma...
| Dwedit wrote:
| Suddenly I'm reminded of the time someone made a Bad Internet
| simulator (causing packet loss or other problems) and named
| the program "Comcast".
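| 
| (If it's the tool I'm thinking of -- github.com/tylertreat/comcast,
| a wrapper around tc and iptables -- usage was roughly the
| following; treat the exact flags as from memory:
| 
|     comcast --device=eth0 --latency=250 --packet-loss=10%
|     comcast --stop
| )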
| seized wrote:
| I've had ZFS pools survive (at different times over the span of
| years):
| 
| - A RAIDz1 (RAID5) pool with a second disk starting to fail
| while rebuilding from an earlier disk failure (data was fine)
| 
| - A water-to-air CPU cooler leaking; the CPU overheated and the
| water shorted and killed the HBA running a pool (data was fine)
| 
| - An SFF-8088 cable half plugged in for months; the pool would
| sometimes hiccup, throw off errors, take a while to list files,
| but worked fine after plugging it in properly (data was fine
| after)
| 
| Then the usual disk failures, which are a non-event with ZFS.
| gigatexal wrote:
| This is why I always opt for ZFS.
| ilyt wrote:
| I recovered from a 3-disk RAID 6 failure (which itself was
| driven by organizational failure...) in Linux's mdadm...
| ddrescue to the rescue. I guess I got "lucky" that the bad
| blocks didn't happen in the same place on all drives (one
| died, another started returning bad blocks), but the chances
| of that happening were infinitesimally small in the first
| place.
| 
| So _shrug_
| avianlyric wrote:
| How do you know you got lucky with corrupted blocks?
| 
| mdadm doesn't checksum data, and just trusts the HDD to
| either return correct data, or an error. But HDDs return
| incorrect data all the time, their specs even tell you how
| much incorrect data they'll return, and for anything over
| about 8TB you're basically guaranteed some silent
| corruption if you read every byte.
| johnmaguire wrote:
| Yes, I also had a "1 disk failed, second disk failed during
| rebuild" event (like the parent, not your story) with mdadm
| & RAID 6 with no issues.
| 
| People seem to love ZFS but I had no issues running mdadm.
| I'm now running a ZFS pool and so far it's been more work
| (and things to learn), requires a lot more RAM, and the
| benefits are... escaping me.
| ysleepy wrote:
| Are you sure the data survived? ZFS is sure and proves it
| with checksums over metadata and data. I don't know mdadm
| well enough to know if it does this too.
| 112233 wrote:
| Please, where can I read more about this. I remember bricking
| an OCZ drive by setting the ATA password, as was fashionable to
| do back then, but 1% of writes going to the wrong sector - what
| are these drives, fake sd cards from aliexpress?
| 
| Like, which manufacturer goes, like "tests show we cannot write
| more than 400kB without corrupting the drive, let us ship
| this!"?
| jlokier wrote:
| _> Like, which manufacturer goes, like "tests show we cannot
| write more than 400kB without corrupting the drive, let us ship
| this!"?_
| 
| According to https://www.sqlite.org/howtocorrupt.html there
| are such drives:
| 
| _4.2. Fake capacity USB sticks_
| 
| _There are many fraudulent USB sticks in circulation that
| report to have a high capacity (ex: 8GB) but are really only
| capable of storing a much smaller amount (ex: 1GB). Attempts
| to write on these devices will often result in unrelated
| files being overwritten. Any use of a fraudulent flash memory
| device can easily lead to database corruption, therefore.
| Internet searches such as "fake capacity usb" will turn up
| lots of disturbing information about this problem._
| 
| Bit flips and overwriting of wrong sectors as unrelated files
| are written are also mentioned. You might think this sort of
| thing is just cheap USB flash drives, but I've been told
| about NVMe SSDs violating their guarantees and causing very
| strange corruption patterns too. Unfortunately, when the cause
| is a bug in the storage device's own algorithms, the rare
| corruption event is not necessarily limited to a few bytes or
| just 1 sector here or there, nor to just the sectors being
| written.
| 
| I don't know how prevalent any of these things are really.
| The sqlite.org page says "most" consumer HDDs lie about
| committing data to the platter before reporting they've done
| so, but when I worked with ext3 barriers back in the mid
| 2000s, the HDDs I tested had timing consistent with flushing
| write cache correctly, and turning off barriers did in fact
| lead to observable filesystem corruption on power loss, which
| was prevented by turning on barriers. The barriers were so
| important they made the difference between embedded devices
| that could be reliably power cycled, vs those which didn't
| reliably recover on boot.
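| 
| (A quick smoke test for the cache-flush honesty question,
| assuming a throwaway file on the disk under test:
| 
|     dd if=/dev/zero of=./flush-test bs=4k count=1000 oflag=dsync
| 
| Each write is supposed to reach stable storage before dd
| continues, so a spinning disk should manage on the order of
| 100-200 of them per second. If dd reports thousands per
| second, something in the stack is acknowledging writes it
| hasn't persisted.)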
| Dylan16807 wrote:
| > Fake capacity USB sticks
| 
| Those drives have a sudden flip from 0% corruption to
| 95-100% corruption when you hit their limits. I wouldn't
| count that as the same thing. And you can't reasonably
| expect anything to work on those.
| 
| > The sqlite.org page says "most" consumer HDDs lie about
| committing data to the platter before reporting they've done
| so, but when I worked with ext3 barriers back in the mid
| 2000s, the HDDs I tested had timing consistent with flushing
| write cache correctly
| 
| Losing a burst of writes every once in a while also
| manifests extremely differently from a steady 1% loss and
| needs to be handled in a very different way. And if it's at
| power loss, it might be as simple as rolling back the last
| checkpoint during mount if verification fails.
| londons_explore wrote:
| Writes going to the wrong sector are usually wear-levelling
| algorithms gone wrong. Specifically, it normally means the
| information about which logical sector maps to which physical
| sector was updated not in sync with the actual writing of the
| data. This is a common performance 'trick' - by delaying and
| aggregating these bookkeeping writes, and taking them off the
| critical path, you avoid writing so much data and the user
| sees lower latency.
| 
| However, if something like a power failure or firmware crash
| happens, and the bookkeeping writes never happen, then the
| end result that the user sees after a reboot is their data
| written to the wrong sector.
| jeffbee wrote:
| But that would require Linux hackers to read and understand the
| literature, and absorb the lessons of industry practice,
| instead of just blurting out their aesthetic ideal of a
| filesystem.
| sangnoir wrote:
| Isn't Linux the most deployed OS in industry (by
| practitioners)? Are the Linux hyperscalers hiding their FS
| secret sauce, or perhaps are the "aesthetic ideal" filesystems
| available to Linux good enough?
| jeffbee wrote:
| I imagine the hyperscalers are all handling integrity at a
| higher level, where individual filesystems on single hosts
| are irrelevant to the outcome. In such applications, any
| old filesystem will do.
| 
| For people who do not have application-level integrity, the
| systems that offer robustness in the face of imperfect
| storage devices are sold by companies like NetApp, which a
| lot of people would sneer at, but they've done the math.
| Datagenerator wrote:
| Seen NetApp's boot messages, it's FreeBSD under the hood
| seized wrote:
| As is EMC Isilon.
| jeffbee wrote:
| They have their own filesystem with all manner of
| integrity protection.
| https://en.wikipedia.org/wiki/Write_Anywhere_File_Layout
| deathanatos wrote:
| On the whole, I'm not sure that a FS can work around such a
| byzantine drive, at least not if it's the only such drive in the
| system. I'd rather FSes not try to pave over these: these disks
| are faulty, and we need to demand, with our wallets, better-
| quality hardware.
| 
| > _1% chance of disconnecting and reconnecting a few seconds
| later_
| 
| I actually have such an SSD. It's unusable when it is in that
| state. The FS doesn't corrupt the data, but it's hard for the
| OS to make forward progress, and obviously a lot of writes fail
| at the application level. (It's a shitty USB implementation on
| the disk: it disconnects if it's on a USB-3-capable port and
| too much transfer occurs. It's USB-2, though; connecting it to
| a USB-2-only port makes it work just fine.)
| mprovost wrote:
| At the point where it disappears for seconds, it's a
| distributed system, not an attached disk. At this point you
| have to start applying the CAP theorem.
| 
| At least in Unix the assumption is that disks are always
| attached (and reliable...), so write errors don't typically
| bubble up to the application layer. This is why losing an NFS
| mount typically just hangs the system until it recovers.
| deathanatos wrote:
| > _At least in Unix the assumption is that disks are always
| attached (and reliable...)_
| 
| I want to say a physical disk being yanked from the system
| (essentially what was happening, as far as the OS could
| tell) does cause I/O errors in Linux? I could be wrong
| though, this isn't exactly something I try to exercise.
| 
| As for it being a distributed system... I suppose? But
| that's what an FS's log is for: when the drive reconnects,
| there will either be a pending WAL entry, or not. If there
| is, the write can be persisted; otherwise, it is lost. But
| consistency should still happen.
| 
| Now, an _app_ might not be ready for that, but that's an
| app bug.
| 
| But it can always happen that the power goes out, which in
| my situation is equivalent to a disk yank. There's also
| small children, the SO tripping over a cable, etc.
| 
| But these are different failure modes from some of what the
| above post listed, such as disks undoing acknowledged
| writes, or lying about having persisted the write. (Some of
| the examples are byzantine, some are not.)
| [deleted]
| brnt wrote:
| Erasure coding at the filesystem level? Finally!
| 
| I've not dared try bcachefs out though; I'm quite wary of data
| loss, even on my laptop. Does anyone have experience to share?
| BlackLotus89 wrote:
| Had (have) a laptop that crashed reproducibly when touched
| wrong. Had a few btrfs corruptions on it and after a while I'd
| had enough. It has been running bcachefs as rootfs for a few
| years now and I've had no issue whatsoever with it. Home is
| still btrfs (for reasons) and I've had no data loss on that
| either. The only problems I had were fixed by booting and
| mounting it through a rescue system (no fsck necessary); that
| happened twice in 2 years or so. Was too lazy to check what
| the bcachefs hook (AUR package) does wrong.
| 
| Edit: Reasons for home being btrfs. I set this up a long
| fucking time ago and it was more or less meant as a stress
| test for bcachefs. Since I didn't want data loss on important
| data (like my home) I left my home as btrfs
| orra wrote:
| Oh! This is very exciting. Bcachefs could be the next-gen
| filesystem that Linux needs[1].
| 
| Advantages over other filesystems:
| 
| * ext4 or xfs -- these two only checksum the filesystem
| metadata, not your data
| 
| * zfs -- zfs is technically great, but binary distribution of the
| zfs code is tricky, because the CDDL is GPL-incompatible
| 
| * btrfs -- btrfs still doesn't have reliable RAID5
| 
| [1] It's been in development for a number of years. It now being
| proposed for inclusion in the mainline kernel is a major
| milestone.
| vladvasiliu wrote:
| > * zfs -- zfs is technically great, but binary distribution of
| the zfs code is tricky, because the CDDL is GPL-incompatible
| 
| Building your own ZFS module is easy enough, for example on
| Arch with zfs-dkms.
| 
| But there's also the issue of compatibility. Sometimes kernel
| updates will break ZFS. Even minor ones: 6.2.13 IIRC broke it,
| whereas 6.2.12 was fine.
| 
| Right now, 6.3 seems to introduce major compatibility problems.
| 
| ---
| 
| edit: looking through the openzfs issues, I was likely thinking
| of 6.2.8 breaking it, where 6.2.7 was fine. Point stands,
| though. https://github.com/openzfs/zfs/issues/14658
| 
| Regarding 6.3 support, it apparently is merged in the master
| branch, but there's no release as of yet.
| https://github.com/openzfs/zfs/issues/14622
| kaba0 wrote:
| It might help someone: nixos can be configured to always use
| the latest kernel version that is compatible with zfs. I
| believe it's
| config.boot.zfs.package.latestCompatibleLinuxPackages .
| bjoli wrote:
| What is the legal situation of doing that? If I had a company
| I wouldn't want to get in trouble with any litigious
| companies.
| boomboomsubban wrote:
| Unless you're distributing, I don't see how anybody could
| do anything. Personal (or company-wide) use has always
| allowed the mixing of basically any licenses.
| 
| The worst-case scenarios would be something like Ubuntu
| being unable to provide compiled modules, but dkms would
| still be fine. Or the very unlikely ZFS on Linux getting
| sued, but that would involve a lengthy trial that would
| allow you to move away from OpenZFS.
| chasil wrote:
| The danger is specifically to the copyright holders of
| Linux - the authors who have code in the kernel. If they
| do not defend their copyright, then it is not strong and
| can be broken in certain scenarios.
| 
| "Linux copyright holders in the GPL Compliance Project
| for Linux Developers believe that distribution of ZFS
| binaries is a GPL violation and infringes Linux's
| copyright."
| 
| Linux bundling ZFS code would bring this text against the
| GPL: "You may not offer or impose any terms on any
| Covered Software in Source Code form that alters or
| restricts the applicable version of [the CDDL]."
| 
| Ubuntu distributes ZFS as an out-of-tree module, which
| taints the kernel immediately at installation. Hopefully,
| this is enough to prevent a great legal challenge.
| 
| https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/
| boomboomsubban wrote:
| Yes, distribution has legal risks. Use does not; it only
| has the risk that they are unable to get ZFS distributed.
| vladvasiliu wrote:
| The ArchZFS project distributes binary kernel images with
| ZFS integrated. I don't know what the legal situation is
| for that.
| 
| In my case, the Arch package is more of a "recipe maker".
| It fetches the Linux headers and the zfs source code and
| compiles this for local use. As far as they are concerned,
| there is no distribution of the resulting artifact.
| IANAL, but I think if there's an issue with that, then OpenZFS
| is basically never usable under Linux.
| 
| Other companies distributed kernels with zfs support
| directly, such as Ubuntu. I don't recall there being news
| of them being sued over this, but maybe they managed to
| work something out.
| 5e92cb50239222b wrote:
| archzfs does not distribute any kernel images; they only
| provide pre-built modules for the officially supported
| kernels.
| dsr_ wrote:
| IANAL.
| 
| Oracle is very litigious. However, OpenZFS has been
| releasing code for more than a decade. Ubuntu shipped
| integrated ZFS/Linux in 2016. It's certain that Oracle
| knows all about it and has decided that being vague is more
| in their interests than actually settling the matter.
| 
| On my list of potential legal worries, this is not a
| priority for me.
| cduzz wrote:
| I would add to this "IANAL But" list
| 
| https://aws.amazon.com/fsx/openzfs/
| 
| So -- AWS / Amazon are certainly big enough to have
| reviewed the licenses and have some understanding of the
| potential legal risks of this.
| orra wrote:
| You're right that DKMS is fairly easy (at least until you
| enable secure boot).
| 
| > Even minor ones: 6.2.13 IIRC broke it, whereas 6.2.12 was
| fine.
| 
| Interesting!
| 
| It's just a shame the license has hindered adoption. Ubuntu
| were shipping binary ZFS modules at one point, but they have
| walked back from that.
| vladvasiliu wrote:
| > You're right that DKMS is fairly easy (at least until you
| enable secure boot).
| 
| Still easy. Under Arch, the kernel image isn't signed, so
| if you enable secure boot you need to fiddle with signing
| on your own. At that point, you can just sign the kernel
| once the module is built. Works fine for me.
| mustache_kimono wrote:
| > Ubuntu were shipping binary ZFS modules at one point, but
| they have walked back from that.
| 
| This is incorrect? Ubuntu is still shipping binary modules.
| orra wrote:
| Right, but various things point to ZFS being de facto
| deprecated:
| https://www.omgubuntu.co.uk/2023/01/ubuntu-zfs-support-statu...
| mustache_kimono wrote:
| > various things point to ZFS being de facto deprecated
| 
| I'm not sure that's the case? Your link points to the ZFS
| _on root install_ being deprecated on the _desktop_. I'm
| not sure what inference you/we can draw from that,
| considering ZFS is a major component of LXD, and Ubuntu
| and Linux's sweet spot is as a server OS.
| 
| > Ubuntu were shipping binary ZFS modules at one point,
| but they have walked back from that.
| 
| Not to be persnickety, but this was your claim, and
| Ubuntu is still shipping ZFS binary modules on all its
| current releases.
| orra wrote:
| Yeah, my wording was clumsy, but thanks for assuming good
| faith. I essentially meant their enthusiasm had waned.
| 
| It's good you can give reasons ZFS is still important to
| Ubuntu on the server, although as a desktop user I'm sad
| nobody wants to ship ZFS for the desktop.
| ilyt wrote:
| > [1] It's been in development for a number of years. It now
| being proposed for inclusion in the mainline kernel is a major
| milestone.
| 
| Not a measure of quality in the slightest. btrfs had some
| serious bugs over the years despite being in mainline.
| orra wrote:
| True, but bcachefs gives the impression of being better
| designed, and of not being rushed upstream. I think it helps
| that bcachefs evolved from bcache.
| the8472 wrote:
| > zfs is technically great
| 
| It's only great due to the lack of competitors in the
| checksummed-CoW-raid category. It lacks a bunch of things:
| Defrag, Reflinks, On-Demand Dedup, Rebalance (online raid
| geometry change, device removal, device shrink). It also wastes
| RAM due to page cache + ARC.
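| 
| (For contrast, btrfs does the online geometry change with
| balance filters; a sketch, with a hypothetical mount point
| and device:
| 
|     btrfs balance start -dconvert=raid10 -mconvert=raid1 /mnt
|     btrfs device remove /dev/sdd /mnt
| 
| Both run while the filesystem stays mounted and in use.)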
| macdice wrote:
| Reflinks and copy_file_range() are just landing in OpenZFS
| now, I think? (Block cloning)
| pongo1231 wrote:
| Block cloning support has indeed recently landed in git and
| already allows for reflinks under FreeBSD. It still has to
| be wired up for Linux, though.
| mustache_kimono wrote:
| Really excited about this.
| 
| Once support hits in Linux, a little app of mine[0] will
| support block cloning for its "roll forward" operation,
| where all previous snapshots are preserved, but a
| particular snapshot is rolled forward to the live
| dataset. Right now, data is simply diff-copied in chunks.
| When this support hits, there will be no need to copy any
| data. Blocks written to the live dataset can just be
| references to the underlying snapshot blocks, and no
| extra space will need to be used.
| 
| [0]: https://github.com/kimono-koans/httm
| nextaccountic wrote:
| What does it mean to roll forward? I read the linked
| GitHub page and I don't get what is happening
| 
| > Roll forward to a previous ZFS snapshot, instead of
| rolling back (this avoids destroying interstitial
| snapshots):
| 
|     sudo httm --roll-forward=rpool/scratch@snap_2023-04-01-15:26:06_httmSnapFileMount
|     [sudo] password for kimono:
|     httm took a pre-execution snapshot named: rpool/scratch@snap_pre_2023-04-01-15:27:38_httmSnapRollForward
|     ...
|     httm roll forward completed successfully.
|     httm took a post-execution snapshot named: rpool/scratch@snap_post_2023-04-01-15:28:40_:snap_2023-04-01-15:26:06_httmSnapFileMount:_httmSnapRollForward
| mustache_kimono wrote:
| From the help and man page[0]:
| 
|     --roll-forward="snap_name"
|         traditionally 'zfs rollback' is a destructive
|         operation, whereas httm roll-forward is
|         non-destructive. httm will copy only files and
|         their attributes that have changed since a
|         specified snapshot, from that snapshot, to its
|         live dataset. httm will also take two
|         precautionary snapshots, one before and one after
|         the copy. Should the roll forward fail for any
|         reason, httm will roll back to the pre-execution
|         state. Note: This is a ZFS only option which
|         requires super user privileges.
| 
| I might also add that 'zfs rollback' is a destructive
| operation because it destroys snapshots between the
| current live version of the filesystem and the rollback
| snapshot target (the 'interstitial' snapshots). Imagine
| you have ransomware installed and you _need_ to roll
| back, but you want to view the ransomware's operations
| through snapshots for forensic purposes. You can do that.
| 
| It's also faster than a checksummed rsync, because it
| makes its determination based on the underlying ZFS
| checksums, and more accurate than a non-checksummed
| rsync.
| 
| This is a relatively minor feature re: httm. I recommend
| installing and playing around with it a bit.
| 
| [0]: https://github.com/kimono-koans/httm/blob/master/httm.1
| nextaccountic wrote:
| What I don't understand is: aren't zfs snapshots
| writable, like in btrfs?
| 
| If I wanted to roll back the live filesystem to a
| previous snapshot, why couldn't I just start writing into
| the snapshot instead? (Or create another snapshot that is
| a clone of the old one, and write into it)
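| 
| (In btrfs terms I mean something like the following, with
| made-up paths:
| 
|     btrfs subvolume snapshot /tank/data /tank/data.rw
| 
| which is writable by default; you have to pass -r to get a
| read-only snapshot.)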
| throw0101a wrote:
| > _What I don't understand is: aren't zfs snapshots
| writable, like in btrfs?_
| 
| ZFS snapshots, following the historic meaning of
| "snapshot", are read-only. ZFS supports _cloning_ of a
| read-only snapshot to a writable volume/file system.
| 
| * https://openzfs.github.io/openzfs-docs/man/8/zfs-clone.8.htm...
| 
| Btrfs is actually the one 'corrupting' the already-
| accepted nomenclature of snapshots meaning a read-only
| copy of the data.
| 
| I would assume the etymology of the file system concept
| of a "snapshot" derives from photography, where something
| is frozen at a particular moment of time:
| 
| > _In computer systems, a snapshot is the state of a
| system at a particular point in time. The term was coined
| as an analogy to that in photography._ [...] _To avoid
| downtime, high-availability systems may instead perform
| the backup on a snapshot--a read-only copy of the data
| set frozen at a point in time--and allow applications to
| continue writing to their data. Most snapshot
| implementations are efficient and can create snapshots in
| O(1)._
| 
| * https://en.wikipedia.org/wiki/Snapshot_(computer_storage)
| 
| * https://en.wikipedia.org/wiki/Snapshot_(photography)
| orra wrote:
| Sure, there's lots of room for improvement. IIRC, rebalancing
| might be a WIP, finally?
| 
| But credit where credit is due: for a long time, ZFS has been
| the only fit-for-purpose filesystem, if you care about the
| integrity of your data.
| the8472 wrote:
| Afaik true rebalancing isn't in the works. Some limited
| add-device and remove-vdev features are in progress but
| AIUI they come with additional overhead and aren't as
| flexible.
| 
| btrfs and bcachefs rebalance leave your pool as if you had
| created it from scratch with the existing data and the new
| layout.
| e12e wrote:
| > [ZFS is] only great due to the lack of competitors in the
| checksummed-CoW-raid category.
| 
| You forgot robust native encryption, network-transparent
| dump/restore (ZFS send/receive) - and broad platform support
| (not so much anymore).
| 
| For a while you could have a solid FS with encryption support
| for your USB hd that could be safely used with Linux, *BSD,
| Windows, Open/FOSS Solaris and MacOS.
| josephg wrote:
| Is it just the implementation of zfs which is owned by
| oracle now? I wonder how hard it would be to write a
| compatible clean-room reimplementation of zfs in rust or
| something, from the spec.
| 
| Even if it doesn't implement every feature from the real
| zfs, it would still be handy for OS compatibility reasons.
| nine_k wrote:
| I would suppose it would take years of effort, and a lot
| of testing in search of performance enhancements and
| elimination of corner cases. Even if the code of the FS
| itself is created in a provably correct manner (a very
| tall order even with Rust), real hardware has a lot of
| quirks which need to be addressed.
| chasil wrote:
| I wish the btrfs (and perhaps bcachefs) projects would
| collaborate with OpenZFS to rewrite equivalent code that
| they all used.
| 
| It might take years, but washing Sun out of OpenZFS is
| the only thing that will free it.
| mustache_kimono wrote:
| OpenZFS is already free and open source. Linux kernel
| developers should just stop punching themselves in the face.
| 
| One way to solve the ZFS issue: Linus Torvalds could call
| a meeting of project leadership, and say, "Can we all
| agree that OpenZFS is not a derived work of Linux? It
| seems pretty obvious to anyone who understands the
| meaning of the copyright term of art 'derived work' and
| the origin of ZFS ... Good. We shall add a commit which
| indicates such to the COPYING file [0], like we have for
| programs that interface at the syscall boundary, to clear
| up any further confusion."
| 
| Can you imagine trying to bring a copyright infringement
| suit (with no damages!) in such an instance?
| 
| The ZFS hair shirt is self-imposed by semi-religious
| Linux wackadoos.
| 
| [0]: See, https://github.com/torvalds/linux/blob/master/LICENSES/excep...
| AshamedCaptain wrote:
| Even if you were able to say that OpenZFS is not a
| derived work of Linux, all it would allow you to do is to
| distribute OpenZFS. You would _still_ not be able to
| distribute OpenZFS + Linux as a combined work.
| 
| (I am one of these guys who thinks what Ubuntu is doing
| is crossing the line. To package two pieces of software
| whose license forbids you from distributing their
| combination in a way that "they are not combined but can
| be combined with a single click" is stretching it too
| much.)
| 
| It would be much simpler for Oracle to simply relicense
| older versions of ZFS under another license.
| mustache_kimono wrote:
| > Even if you were able to say that OpenZFS is not a
| derived work of Linux, all it would allow you to do is
| to distribute OpenZFS. You would _still_ not be able to
| distribute OpenZFS + Linux as a combined work.
| 
| Why? Linus said such modules and distribution were
| acceptable re: AFS, _an instance which is directly on
| point_. See: https://lkml.org/lkml/2003/12/3/228
| AshamedCaptain wrote:
| Where is he saying that you can distribute the combined
| work? That would not only violate the GPL, it would also
| violate AFS's license...
| 
| The only thing he's saying there is that he's not even
| 100% sure whether the AFS module is a derived work or not
| (if it were, it would be a violation _just to distribute
| the module by itself_!). Go imagine what his opinion will
| be on someone distributing a kernel already almost
| pre-linked with ZFS.
| 
| Not that it matters, since he's not the license author
| nor even the copyright holder these days...
| mustache_kimono wrote:
| > Where is he saying that you can distribute the combined
| work?
| 
| What's your reasoning as to why one couldn't, if we grant
| Linus's reasoning re: AFS as it applies to ZFS?
| 
| > Not that it matters, since he's not the license author
| nor even the copyright holder these days...
| 
| The Linux kernel community has seen fit to give its
| assurances re: other clarifications/exceptions. See the
| COPYING file.
| rascul wrote:
| Linus has some words on this matter:
| 
| > And honestly, there is no way I can merge any of the
| ZFS efforts until I get an official letter from Oracle
| that is signed by their main legal counsel or preferably
| by Larry Ellison himself that says that yes, it's ok to
| do so and treat the end result as GPL'd.
| 
| > Other people think it can be ok to merge ZFS code into
| the kernel and that the module interface makes it ok, and
| that's their decision. But considering Oracle's litigious
| nature, and the questions over licensing, there's no way
| I can feel safe in ever doing so.
| 
| > And I'm not at all interested in some "ZFS shim layer"
| thing either that some people seem to think would isolate
| the two projects. That adds no value to our side, and
| given Oracle's interface copyright suits (see Java), I
| don't think it's any real licensing win either.
| 
| https://www.realworldtech.com/forum/?threadid=189711&curpost...
| mustache_kimono wrote:
| > Linus has some words on this matter:
| 
| I hate to point this out, but this only demonstrates
| Linus Torvalds doesn't know much about copyright law.
| Linus could just as easily say "I was wrong. Sorry! As
| you all know -- IANAL. It's time we remedied this stupid
| chapter in our history. After all, _I gave similar
| assurances to the AFS module_ when it was open sourced
| under a GPL-incompatible license in 2003."
| 
| Linus's other words on the matter[0]:
| 
| > But one gray area in particular is something like a
| driver that was originally written for another operating
| system (ie clearly not a derived work of Linux in
| origin). At exactly what point does it become a derived
| work of the kernel (and thus fall under the GPL)?
| 
| > THAT is a gray area, and _that_ is the area where I
| personally believe that some modules may be considered to
| not be derived works simply because they weren't designed
| for Linux and don't depend on any special Linux
| behaviour.
| 
| [0]: https://lkml.org/lkml/2003/12/3/228
| kaba0 wrote:
| > wonder how hard it would be to write a compatible
| clean-room reimplementation of zfs in rust or something,
| from the spec
| 
| As for every non-trivial application - almost impossible.
| 0x457 wrote:
| Not exactly ZFS in Rust, but more like a replacement for
| ZFS in Rust: https://github.com/redox-os/tfs
| 
| Work stalled, though. It's not compatible, but I was
| working on overlayfs for FreeBSD in Rust, and it was not
| pleasant at all. I can't imagine making an entire "real"
| file system in Rust.
| gigatexal wrote:
| "Wastes" RAM? That's a tunable, my friend.
| viraptor wrote:
| https://github.com/openzfs/zfs/issues/10516
| 
| The data goes through two caches instead of just the page
| cache or just the ARC, as far as I understand it.
| quotemstr wrote:
| Can I totally disable ARC yet?
| throw0101a wrote:
|     zfs set primarycache=none foo/bar
| 
| ?
| 
| Though this will amplify reads, as even metadata will need
| to be fetched from disk, so perhaps "=metadata" may be
| better.
| 
| * https://openzfs.github.io/openzfs-docs/man/7/zfsprops.7.html...
| vluft wrote:
| I'm curious what your workflow is that not having any
| disk caching would have acceptable performance.
| 0x457 wrote:
| A workflow where the person doesn't understand that the
| RAM isn't wasted, and it's just their utility for showing
| usage that is wrong. Imagine being mad at the file system
| cache being stored in RAM.
| quotemstr wrote:
| The problem with ARC in ZFS on Linux is the double
| caching. Linux already has a page cache. It doesn't need
| ZFS to provide a second page cache. I want to store
| things in the Linux page cache once, not once in the page
| cache and once in ZFS's special-sauce cache.
| 
| If ARC is so good, it should be the general Linux page
| cache algorithm.
| mustache_kimono wrote:
| > It's only great due to the lack of competitors in the
| checksummed-CoW-raid category.
| 
| _blinks eyes, shakes head_
| 
| "It's only great because it's the only thing that's figured
| out how to do a hard thing really well" may be peak FOSS
| entitlement syndrome.
| 
| Meanwhile, btrfs has rapidly gone nowhere, and, if you read
| the comments on this PR, bcachefs would love to get to simply
| nowhere/btrfs status, but is still years away.
| 
| ZFS fulfills the core requirement of a filesystem, which is
| to store your data, such that when you read it back you can
| be assured it was the data you stored. It's amazing we
| continue to countenance systems that don't do this, simply
| because not fulfilling this core requirement was once
| considered acceptable.
| Dylan16807 wrote:
| I don't see what's entitled about the idea that "it
| fulfills the core requirements" is enough to get it "good"
| status but not "great" status. Even if that's really rare
| among filesystems.
| throw0101a wrote:
| > _Meanwhile, btrfs has rapidly gone nowhere_ [...]
| 
| A reminder that it came out in 2009:
| 
| * https://en.wikipedia.org/wiki/Btrfs
| 
| (ext4 was declared stable in 2008.)
| deepspace wrote:
| Yes! File systems are hard. My prediction is that it will
| be *at least* 10 years before this newfangled FS gains
| both feature- and stability parity with BTRFS and ZFS.
| 
| Also, BTRFS (albeit a modified version) has been used
| successfully in at least one commercial NAS (Synology),
| for many years. I don't see how that counts as "gone
| nowhere".
| throw0101a wrote:
| Have all the foot guns described in 2021 been fixed?
| 
| * https://arstechnica.com/gadgets/2021/09/examining-btrfs-linu...
| dnzm wrote:
| Not sure about "all", but apart from that article being
| more pissy than strictly necessary, RAID1 can now, in
| fact, survive losing more than one disk. That is, provided
| you use RAID1C3 or C4 (which keep 3 or 4 copies, rather
| than the default 2). Also, I'm not really sure how RAID1
| not surviving >1 disk failure is a slight against btrfs; I
| think most filesystems would have issues there...
| 
| As for the rest of the article -- the tone rubs me the
| wrong way, and somehow considering a FS shit because you
| couldn't be bothered to use the correct commands (the
| scrub vs balance ranty bit) doesn't instill confidence in
| me that the article is written in good faith.
| 
| I believe the writer's biggest hangup/footgunnage with
| btrfs is still there: it's not zfs. Ymmv.
| mustache_kimono wrote:
| > Also, BTRFS (albeit a modified version) has been used
| successfully in at least one commercial NAS (Synology),
| for many years. I don't see how that counts as "gone
| nowhere".
| 
| Excuse me for sounding glib. My point was that btrfs isn't
| considered a serious competitor to ZFS in many of the
| spaces ZFS operates in. Moreover, its inability to do
| RAID5/6 after years of effort is just weird now.
| ilyt wrote:
| Yeah, the world decided just replicating data somewhere is far
| preferable if you want to have resilience, instead of making
| the separate nodes more resilient.
| rektide wrote:
| Btrfs still highly recommends a raid1 mode for metadata, but
| for data itself, the raid-5 is fine.
| 
| I somewhat recall there being a little progress on trying to
| fix the remaining "write hole" issues in the past year or two.
| But in general, I think there's very little pressure to do so
| because so very many people run raid-5 for data already & it
| works great. Getting metadata off raid1 is low priority, a
| nice-to-have.
| kiririn wrote:
| Raid5 works ok until you scrub. Even scrubbing one device at
| a time is a barrage of random reads sustained for days at a
| time.
| 
| I'll very happily move back from MD raid 5 when linear scrub
| for parity raid lands.
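| 
| (Scrubbing one device at a time means pointing scrub at each
| device in turn instead of the mount point; device names here
| are made up:
| 
|     btrfs scrub start -B /dev/sdb
|     btrfs scrub start -B /dev/sdc
| 
| As I understand it, each pass still reads in a mostly random
| order on parity raid, which is where the days of seeking come
| from; the planned linear scrub would fix that.)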
| tremon wrote:
| Still, even with raid1 for metadata and raid5 for data, the
| kernel shouts at you about it being EXPERIMENTAL every time
| you mount such a filesystem. I understand that it's best to
| err on the side of caution, but that notice does a good job
| of perpetuating the idea that btrfs isn't ready for prime-
| time use.
| 
| I use btrfs on most of my Linux systems now (though only one
| with raid5), except for backup disks and backup volumes:
| those I intend to keep on ext4 indefinitely.
| sedatk wrote:
| > btrfs still doesn't have reliable RAID5
| 
| Synology offers btrfs + RAID5 without warning the user. I
| wonder why they're so confident with it.
| bestham wrote:
| They are running btrfs on top of DM.
| https://kb.synology.com/en-nz/DSM/tutorial/What_was_the_RAID...
| sedatk wrote:
| Thanks for the link!
| sporkle-feet wrote:
| Synology doesn't use the btrfs raid - AIUI they layer non-
| raid btrfs over raid LVM
| IAmLiterallyAB wrote:
| Here's a link to the Bcachefs site https://bcachefs.org/
| 
| I think it summarizes its features and strengths pretty well, and
| it has a lot of good technical information.
| sumtechguy wrote:
| Does anyone know if there are any good links to current
| benchmarks between the different types? My googlefu is only
| finding stuff from 2019.
| anentropic wrote:
| I can't help reading this name as Bca-chefs
| 
| (...I realise it must be B-cache-fs)
| p1mrx wrote:
| Maybe we could call it b$fs
| baobrien wrote:
| huh, this is fun:
| https://lore.kernel.org/lkml/ZFrBEsjrfseCUzqV@moria.home.lan...
| 
| There's a little x86-64 code generator in bcachefs to generate
| some sort of btree unpacking code.
| dathinab wrote:
| This is also the point most likely to cause problems for this
| patch series (which is only fixes and utils added to the
| kernel) and for bcachefs in general.
| 
| Like, when you have an entry like "bring back a function which
| could make developing viruses easier (though not a
| vulnerability by itself) related to memory management and
| code execution", the default answer is nop .. nooop .. never.
| (Which doesn't mean that it won't come back.)
| 
| It seems that while it's not necessary to have this, it makes
| a non-negligible performance difference.
| viraptor wrote:
| It would be really nice if he posted the difference
| with/without the optimisation for context. I hope it's going
| to be included in the explanation post he's planning.
| kzrdude wrote:
| It looks like the code generator is only available for x86
| anyway, so it seems niche that way. I am all about the
| baseline being good performance, not the special case.
| BenjiWiebe wrote:
| He mentions he wants to make the same type of
| optimization for ARM, so ARM+x86 certainly wouldn't be
| niche.
| 
| I wouldn't even call x86 alone niche...
| Permik wrote:
| I'll be eagerly waiting for the upcoming optimization writeup
| mentioned here:
| https://lore.kernel.org/lkml/ZFyAr%2F9L3neIWpF8@moria.home.l...
| mastax wrote:
| Please post it on HN because I won't remember to go looking
| for it.
| dontlaugh wrote:
| It's bad enough that the kernel includes a JIT for eBPF. Adding
| more of them without hardware constraints and/or formal
| verification seems like a bad idea to me.
| baobrien wrote:
| yeah, most of the kernel maintainers in that thread seem to
| be against it.
| bcachefs does seem to also have a non-code-generating
| implementation of this, as it runs on architectures other
| than x86-64.
| sporkle-feet wrote:
| The feature that caught my eye is the concept of having different
| targets.
| 
| A fast SSD can be set as the target for foreground writes, but
| that data will be transparently copied in the background to a
| "background" target, i.e. a large/slow disk.
| 
| If this works, it will be awesome.
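| 
| (Going by the bcachefs manual, the setup looks roughly like
| this -- the device names and labels are made up:
| 
|     bcachefs format \
|         --label=ssd.ssd1 /dev/nvme0n1 \
|         --label=hdd.hdd1 /dev/sda \
|         --foreground_target=ssd \
|         --background_target=hdd \
|         --promote_target=ssd
| 
| Writes land on the ssd group first, are flushed to hdd in the
| background, and hot data is promoted back to the ssd on read.)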
| viraptor wrote:
| You can also have that at the block level (which is where bcache
| itself comes from). Facebook used it years ago and I had it on
| an SSD+HDD laptop... a decade ago at least? Unless you want the
| filesystem to know about it, it's ready to go now.
| jwilk wrote:
| Look up the --write-mostly and --write-behind options in the
| mdadm(8) man page.
| 
| I can't recommend such a setup though. It works very poorly
| for me.
| saltcured wrote:
| See the lvmcache(7) manpage, which I think may be what the
| earlier poster was thinking of. It isn't an asymmetric RAID
| mode, but a tiered caching scheme where you can, for
| example, put a faster and smaller enterprise SSD in front
| of a larger and slower bulk store. So you can have a large
| bulk volume but the recently/frequently used blocks get the
| performance of the fast cache volume.
| 
| I set it up in the past with an mdadm RAID1 array over SSDs
| as a caching layer in front of another mdadm array over
| HDDs. It performed quite well in a developer/compute
| workstation environment.
| viraptor wrote:
| I did mean bcache specifically.
| https://www.kernel.org/doc/Documentation/bcache.txt
| throw0101a wrote:
| > _A fast SSD can be set as the target for foreground writes,
| but that data will be transparently copied in the background to
| a "background" target, i.e. a large/slow disk._
| 
| This is very similar in concept to (or an evolution of?) ZFS's
| ZIL:
| 
| * https://www.servethehome.com/what-is-the-zfs-zil-slog-and-wh...
| 
| * https://www.truenas.com/docs/references/zilandslog/
| 
| * https://www.45drives.com/community/articles/zfs-caching/
| 
| When this feature was first introduced to ZFS in the Solaris 10
| days there was an interesting demo from a person at Sun that I
| ran across: he was based in a Sun office on the US East Coast
| where he did stuff, but had access to Sun lab equipment across
| the US. He mounted iSCSI drives that were based in (IIRC)
| Colorado as a ZFS pool, and was using them for Postgres stuff:
| the performance was unsurprisingly not good. He then added a
| local ZIL to the ZFS pool and got I/O that was not too far off
| from some local (near-LAN) disks he was using for another pool.
| seized wrote:
| The ZIL is just a fast place to write the data for sync
| operations. If everything is working then the ZIL is never
| read from; ZFS uses RAM as that foreground bit.
| 
| Async writes on a default configuration don't hit the ZIL,
| only RAM for a few seconds, then disk. Sync writes are RAM to
| ZIL, confirm write, then RAM to pool.
| ThatPlayer wrote:
| But the ZIL is a cache, and not usable for long-term storage.
| If I combine a 1TB SSD with a 1TB HDD, I get 1TB of usable
| space. In bcachefs, that's 2TB of usable space.
| 
| Bcache (not bcachefs) is more equivalent to the ZIL.
| harvie wrote:
| What I really miss when compared to ZFS is the ability to create
| datasets. I really like to use ZFS subvolumes for LXC containers.
| That way I can have a separate sub-btree for each container with
| its own size limit, without having to create partitions or LVs,
| format the filesystem and then resize everything when I need to
| grow the partition, or even defragment the fs before shrinking
| it. With ZFS I can easily give and take disk capacity to my
| containers without having to do any multi-step operation that
| requires close attention to prevent accidental data loss.
| 
| Basically I just state what size I want that subtree to be and
| it happens without having to touch the underlying block devices.
| Also I can change it anytime during runtime extremely easily.
| E.g.:
| 
|     zfs set quota=42G tank/vps/my_vps
| 
|     zfs set quota=32G tank/vps/my_vps
| 
|     zfs set quota=23G tank/vps/my_other_vps
| 
| btrfs can kinda do this as well, but the commands are not as
| straightforward as in zfs.
| 
| update: My bad, bcachefs seems to have subvolumes now. There is
| also some quota support, but so far the documentation is a bit
| lacking, so I'm not yet sure how to use that and whether it can
| be configured per dataset.
| layer8 wrote:
| I parsed this as "BCA chefs" at first.
| curt15 wrote:
| For some reason VM and DB workloads are btrfs's Achilles heel but
| ZFS seems to handle them pretty well (provided that a suitable
| recordsize is set). How do they perform on bcachefs?
| candiddevmike wrote:
| I've never had a problem with these on BTRFS with COW disabled
| on their directories...
| pongo1231 wrote:
| The issue is that that also disables many of the interesting
| features of BTRFS for those files. No checksumming, no
| snapshots and no compression. In comparison, ZFS handles these
| features just fine for those kinds of files without the
| enormous performance / fragmentation issues of BTRFS (without
| nodatacow).
| [deleted]
| MisterTea wrote:
| Another file system I am interested in is GEFS - good enough fs
| (rather, "great experimental file shredder" until stable ;-).
| It's based on B-epsilon trees, a data structure which wasn't
| around when ZFS was designed. The idea is to build a ZFS-like fs
| without the size and complexity of ZFS. So far it's Plan 9 only
| and not production ready, though there is a chance it could be
| ported to OpenBSD, and a talk was given at NYC*BUG:
| https://www.nycbug.org/index?action=view&id=10688
| 
| Code: http://shithub.us/ori/gefs/HEAD/info.html
| voxadam wrote:
| If you're interested in more detailed information about bcachefs,
| I highly recommend checking out _bcachefs: Principles of
| Operation_.[0]
| 
| Also, the original developer of bcachefs (as well as bcache),
| Kent Overstreet, posts status updates from time to time on his
| Patreon page.[1]
| 
| [0] https://bcachefs.org/bcachefs-principles-of-operation.pdf
| 
| [1] https://www.patreon.com/bcachefs
| AceJohnny2 wrote:
| Thanks for the links!
| 
| I was wondering if bcachefs is architected with NAND-flash
| SSD hardware in mind (as recently highlighted on HN in the "Is
| Sequential IO Dead In The Era Of The NVMe Drive" article [1]
| [2]), to optimize IO and hardware lifecycle.
| 
| Skimming through the "bcachefs: Principles Of Operation" PDF,
| it appears the answer is no.
| 
| [1] https://jack-vanlightly.com/blog/2023/5/9/is-sequential-io-d...
| 
| [2] https://news.ycombinator.com/item?id=35878961
| koverstreet wrote:
| It is. There are also plans for ZNS SSD support.
___________________________________________________________________
(page generated 2023-05-11 23:01 UTC)