[HN Gopher] Bcachefs - A New COW Filesystem
       ___________________________________________________________________
        
       Bcachefs - A New COW Filesystem
        
       Author : jlpcsl
       Score  : 263 points
       Date   : 2023-05-11 08:50 UTC (14 hours ago)
        
 (HTM) web link (lore.kernel.org)
 (TXT) w3m dump (lore.kernel.org)
        
       | graderjs wrote:
       | Is there an optimal filesystem, or is it all just trade-offs? And
       | how far have we come since we were first creating filesystems
       | (Plan 9 or whatever) to now? Has there been any sort of
       | technological leap, some killer algorithm that really improved
       | things?
        
         | dsr_ wrote:
         | An optimal filesystem... for what?
         | 
         | There is no single filesystem which is optimized for
         | everything, so you need to specify things like
         | 
         | cross-platform transportability, network transparency, hardware
         | interfaces, hardware capability, reliability requirements,
         | cost-effectiveness, required features, expected workload,
         | licensing
         | 
         | and what the track record is in the real world.
        
         | the8472 wrote:
         | It's all tradeoffs.
        
       | jacknews wrote:
       | "The COW filesystem for Linux that won't eat your data"
       | 
       | LOL, they know what the problem is at least. I will try it out on
       | some old hard disks. The others (esp. looking at you btrfs) are
       | not good at not losing your entire volumes when disks start to go
       | bad.
        
       | eis wrote:
       | I really hope Linux can get a modern FS into common usage (as in
       | default FS for most distros). After more than a decade, ZFS and
       | BTRFS haven't gone anywhere. Something that's just there as a
       | default, is stable, performs decently (at least at ext4's level)
       | and brings modern features like snapshots. Bcachefs seems to have
       | a decent shot.
       | 
       | What I'd like to see even more though would be a switch from the
       | existing posix based filesystem APIs to a transaction based
       | system. It is way too complicated to do filesystem operations
       | that are not prone to data corruption should there be any issues.
        
         | viraptor wrote:
         | Btrfs is the default on a few systems already, like Fedora,
         | SUSE, Garuda, EasyNAS, Rockstor, and some others. It's not the
         | default in Ubuntu and Debian, but I wouldn't say it didn't go
         | anywhere either.
        
           | jadbox wrote:
           | I'm using btrfs on Fedora (the default install) and it's been
           | great over the last year.
           | 
           | The only thing to be aware of is to disable CoW/Hashing on
           | database stores or streaming download folders. Otherwise
           | it'll rehash each file update, which isn't needed.
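           | 
           | For example, a sketch (the path is hypothetical; chattr +C
           | only affects files created after the flag is set, and on
           | btrfs it also disables checksumming for those files):
           | 
           |     mkdir -p /srv/db
           |     chattr +C /srv/db        # new files here are NOCOW
           |     lsattr -d /srv/db        # should show the 'C' flag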
        
             | giantrobot wrote:
             | Why is it hashing files and not blocks? If a block is
             | hashed and written there's no need to touch it again.
        
             | dnzm wrote:
             | I've been running it on my NAS-slash-homeserver for... 5 or
             | 6 years now, I think. Root on a single SSD, data on a few
             | HDDs in RAID1. It's been great so far. My desktops are all
             | btrfs too, and the integration between OpenSUSE's package
             | manager and btrfs snapshots has been useful more than once.
        
           | curt15 wrote:
           | It looks like Fedora's adoption of btrfs unearthed another
           | data corruption bug recently:
           | https://bugzilla.redhat.com/show_bug.cgi?id=2169947
        
             | hackernudes wrote:
             | Wow, that's funny - almost looks like bcachefs explaining a
             | similar issue here https://lore.kernel.org/lkml/20230509165
             | 657.1735798-7-kent.o...
        
         | johnisgood wrote:
         | HAMMER2 supports snapshots. I do not have any experience with
         | it though.
        
           | aidenn0 wrote:
           | Is HAMMER2 supported on Linux? I thought it was Dragonfly
           | only.
        
             | joshbaptiste wrote:
             | yup Dragonfly and soon NetBSD
             | https://www.phoronix.com/news/NetBSD-HAMMER2-Port
        
       | renewiltord wrote:
       | Is there a high-performance in-kernel FS that acts as a
       | hierarchical cache that I can export over NFS? Presently I use
       | `catfs` over `goofys` and then I export the `catfs` mount.
        
         | cdavid wrote:
         | Not sure I understand your use case, but if you have to use
         | nfs, cachefilesd is very effective for read-heavy workloads:
         | https://access.redhat.com/documentation/en-us/red_hat_enterp...
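         | 
         | Roughly (a sketch; server and export names are hypothetical):
         | 
         |     systemctl enable --now cachefilesd
         |     mount -t nfs -o fsc nfs-server:/export /mnt/export
         | 
         | The 'fsc' mount option is what opts the NFS client into FS-
         | Cache; cachefilesd then backs that cache with local disk.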
        
       | throw0101b wrote:
       | > _These are RW btrfs-style snapshots_
       | 
       | There's a word for 'RW snapshots': clones. E.g.
       | 
       | * https://docs.netapp.com/us-en/ontap/task_admin_clone_data.ht...
       | 
       | * http://doc.isilon.com/onefs/9.4.0/help/en-us/ifs_t_clone_a_f...
       | 
       | * https://openzfs.github.io/openzfs-docs/man/8/zfs-clone.8.htm...
       | 
       | * http://www.voleg.info/lvm2-clone-logical-volume.html
       | 
       | In every other implementation I've come across the word
       | "snapshot" is about read-only copies. I'm not sure why btrfs (and
       | now bcachefs?) thinks it needs to muddy the nomenclature waters.
        
         | webstrand wrote:
         | Cloning can also mean simple duplication. I think calling it a
         | RW snapshot is clearer because a snapshot generally doesn't
         | mean simple duplication.
        
           | throw0101a wrote:
           | > _I think calling it a RW snapshot_ [...]
           | 
           | So what do you call a RO snapshot? Or do you now need to
           | write the prefix "RO" and "RW" _everywhere_ when referring to
           | a  "snapshot"?
           | 
           | How do CLI commands work? Will you have "btrfs snapshot" and
           | then have to always define whether you want RO or RW on every
           | invocation? This smells like git's bad front-end CLI
           | porcelain all over again (regardless of how nice the back-end
           | plumbing may be).
           | 
           | This is a solved problem with an established nomenclature
           | IMHO: just use the already-existing nouns/CLI-verbs of
           | "snapshot" and "clone".
           | 
           | > [...] _is clearer because a snapshot generally doesn't
           | mean simple duplication._
           | 
           | A snapshot generally means a static copy of the data; with
           | bcachefs (and ZFS and btrfs) being CoW, new copies are not
           | needed unless/until the source is altered.
           | 
           | If you want deduplication use "dedupe" in your CLI.
        
             | nextaccountic wrote:
             | > So what do you call a RO snapshot
             | 
             | There should be no read-only snapshot: it's just a writable
             | snapshot where you don't happen to perform a write
        
               | throw0101a wrote:
               | > _There should be no read-only snapshot: it's just a
               | writable snapshot where you don't happen to perform a
               | write_
               | 
               | So when malware comes along and goes after the live copy,
               | and happens to find the 'snapshot', but is able to hose
               | that snapshot data as well, the solution is to go to
               | tape?
               | 
               | As opposed to any other file system that implements read-
               | only snapshots, if the live copy is hosed, one can simply
               | clone/revert to the read-only copy. (This is not a
               | hypothetical: I've done this personally.)
               | 
               | (Certainly one should have off-device/site backups, but
               | being able to do a quick revert is great for MTTR.)
        
       | londons_explore wrote:
       | I would like to see filesystems benchmarked for robustness.
       | 
       | Specifically, robustness to everything around them not performing
       | as required. For example, imagine an evil SSD which had a 1%
       | chance of rolling a sector back to a previous version, a 1%
       | chance of saying a write failed when it didn't, a 1% chance of
       | writing data to the wrong sector number, a 1% chance of flipping
       | some bits in the written data, and a 1% chance of disconnecting
       | and reconnecting a few seconds later.
       | 
       | Real SSDs have bugs that make them do all of these things.
       | 
       | Given this evil SSD, I want to know how long the filesystem can
       | keep going serving the user's use case.
        
         | crest wrote:
         | A 1% error rate for corrupting other blocks is prohibitive. A
         | file system would have to do extensive forward error
         | correction in addition to checksumming to have a chance of
         | working with this. It would also have to perform a lot of
         | background scrubbing to stay ahead of the rot. While it's
         | interesting to model, and maybe even relevant as a research
         | problem given the steadily worsening bandwidth-to-capacity
         | ratio of affordable bulk storage, I don't expect there are
         | many users willing to accept the overhead required to come
         | even close to a usable file system on a device as bad as the
         | one you described.
        
           | comex wrote:
           | Well, it's sort of redundant. According to [1], the raw bit
           | error rate of the flash memory inside today's SSDs is already
           | in the 0.1%-1% range. And so the controllers inside the SSDs
           | already do forward error correction, more efficiently than
           | the host CPU could do it since they have dedicated hardware
           | for it. Adding another layer of error correction at the
           | filesystem level could help with some of the remaining
           | failure modes, but you would still have to worry about RAM
           | bitflips after the data has already been read into RAM and
           | validated.
           | 
           | [1] https://ieeexplore.ieee.org/document/9251942
        
           | simcop2387 wrote:
           | ZFS will do this. Give it a RAIDz-{1..3} setup and you've
           | got the FEC/parity calculations. Every read has its checksum
           | checked, and if a read finds issues it'll start resilvering
           | them ASAP. You are of course right that performance will
           | eventually get worse and worse, as it has to do much more
           | rewriting and full-on scrubbing if errors keep happening at
           | a constant rate, but it can generally handle things pretty
           | well.
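           | 
           | A minimal sketch (pool and device names are hypothetical):
           | 
           |     zpool create tank raidz2 sda sdb sdc sdd sde sdf
           |     zpool scrub tank       # re-read and verify every checksum
           |     zpool status -v tank   # per-device read/write/cksum errors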
        
             | Dylan16807 wrote:
             | I don't know...
             | 
             | Let's say you have 6 drives in raidz2. If you have a 1%
             | silent failure chance per block, then writing a set of 6
             | blocks has a 0.002% silent failure rate. And ZFS doesn't
             | immediately verify writes, so it won't try again.
             | 
             | If that's applied to 4KB blocks, then we have a 0.002%
             | failure rate per 16KB of data. It will take about 36
             | thousand sets of blocks to reach 50% odds of losing data,
             | which is only half a gigabyte. If we look at the larger
             | block ZFS uses internally then it's a handful of gigabytes.
             | 
             | And that's without even adding the feature where writing
             | one block will corrupt other blocks.
        
         | toxik wrote:
         | With this level of adversarial problems, you'd better formulate
         | your whole IO stack as a two-player minimax game.
        
         | throw0101a wrote:
         | > _For example, imagine an evil SSD which had a 1% chance of
         | rolling a sector back to a previous version, a 1% chance of
         | saying a write failed when it didn't, a 1% chance of writing
         | data to the wrong sector number, a 1% chance of flipping some
         | bits in the written data, and a 1% chance of disconnecting and
         | reconnecting a few seconds later._
         | 
         | There are stories from the ZFS folks of dealing with exactly
         | these kinds of issues, and things running just fine.
         | 
         | While not directly involved with ZFS development (IIRC), Bryan
         | Cantrill was very 'ZFS-adjacent' since he used Solaris-based
         | systems for a lot of his career, and he has several rants about
         | firmware that you can find online.
         | 
         | A video that went viral many years ago, with Cantrill and
         | Brendan Gregg, is "Shouting in the Datacenter":
         | 
         | * https://www.youtube.com/watch?v=tDacjrSCeq4
         | 
         | * https://www.youtube.com/watch?v=lMPozJFC8g0 (making of)
        
         | amluto wrote:
         | ISTM one could design a filesystem as a Byzantine-fault-
         | tolerant distributed system that happens to have many nodes
         | (disks) partially sharing hardware (CPU, memory, etc). The
         | result would not look that much like RAID, but would look quite
         | a bit like Ceph and its relatives.
         | 
         | Bonus points for making the result efficiently support multiple
         | nodes, each with multiple disks.
        
         | PhilipRoman wrote:
         | I have basically the opposite problem. I've been looking for a
         | filesystem that maximizes performance (and minimizes actual
         | disk writes) at the cost of reliability. As long as it loses
         | all my data less than once a week, I can live with it.
        
           | viraptor wrote:
           | Have you tried allowing ext4 to ignore all safety?
           | data=writeback, barrier=0, bump up dirty_ratio, tune
           | ^has_journal, maybe disable flushes with
           | https://github.com/stewartsmith/libeatmydata
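           | 
           | Something like this (a sketch; device and mountpoint are
           | hypothetical, and losing data on a crash is the whole point):
           | 
           |     # either drop the journal entirely...
           |     tune2fs -O ^has_journal /dev/sdX1
           |     # ...or keep it but relax ordering and flushes
           |     mount -o noatime,data=writeback,barrier=0 /dev/sdX1 /scratch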
        
             | PhilipRoman wrote:
             | Thanks, this looks promising
        
             | the8472 wrote:
             | You can also add journal_async_commit,noauto_da_alloc
             | 
             | > maybe disable flushes with
             | https://github.com/stewartsmith/libeatmydata
             | 
             | overlayfs has a volatile mount option that has that effect.
             | So stacking a volatile overlayfs with the upper and lower
             | on the same ext4 could provide that behavior even for
             | applications that can't be intercepted with LD_PRELOAD
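             | 
             | For example (a sketch with hypothetical directories, all on
             | the same ext4; 'volatile' needs a reasonably recent kernel):
             | 
             |     mkdir -p /data/lower /data/upper /data/work /scratch
             |     mount -t overlay overlay \
             |       -o lowerdir=/data/lower,upperdir=/data/upper,workdir=/data/work,volatile \
             |       /scratch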
        
           | seunosewa wrote:
             | How would you cope with losing all your data once a week?
        
             | PhilipRoman wrote:
             | "once a week" was maybe a too extreme example. For my case
             | specifically: lost data can be recomputed. Basically a
             | bunch of compiler outputs, indexes and analysis results on
             | the input files, typically an order of magnitude larger
             | than the original files themselves.
             | 
             | Any files that are important would go to a separate, more
             | reliable filesystem (or uploaded elsewhere).
        
               | kadoban wrote:
               | On top of the other suggestions you've already gotten,
               | raid0 might be worth looking at. That has some good speed
               | vs reliability tradeoffs (in the direction you want).
        
             | dur-randir wrote:
             | Some video production workflows are run on 4xraid0 just for
             | the speed - it fails rarely enough and intermediate output
             | is just re-created.
        
               | desro wrote:
               | Can confirm. When I can't work off my internal MacBook
               | storage, my working drive is a RAID0 NVME array over
               | Thunderbolt. Jobs set up in Carbon Copy Cloner make
               | incremental hourly backups to a NAS on site as well as a
               | locally-attached RAID6 HDD array.
               | 
               | If the stripe dies, worst case is I lose up to one hour
               | of work, plus let's say another hour copying assets back
               | to a rebuilt stripe.
               | 
               | There are _so many_ MASSIVE files created in intermediate
               | stages of audiovisual [post] production.
        
           | sph wrote:
           | It all depends on how much reliability you are willing to
           | give up for performance.
           | 
           | Because I have the best storage performance you'll ever find
           | anywhere, 100% money-back guaranteed: write to /dev/null. It
           | comes with the downside of 0% reliability.
           | 
           | You can write to a disk without a file-system, sequentially,
           | until space ends. Quite fast actually, and reliable, until
           | you reach the end, then reliability drops dramatically.
        
             | [deleted]
        
             | PhilipRoman wrote:
             | Yeah, I've had good experience with bypassing the fs layer
             | in the past; especially on an HDD the gains can be insane.
             | But it won't help here, as I still need a more-or-less
             | POSIXy read/write API.
             | 
             | P.S. I'm fairly certain that /dev/null would lose my data a
             | bit more often than once a week.
        
             | jasomill wrote:
             | Trouble is you can't use /dev/null as a filesystem, even
             | for testing.
             | 
             | On a related note, though, I've considered the idea of
             | creating a "minimally POSIX-compliant" filesystem that
             | randomly reorders and delays I/O operations whenever
             | standards permit it to do so, along with any other odd
             | behavior I can find that remains within the _letter_ of
             | published standards (unusual path limitations, support for
             | exactly two hard links per file, sparse files that require
             | holes to be aligned on 4,099-byte boundaries in spite of
             | the filesystem 's reported 509-byte block size, etc., all
             | properly reported by applicable APIs).
        
           | dralley wrote:
           | Cue MongoDB memes
        
             | ilyt wrote:
             | Probably just using it for cache of some kind
        
           | magicalhippo wrote:
           | Tongue-in-cheek solution: use a ramdisk[1] for dm-writecache
           | in writeback mode[2]?
           | 
           | [1]: https://www.kernel.org/doc/Documentation/blockdev/ramdis
           | k.tx...
           | 
           | [2]: https://blog.delouw.ch/2020/01/29/using-lvm-cache-for-
           | storag...
        
           | bionade24 wrote:
           | Not sure if this is feasible, but have you considered dumping
           | binary data onto the raw disk, like is done with tapes?
        
             | crabbone wrote:
             | You could even use partitions as files. You could only have
             | 128 files, but maybe that's enough for OP?
        
           | rwmj wrote:
           | Don't you want a RAM disk for this? It'll lose all your data
           | (reliably!) when you reboot.
           | 
           | You could also look at this:
           | https://rwmj.wordpress.com/2020/03/21/new-nbdkit-remote-
           | tmpf... We use it for Koji builds, where we actually don't
           | care about keeping the build tree around (we persist only the
           | built objects and artifacts elsewhere). This plugin is pretty
           | fast for this use case because it ignores FUA requests from
           | the filesystem. Obviously don't use it where you care about
           | your data.
        
             | ilyt wrote:
             | > Don't you want a RAM disk for this? It'll lose all your
             | data (reliably!) when you reboot.
             | 
             | Uhh hello, pricing ?
        
               | fwip wrote:
               | Depends on how much space you need.
        
           | mastax wrote:
           | You should be able to do this with basically any file system
           | by using the mount options `async` (the default) and
           | `noatime`, disabling journalling, and massively increasing
           | vm.dirty_background_ratio, vm.dirty_ratio, and
           | vm.dirty_expire_centisecs.
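           | 
           | Roughly (a sketch; the numbers are arbitrary "keep nearly
           | everything dirty in RAM" values):
           | 
           |     sysctl -w vm.dirty_background_ratio=50
           |     sysctl -w vm.dirty_ratio=80
           |     sysctl -w vm.dirty_expire_centisecs=60000   # ~10 minutes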
        
         | rwmj wrote:
         | nbdkit memory 10G --filter=error error-rate=1%
         | 
         | ... and then nbd-loop-mount that as a block device and create
         | your filesystem on top.
         | 
         | Notes:
         | 
         | We're working on making a ublk interface to nbdkit plugins so
         | the loop mount wouldn't be needed.
         | 
         | There are actually better ways to use the error filter, such as
         | triggering it from a file, see the manual:
         | https://www.libguestfs.org/nbdkit-error-filter.1.html
         | 
         | It's an interesting idea to have an "evil" filter that flips
         | bits at random. I might write that!
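         | 
         | For anyone wanting to try it, the whole loop looks roughly
         | like this (a sketch; device and mountpoint are hypothetical):
         | 
         |     modprobe nbd
         |     nbdkit memory 10G --filter=error error-rate=1%
         |     nbd-client localhost /dev/nbd0
         |     mkfs.ext4 /dev/nbd0 && mount /dev/nbd0 /mnt/test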
        
           | antongribok wrote:
           | How does this compare with dm-flakey [0] ?
           | 
           | [0]: https://www.kernel.org/doc/html/latest/admin-
           | guide/device-ma...
        
           | Dwedit wrote:
           | Suddenly I'm reminded of the time someone made a Bad Internet
           | simulator (causing packet loss or other problems) and named
           | the program "Comcast".
        
         | seized wrote:
         | I've had ZFS pools survive (at different times over the span of
         | years):
         | 
         | - A RAIDz1 (RAID5) pool with a second disk start failing while
         | rebuilding from an earlier disk failure (data was fine)
         | 
         | - A water to air CPU cooler leaking, CPU overheated and the
         | water shorted and killed the HBA running a pool (data was fine)
         | 
         | - An SFF-8088 cable half plugged in for months, pool would
         | sometimes hiccup, throw off errors, take a while to list files,
         | but worked fine after plugging it in properly (data was fine
         | after)
         | 
         | Then the usual disk failures which are a non-event with ZFS.
        
           | gigatexal wrote:
           | This is why I always opt for ZFS.
        
           | ilyt wrote:
           | I recovered from a 3-disk RAID 6 failure (which itself was
           | driven by organizational failure...) in Linux's mdadm...
           | ddrescue to the rescue. I guess I got "lucky" that the bad
           | blocks didn't happen in the same place on all drives (one
           | died, another started returning bad blocks), but the chance
           | of that happening is infinitesimally small in the first
           | place.
           | 
           | So _shrug_
        
             | avianlyric wrote:
             | How do you know you got lucky with corrupted blocks?
             | 
             | mdadm doesn't checksum data, and just trusts the HDD to
             | either return correct data, or an error. But HDDs return
             | incorrect data all the time, their specs even tell you how
             | much incorrect data they'll return, and for anything over
             | about 8TB you're basically guaranteed some silent
             | corruption if you read every byte.
        
             | johnmaguire wrote:
             | Yes, I also had a "1 disk failed, second disk failed during
             | rebuild" event (like the parent, not your story) with mdadm
             | & RAID 6 with no issues.
             | 
             | People seem to love ZFS but I had no issues running mdadm.
             | I'm now running a ZFS pool and so far it's been more work
             | (and things to learn), requires a lot more RAM, and the
             | benefits are... escaping me.
        
               | ysleepy wrote:
               | Are you sure the data survived? ZFS is sure and proves it
               | with checksums over metadata and data. I don't know mdadm
               | well enough to know if it does this too.
        
         | 112233 wrote:
         | Please, where can I read more about this? I remember bricking
         | an OCZ drive by setting the ATA password, as was fashionable
         | back then, but 1% of writes going to the wrong sector - what
         | are these drives, fake SD cards from AliExpress?
         | 
         | Like, which manufacturer goes, "tests show we cannot write
         | more than 400kB without corrupting the drive, let us ship
         | this!"?
        
           | jlokier wrote:
           | _> Like, which manufacturer goes, "tests show we cannot write
           | more than 400kB without corrupting the drive, let us ship
           | this!"?_
           | 
           | According to https://www.sqlite.org/howtocorrupt.html there
           | are such drives:
           | 
           |  _4.2. Fake capacity USB sticks_
           | 
           |  _There are many fraudulent USB sticks in circulation that
           | report to have a high capacity (ex: 8GB) but are really only
           | capable of storing a much smaller amount (ex: 1GB). Attempts
           | to write on these devices will often result in unrelated
           | files being overwritten. Any use of a fraudulent flash memory
           | device can easily lead to database corruption, therefore.
           | Internet searches such as "fake capacity usb" will turn up
           | lots of disturbing information about this problem._
           | 
           | Bit flips and overwriting of wrong sectors from unrelated
           | files being written are also mentioned. You might think this
           | sort of thing is limited to cheap USB flash drives, but I've
           | been told about NVMe SSDs violating their guarantees and
           | causing very strange corruption patterns too. Unfortunately,
           | when the cause is a bug in the storage device's own
           | algorithms, the rare corruption event is not necessarily
           | limited to a few bytes or just 1 sector here or there, nor
           | to just the sectors being written.
           | 
           | I don't know how prevalent any of these things are really.
           | The sqlite.org page says "most" consumer HDDs lie about
           | committing data to the platter before reporting they've done
           | so, but when I worked with ext3 barriers back in the mid
           | 2000s, the
           | HDDs I tested had timing consistent with flushing write cache
           | correctly, and turning off barriers did in fact lead to
           | observable filesystem corruption on power loss, which was
           | prevented by turning on barriers. The barriers were so
           | important they made the difference between embedded devices
           | that could be reliably power cycled, vs those which didn't
           | reliably recover on boot.
        
             | Dylan16807 wrote:
             | > Fake capacity USB sticks
             | 
             | Those drives have a sudden flip from 0% corruption to
             | 95-100% corruption when you hit their limits. I wouldn't
             | count that as the same thing. And you can't reasonably
             | expect anything to work on those.
             | 
             | > The sqlite.org page says "most" consumer HDDs lie about
             | committing data to the platter before reporting they've done
             | so, but when I worked with ext3 barriers back in the mid
             | 2000s, the HDDs I tested had timing consistent with
             | flushing write cache correctly
             | 
             | Losing a burst of writes every once in a while also
             | manifests extremely differently from steady 1% loss and
             | needs to be handled in a very different way. And if it's at
             | power loss it might be as simple as rolling back the last
             | checkpoint during mount if verification fails.
        
           | londons_explore wrote:
           | Writes going to the wrong sector are usually wear levelling
           | algorithms gone wrong. Specifically, it normally means the
           | information about which logical sector maps to which physical
           | sector was updated not in sync with the actual writing of the
           | data. This is a common performance 'trick' - by delaying and
           | aggregating these bookkeeping writes, and taking them off the
           | critical path, you avoid writing so much data and the user
           | sees lower latency.
           | 
           | However, if something like a power failure or firmware crash
           | happens, and the bookkeeping writes never happen, then the
           | end result that the user sees after a reboot is their data
           | written to the wrong sector.
        
         | jeffbee wrote:
         | But that would require Linux hackers to read and understand the
         | literature, and absorb the lessons of industry practice,
         | instead of just blurting out their aesthetic ideal of a
         | filesystem.
        
           | sangnoir wrote:
           | Isn't Linux the most deployed OS in industry (by
           | practitioners)? Are the Linux hyperscalers hiding their FS
           | secret-sauce, or are the "aesthetic ideal" filesystems
           | available to Linux perhaps good enough?
        
             | jeffbee wrote:
             | I imagine the hyperscalers are all handling integrity at a
             | higher level where individual filesystems on single hosts
             | are irrelevant to the outcome. In such applications, any
             | old filesystem will do.
             | 
             | For people who do not have application-level integrity, the
             | systems that offer robustness in the face of imperfect
             | storage devices are sold by companies like NetApp, which a
             | lot of people would sneer at but they've done the math.
        
               | Datagenerator wrote:
               | Seen NetApp's boot messages, it's FreeBSD under the hood
        
               | seized wrote:
               | As is EMC Isilon.
        
               | jeffbee wrote:
               | They have their own filesystem with all manner of
               | integrity protection.
               | https://en.wikipedia.org/wiki/Write_Anywhere_File_Layout
        
         | deathanatos wrote:
         | On the whole, I'm not sure that a FS can work around such a
         | byzantine drive, at least not if it's the only such drive in the
         | system. I'd rather FSes not try to pave over these: these disks
         | are faulty, and we need to demand, with our wallets, better
         | quality hardware.
         | 
         | > _1% chance of disconnecting and reconnecting a few seconds
         | later_
         | 
         | I actually have such an SSD. It's unusable when it is in that
         | state. The FS doesn't corrupt the data, but it's hard for the
         | OS to make forward progress, and obviously a lot of writes fail
         | at the application level. (It's a shitty USB implementation on
         | the disk: it disconnects if it's on a USB-3 capable port, and
         | too much transfer occurs. It's USB-2, though; connecting it to
         | a USB-2-only port makes it work just fine.)
        
           | mprovost wrote:
           | At the point where it disappears for seconds, it's a
           | distributed system, not an attached disk. At this point you
           | have to start applying the CAP theorem.
           | 
           | At least in Unix the assumption is that disks are always
           | attached (and reliable...) so write errors don't typically
           | bubble up to the application layer. This is why losing an NFS
           | mount typically just hangs the system until it recovers.
        
             | deathanatos wrote:
             | > _At least in Unix the assumption is that disks are always
             | attached (and reliable...)_
             | 
             | I want to say a physical disk being yanked from the system
             | (essentially what was happening, as far as the OS could
             | tell) does cause I/O errors in Linux? I could be wrong
             | though, this isn't exactly something I try to exercise.
             | 
             | As for it being a distributed system ... I suppose? But
             | that's what an FS's log is for: when the drive reconnects,
             | there will either be a pending WAL entry, or not. If there
             | is, the write can be persisted, otherwise, it is lost. But
             | consistency should still happen.
             | 
             | Now, an _app_ might not be ready for that, but that's an
             | app bug.
             | 
             | But it can always happen that the power goes out, which in
             | my situation is equivalent to a disk yank. There's also
             | small children, the SO tripping over a cable, etc.
             | 
             | But these are different failure modes from some of what the
             | above post listed, such as disks undoing acknowledged
             | writes, or lying about having persisted the write. (Some of
             | the examples are byzantine, some are not.)
        
       | [deleted]
        
       | brnt wrote:
       | Erasure coding at the filesystem level? Finally!
       | 
       | I've not dared try bcachefs out though, I'm quite wary of data
       | loss, even on my laptop. Does anyone have experience to share?
        
         | BlackLotus89 wrote:
         | I had (have) a laptop that crashed reproducibly when you
         | touched it wrong. It had a few btrfs corruptions, and after a
         | while I'd had enough. It has been running bcachefs as its
         | rootfs for a few years now, and I've had no issue whatsoever
         | with it. Home is still btrfs (for reasons) and has had no data
         | loss either. The only problems I had were fixed by booting a
         | rescue system and mounting from there (no fsck necessary);
         | that happened twice in 2 years or so. I was too lazy to check
         | what the bcachefs hook (AUR package) does wrong.
         | 
         | Edit: Reasons for home being btrfs: I set this up a long
         | fucking time ago and it was more or less meant as a stress
         | test for bcachefs. Since I didn't want data loss on important
         | data (like my home), I left my home as btrfs.
        
       | orra wrote:
       | Oh! This is very exciting. Bcachefs could be the next gen
       | filesystem that Linux needs[1].
       | 
       | Advantages over other filesystems:
       | 
       | * ext4 or xfs -- these two only checksum the filesystem
       | metadata, not your data
       | 
       | * zfs -- zfs is technically great, but binary distribution of the
       | zfs code is tricky, because the CDDL is GPL incompatible
       | 
       | * btrfs -- btrfs still doesn't have reliable RAID5
       | 
       | [1] It's been in development for a number of years. It now being
       | proposed for inclusion in the mainline kernel is a major
       | milestone.
        
         | vladvasiliu wrote:
         | > * zfs -- zfs is technically great, but binary distribution of
         | the zfs code is tricky, because the CDDL is GPL incompatible
         | 
         | Building your own ZFS module is easy enough, for example on
         | Arch with zfs-dkms.
         | 
         | But there's also the issue of compatibility. Sometimes kernel
         | updates will break ZFS. Even minor ones, 6.2.13 IIRC broke it,
         | whereas 6.2.12 was fine.
         | 
         | Right now, 6.3 seems to introduce major compatibility problems.
         | 
         | ---
         | 
         | edit: looking through the openzfs issues, I was likely thinking
         | of 6.2.8 breaking it, where 6.2.7 was fine. Point stands,
         | though. https://github.com/openzfs/zfs/issues/14658
         | 
         | Regarding 6.3 support, it apparently is merged in the master
         | branch, but no release as of yet.
         | https://github.com/openzfs/zfs/issues/14622
        
           | kaba0 wrote:
           | It might help someone: nixos can be configured to always use
           | the latest kernel version that is compatible with zfs, I
           | believe its
           | config.boot.zfs.package.latestCompatibleLinuxPackages .
        
           | bjoli wrote:
           | What is the legal situation with doing that? If I had a
           | company I wouldn't want to get in trouble with any litigious
           | companies.
        
             | boomboomsubban wrote:
             | Unless you're distributing, I don't see how anybody could
             | do anything. Personal (or company wide) use has always
             | allowed the mixing of basically any licenses.
             | 
             | The worst case scenarios would be something like Ubuntu
             | being unable to provide compiled modules, but dkms would
             | still be fine. Or the very unlikely ZFS on Linux getting
             | sued, but that would involve a lengthy trial that would
             | allow you to move away from Open ZFS.
        
               | chasil wrote:
               | The danger is specifically to the copyright holders of
               | Linux - the authors who have code in the kernel. If they
               | do not defend their copyright, then it is not strong and
               | can be broken in certain scenarios.
               | 
               | "Linux copyright holders in the GPL Compliance Project
               | for Linux Developers believe that distribution of ZFS
               | binaries is a GPL violation and infringes Linux's
               | copyright."
               | 
               | Linux bundling ZFS code would bring this text against the
               | GPL: "You may not offer or impose any terms on any
               | Covered Software in Source Code form that alters or
               | restricts the applicable version of [the CDDL]."
               | 
               | Ubuntu distributes ZFS as an out of tree module, which
               | taints the kernel immediately at installation.
               | Hopefully, this is enough to prevent a great legal
               | challenge.
               | 
               | https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/
        
               | boomboomsubban wrote:
               | Yes, distribution has legal risks. Use does not, it only
               | has the risk that they are unable to get ZFS distributed.
        
             | vladvasiliu wrote:
             | The ArchZFS project distributes binary kernel images with
             | ZFS integrated. I don't know what the legal situation is
             | for that.
             | 
             | In my case, the Arch package is more of a "recipe maker".
             | It fetches the Linux headers and the zfs source code and
             | compiles this for local use. As far as they are concerned,
             | there is no distribution of the resulting artifact. IANAL,
             | but I think if there's an issue with that, then OpenZFS is
             | basically never usable under Linux.
             | 
             | Other companies distributed kernels with zfs support
             | directly, such as Ubuntu. I don't recall there being news
             | of them being sued over this, but maybe they managed to
             | work something out.
        
               | 5e92cb50239222b wrote:
               | archzfs does not distribute any kernel images, they only
               | provide pre-built modules for the officially supported
               | kernels.
        
             | dsr_ wrote:
             | IANAL.
             | 
             | Oracle is very litigious. However, OpenZFS has been
             | releasing code for more than a decade. Ubuntu shipped
             | integrated ZFS/Linux in 2016. It's certain that Oracle
             | knows all about it and has decided that being vague is more
             | in their interests than actually settling the matter.
             | 
             | On my list of potential legal worries, this is not a
             | priority for me.
        
               | cduzz wrote:
               | I would add to this "IANAL But" list
               | 
               | https://aws.amazon.com/fsx/openzfs/
               | 
               | So -- AWS / Amazon are certainly big enough to have
               | reviewed the licenses and have some understanding of
               | potential legal risks of this.
        
           | orra wrote:
           | You're right that DKMS is fairly easy (at least until you
           | enable secure boot).
           | 
           | > Even minor ones, 6.2.13 IIRC broke it, whereas 6.2.12 was
           | fine.
           | 
           | Interesting!
           | 
           | It's just a shame the license has hindered adoption. Ubuntu
           | were shipping binary ZFS modules at one point, but they have
           | walked back from that.
        
             | vladvasiliu wrote:
             | > You're right that DKMS is fairly easy (at least until you
             | enable secure boot).
             | 
             | Still easy. Under Arch, the kernel image isn't signed, so
             | if you enable secure boot you need to fiddle with signing
             | on your own. At that point, you can just sign the kernel
             | once the module is built. Works fine for me.
        
             | mustache_kimono wrote:
             | > Ubuntu were shipping binary ZFS modules at one point, but
             | they have walked back from that.
             | 
             | This is incorrect? Ubuntu is still shipping binary modules.
        
               | orra wrote:
               | Right, but various things point to ZFS being de facto
               | deprecated: https://www.omgubuntu.co.uk/2023/01/ubuntu-
               | zfs-support-statu...
        
               | mustache_kimono wrote:
               | > various things point to ZFS being de facto deprecated
               | 
               | I'm not sure that's the case? Your link points to the ZFS
               | _on root install_ being deprecated on the _desktop_. I'm
               | not sure what inference you/we can draw from that,
               | considering ZFS is a major component of LXD, and Ubuntu
               | and Linux's sweet spot is as a server OS.
               | 
               | > Ubuntu were shipping binary ZFS modules at one point,
               | but they have walked back from that.
               | 
               | Not to be persnickety, but this was your claim, and
               | Ubuntu is still shipping ZFS binary modules on all its
               | current releases.
        
               | orra wrote:
               | Yeah, my wording was clumsy, but thanks for assuming good
               | faith. I essentially meant their enthusiasm had waned.
               | 
               | It's good you can give reasons ZFS is still important to
               | Ubuntu on the server, although as a desktop user I'm sad
               | nobody wants to ship ZFS for the desktop.
        
         | ilyt wrote:
         | > [1] It's been in development for a number of years. It now
         | being proposed for inclusion in the mainline kernel is a major
         | milestone.
         | 
         | That's not a measure of quality in the slightest. Btrfs has
         | had some serious bugs over the years despite being in
         | mainline.
        
           | orra wrote:
           | True, but bcachefs gives the impression of being better
           | designed, and not being rushed upstream. I think it helps that
           | bcachefs evolved from bcache.
        
         | the8472 wrote:
         | > zfs is technically great
         | 
         | It's only great due to the lack of competitors in the
         | checksummed-CoW-raid category. It lacks a bunch of things:
         | Defrag, Reflinks, On-Demand Dedup, Rebalance (online raid
         | geometry change, device removal, device shrink). It also wastes
         | RAM due to page cache + ARC.
        
           | macdice wrote:
           | Reflinks and copy_file_range() are just landing in OpenZFS
           | now I think? (Block cloning)
        
             | pongo1231 wrote:
             | Block cloning support has indeed recently landed in git and
             | already allows for reflinks under FreeBSD. Still has to be
             | wired up for Linux though.
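             | 
             | Once it's wired up, the user-visible effect should be the
             | usual reflink copy (a sketch with hypothetical file names;
             | --reflink=always fails cleanly if the filesystem can't
             | clone blocks):
             | 
             |     cp --reflink=always big.img big-clone.img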
        
               | mustache_kimono wrote:
               | Really excited about this.
               | 
               | Once support hits in Linux, a little app of mine[0] will
               | support block cloning for its "roll forward" operation,
               | where all previous snapshots are preserved, but a
               | particular snapshot is rolled forward to the live
               | dataset. Right now, data is simply diff copied in chunks.
               | When this support hits, there will be no need to copy any
               | data. Blocks written to the live dataset can just be
               | references to the underlying snapshot blocks, and no
               | extra space will need to be used.
               | 
               | [0]: https://github.com/kimono-koans/httm
        
               | nextaccountic wrote:
               | What does it mean to roll forward? I read the linked
               | Github and I don't get what is happening
               | 
               | > Roll forward to a previous ZFS snapshot, instead of
               | rolling back (this avoids destroying interstitial
               | snapshots):
               | 
               |     sudo httm --roll-forward=rpool/scratch@snap_2023-04-01-15:26:06_httmSnapFileMount
               |     [sudo] password for kimono:
               |     httm took a pre-execution snapshot named:
               |       rpool/scratch@snap_pre_2023-04-01-15:27:38_httmSnapRollForward
               |     ...
               |     httm roll forward completed successfully.
               |     httm took a post-execution snapshot named:
               |       rpool/scratch@snap_post_2023-04-01-15:28:40_:snap_2023-04-01-15:26:06_httmSnapFileMount:_httmSnapRollForward
        
               | mustache_kimono wrote:
               | From the help and man page[0]:
               | 
               |     --roll-forward="snap_name"
               |         traditionally 'zfs rollback' is a destructive
               |         operation, whereas httm roll-forward is
               |         non-destructive.  httm will copy only files and
               |         their attributes that have changed since a
               |         specified snapshot, from that snapshot, to its
               |         live dataset.  httm will also take two
               |         precautionary snapshots, one before and one
               |         after the copy.  Should the roll forward fail
               |         for any reason, httm will roll back to the
               |         pre-execution state.  Note: This is a ZFS only
               |         option which requires super user privileges.
               | 
               | I might also add 'zfs rollback' is a destructive
               | operation because it destroys snapshots between the
               | current live version of the filesystem and the rollback
               | snapshot target (the 'interstitial' snapshots). Imagine
               | you have ransomware installed and you _need_ to roll
               | back, but you want to view the ransomware's operations
               | through snapshots for forensic purposes. You
               | can do that.
               | 
               | It's also faster than a checksummed rsync, because it
               | makes its determination based on the underlying ZFS
               | checksums, and more accurate than a non-checksummed
               | rsync.
               | 
               | This is a relatively minor feature re: httm. I recommend
               | installing and playing around with it a bit.
               | 
               | [0]: https://github.com/kimono-
               | koans/httm/blob/master/httm.1
        
               | nextaccountic wrote:
               | What I don't understand is: aren't zfs snapshots
               | writable, like in btrfs?
               | 
               | If I wanted to rollback the live filesystem into a
               | previous snapshot, why couldn't I just start writing into
               | the snapshot instead? (Or create another snapshot that is
               | a clone of the old one, and write into it)
        
               | throw0101a wrote:
               | > _What I don't understand is: aren't zfs snapshots
               | writable, like in btrfs?_
               | 
               | ZFS snapshots, following the historic meaning of
               | "snapshot", are read-only. ZFS supports _cloning_ of a
               | read-only snapshot to a writable volume /file system.
               | 
               | * https://openzfs.github.io/openzfs-docs/man/8/zfs-
               | clone.8.htm...
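               | 
               | In ZFS terms, a sketch (dataset names are hypothetical):
               | 
               |     zfs snapshot rpool/data@before          # read-only
               |     zfs clone rpool/data@before rpool/data-rw   # writable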
               | 
               | Btrfs is actually the one 'corrupting' the already-
               | accepted nomenclature of snapshots meaning a read-only
               | copy of the data.
               | 
               | I would assume the etymology of the file system concept
               | of a "snapshot" derives from photography, where something
               | is frozen at a particular moment of time:
               | 
               | > _In computer systems, a snapshot is the state of a
               | system at a particular point in time. The term was coined
               | as an analogy to that in photography._ [...] _To avoid
               | downtime, high-availability systems may instead perform
               | the backup on a snapshot--a read-only copy of the data
               | set frozen at a point in time--and allow applications to
               | continue writing to their data. Most snapshot
               | implementations are efficient and can create snapshots in
               | O(1)._
               | 
               | *
               | https://en.wikipedia.org/wiki/Snapshot_(computer_storage)
               | 
               | * https://en.wikipedia.org/wiki/Snapshot_(photography)
        
           | orra wrote:
           | Sure, there's lots of room for improvement. IIRC, rebalancing
           | might be a WIP, finally?
           | 
           | But credit where credit is due: for a long time, ZFS has been
           | the only fit for purpose filesystem, if you care about the
           | integrity of your data.
        
             | the8472 wrote:
             | Afaik true rebalancing isn't in the works. Some limited
             | add-device and remove-vdev features are in progress but
             | AIUI they come with additional overhead and aren't as
             | flexible.
             | 
             | btrfs and bcachefs rebalance leave your pool as if you had
             | created it from scratch with the existing data and the new
             | layout.
        
           | e12e wrote:
           | > [ZFS is] only great due to the lack of competitors in the
           | checksummed-CoW-raid category.
           | 
           | You forgot robust native encryption, network transparent
           | dump/restore (ZFS send/receive) - and broad platform support
           | (not so much anymore).
           | 
           | For a while you could have a solid FS with encryption support
           | for your USB hd that could be safely used with Linux, *BSD,
           | Windows, Open/FOSS Solaris and MacOS.
        
             | josephg wrote:
             | Is it just the implementation of zfs which is owned by
             | oracle now? I wonder how hard it would be to write a
             | compatible clean room reimplementation of zfs in rust or
             | something, from the spec.
             | 
             | Even if it doesn't implement every feature from the real
             | zfs, it would still be handy for OS compatibility reasons.
        
               | nine_k wrote:
               | I would suppose it would take years of effort, and a lot
               | of testing in search of performance enhancements and
               | elimination of corner cases. Even if the code of the FS
               | itself is created in a provably correct manner (a very
               | tall order even with Rust), real hardware has a lot of
               | quirks which need to be addressed.
        
               | chasil wrote:
               | I wish the btrfs (and perhaps bcachefs) projects would
               | collaborate with OpenZFS to rewrite equivalent code that
               | they all used.
               | 
               | It might take years, but washing Sun out of OpenZFS is
               | the only thing that will free it.
        
               | mustache_kimono wrote:
               | OpenZFS is already free and open source. Linux kernel
               | developers should just stop punching themselves in face.
               | 
               | One way to solve the ZFS issue, Linus Torvalds could call
               | a meeting of project leadership, and say, "Can we all
               | agree that OpenZFS is not a derived work of Linux? It
               | seems pretty obvious to anyone who understands the
               | meaning of copyright term of art 'derived work' and the
               | origin of ZFS ... Good. We shall add a commit which
               | indicates such to the COPYING file [0], like we have for
               | programs that interface at the syscall boundary to clear
               | up any further confusion."
               | 
               | Can you imagine trying to bring a copyright infringement
               | suit (with no damages!) in such an instance?
               | 
               | The ZFS hair shirt is self-imposed by semi-religious
               | Linux wackadoos.
               | 
               | [0]: See, https://github.com/torvalds/linux/blob/master/L
               | ICENSES/excep...
        
               | AshamedCaptain wrote:
               | Even if you were to be able to say that OpenZFS is not a
               | derived work of Linux, all it would allow you to do is to
               | distribute OpenZFS. You would _still_ not be able to
               | distribute OpenZFS + Linux as a combined work.
               | 
               | (I am one of these guys who thinks what Ubuntu is doing
               | is crossing the line. To package two pieces of software
               | whose license forbids you from distributing their
               | combination in a way that "they are not combined but can
               | be combined with a single click" is stretching it too
               | much.)
               | 
               | It would be much simpler for Oracle to simply relicense
               | older versions of ZFS under another license.
        
               | mustache_kimono wrote:
               | > Even if you were to be able to say that OpenZFS is not
               | a derived work of Linux, all it would allow you to do is
               | to distribute OpenZFS. You would _still_ not be able to
               | distribute OpenZFS + Linux as a combined work.
               | 
               | Why? Linus said such modules and distribution were
               | acceptable re: AFS, _an instance which is directly on
               | point_. See: https://lkml.org/lkml/2003/12/3/228
        
               | AshamedCaptain wrote:
               | Where is he saying that you can distribute the combined
               | work? That would not only violate the GPL, it would also
               | violate AFS's license...
               | 
                | The only thing he's saying there is that he's not even
                | 100% sure whether the AFS module is a derived work or
                | not (if it were, it would be a violation _just to
                | distribute the module by itself_!). Imagine what his
                | opinion would be on someone distributing a kernel
                | already all but pre-linked with ZFS.
               | 
                | Not that it matters, since he's not the license author,
                | nor even the copyright holder these days...
        
               | mustache_kimono wrote:
               | > Where is he saying that you can distribute the combined
               | work?
               | 
               | What's your reasoning as to why one couldn't, if we grant
               | Linus's reasoning re: AFS as it applies to ZFS?
               | 
               | > Not that it matters, since he's not the license author
               | not even the copyright holder these days...
               | 
                | The Linux kernel community has seen fit to give its
                | assurances re: other clarifications/exceptions. See the
                | COPYING file.
        
               | rascul wrote:
               | Linus has some words on this matter:
               | 
               | > And honestly, there is no way I can merge any of the
               | ZFS efforts until I get an official letter from Oracle
               | that is signed by their main legal counsel or preferably
               | by Larry Ellison himself that says that yes, it's ok to
               | do so and treat the end result as GPL'd.
               | 
               | > Other people think it can be ok to merge ZFS code into
               | the kernel and that the module interface makes it ok, and
               | that's their decision. But considering Oracle's litigious
               | nature, and the questions over licensing, there's no way
               | I can feel safe in ever doing so.
               | 
               | > And I'm not at all interested in some "ZFS shim layer"
               | thing either that some people seem to think would isolate
               | the two projects. That adds no value to our side, and
               | given Oracle's interface copyright suits (see Java), I
               | don't think it's any real licensing win either.
               | 
               | https://www.realworldtech.com/forum/?threadid=189711&curp
               | ost...
        
               | mustache_kimono wrote:
               | > Linus has some words on this matter:
               | 
                | I hate to point this out, but this only demonstrates
                | that Linus Torvalds doesn't know much about copyright
                | law. Linus could just as easily say "I was wrong. Sorry!
                | As you all know -- IANAL. It's time we remedied this
                | stupid chapter in our history. After all, _I gave
                | similar assurances to the AFS module_ when it was open
                | sourced under a GPL-incompatible license in 2003."
               | 
               | Linus's other words on the matter[0]:
               | 
               | > But one gray area in particular is something like a
               | driver that was originally written for another operating
               | system (ie clearly not a derived work of Linux in
               | origin). At exactly what point does it become a derived
               | work of the kernel (and thus fall under the GPL)?
               | 
               | > THAT is a gray area, and _that_ is the area where I
               | personally believe that some modules may be considered to
               | not be derived works simply because they weren't designed
               | for Linux and don't depend on any special Linux
               | behaviour.
               | 
               | [0]: https://lkml.org/lkml/2003/12/3/228
        
               | kaba0 wrote:
               | > wonder how hard it would be to write a compatible clean
               | room reimplementation of zfs in rust or something, from
               | the spec
               | 
               | As for every non-trivial application - almost impossible.
        
               | 0x457 wrote:
               | Not exactly ZFS in Rust, but more like a replacement for
               | ZFS in Rust: https://github.com/redox-os/tfs
               | 
                | Work stalled, though. Not compatible, but I was working
               | on overlayfs for freebsd in rust, and it was not pleasant
               | at all. Can't imagine making an entire "real" file system
               | in Rust.
        
           | gigatexal wrote:
           | "Wastes" ram? That's a tunable my friend.
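            | 
            | For instance, on ZFS on Linux you can cap the ARC via the
            | zfs_arc_max module parameter (value in bytes; 4 GiB below
            | is just an illustrative number):
            | 
            | echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
            | 
            | or, to make it persistent, in /etc/modprobe.d/zfs.conf:
            | 
            | options zfs zfs_arc_max=4294967296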
        
             | viraptor wrote:
             | https://github.com/openzfs/zfs/issues/10516
             | 
             | The data goes through two caches instead of just page cache
             | or just arc as far as I understand it.
        
             | quotemstr wrote:
             | Can I totally disable ARC yet?
        
               | throw0101a wrote:
               | zfs set primarycache=none foo/bar
               | 
               | ?
               | 
               | Though this will amplify reads as even metadata will need
               | to be fetched from disk, so perhaps "=metadata" may be
               | better.
               | 
               | * https://openzfs.github.io/openzfs-
               | docs/man/7/zfsprops.7.html...
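                | 
                | The "=metadata" variant would be something like:
                | 
                | zfs set primarycache=metadata foo/bar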
        
               | vluft wrote:
               | I'm curious what your workflow is that not having any
               | disk caching would have acceptable performance.
        
               | 0x457 wrote:
                | A workflow where the person doesn't understand that the
                | RAM isn't wasted, and that it's just their usage-
                | reporting utility that is wrong. Imagine being mad at
                | the filesystem cache being stored in RAM.
        
               | quotemstr wrote:
               | The problem with ARC in ZFS on Linux is the double
               | caching. Linux already has a page cache. It doesn't need
               | ZFS to provide a second page cache. I want to store
               | things in the Linux page cache once, not once in the page
               | cache and once in ZFS's special-sauce cache.
               | 
               | If ARC is so good, it should be the general Linux page
               | cache algorithm.
        
           | mustache_kimono wrote:
           | > It's only great due to the lack of competitors in the
           | checksummed-CoW-raid category.
           | 
           |  _blinks eyes, shakes head_
           | 
           | "It's only great because it's the only thing that's figured
           | out how to do a hard thing really well" may be peak FOSS
           | entitlement syndrome.
           | 
            | Meanwhile, btrfs has rapidly gone nowhere, and, if you read
            | the comments on this PR, bcachefs would love even to reach
            | that nowhere/btrfs status, but is still years away.
           | 
           | ZFS fulfills the core requirement of a filesystem, which is
           | to store your data, such that when you read it back you can
           | be assured it was the data you stored. It's amazing we
           | continue to countenance systems that don't do this, simply
           | because not fulfilling this core requirement was once
           | considered acceptable.
        
             | Dylan16807 wrote:
             | I don't see what's entitled about the idea that "it
             | fulfills the core requirements" is enough to get it "good"
             | status but not "great" status. Even if that's really rare
             | among filesystems.
        
             | throw0101a wrote:
             | > _Meanwhile, btrfs has rapidly gone nowhere_ [...]
             | 
             | A reminder that it came out in 2009:
             | 
             | * https://en.wikipedia.org/wiki/Btrfs
             | 
             | (ext4 was declared stable in 2008.)
        
               | deepspace wrote:
               | Yes! File systems are hard. My prediction is that it will
               | be *at least* 10 years before this newfangled FS gains
               | both feature- and stability parity with BTRFS and ZFS.
               | 
               | Also, BTRFS (albeit a modified version) has been used
               | successfully in at least one commercial NAS (Synology),
               | for many years. I don't see how that counts as "gone
               | nowhere".
        
               | throw0101a wrote:
                | Have all the foot guns described in 2021 been fixed?
               | 
               | * https://arstechnica.com/gadgets/2021/09/examining-
               | btrfs-linu...
        
               | dnzm wrote:
                | Not sure about "all", but apart from that article being
                | more pissy than strictly necessary, RAID1 can now, in
                | fact, survive losing more than one disk. That is,
                | provided you use RAID1C3 or C4 (which keep 3 or 4
                | copies, rather than the default 2). Also, I'm not really
                | sure how RAID1 not surviving >1 disk failure is a slight
                | against btrfs; I think most filesystems would have
                | issues there...
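                | 
                | If memory serves, converting an existing filesystem is
                | just a rebalance with the new profile (raid1c3 needs a
                | reasonably recent kernel, 5.5+ I believe), roughly:
                | 
                | btrfs balance start -mconvert=raid1c3 -dconvert=raid1c3 /mnt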
               | 
               | As for the rest of the article -- the tone rubs me the
               | wrong way, and somehow considering a FS shit because you
               | couldn't be bothered to use the correct commands (the
               | scrub vs balance ranty bit) doesn't instill confidence in
               | me that the article is written in good faith.
               | 
               | I believe the writer's biggest hangup/footgunnage with
               | btrfs is still there: it's not zfs. Ymmv.
        
               | mustache_kimono wrote:
               | > Also, BTRFS (albeit a modified version) has been used
               | successfully in at least one commercial NAS (Synology),
               | for many years. I don't see how that counts as "gone
               | nowhere".
               | 
               | Excuse me for sounding glib. My point was btrfs isn't
               | considered a serious competitor to ZFS in many of the
                | spaces ZFS operates. Moreover, its inability to do
                | RAID5/6 after years of effort is just weird now.
        
           | ilyt wrote:
            | Yeah, the world decided that if you want resilience, just
            | replicating the data somewhere else is far preferable to
            | making the individual nodes more resilient.
        
         | rektide wrote:
          | Btrfs still highly recommends a raid1 mode for metadata, but
          | for the data itself, raid5 is fine.
          | 
          | I somewhat recall there being a little progress on trying to
          | fix the remaining "write hole" issues in the past year or two.
          | But in general, I think there's very little pressure to do so
          | because so very many people run raid5 for data already & it
          | works great. Getting metadata off raid1 is low priority, a
          | nice-to-have.
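          | 
          | That split is just mkfs options, e.g. (device names are
          | placeholders):
          | 
          | mkfs.btrfs -m raid1 -d raid5 /dev/sda /dev/sdb /dev/sdc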
        
           | kiririn wrote:
           | Raid5 works ok until you scrub. Even scrubbing one device at
           | a time is a barrage of random reads sustained for days at a
           | time
           | 
           | I'll very happily move back from MD raid 5 when linear scrub
           | for parity raid lands
        
           | tremon wrote:
           | Still, even with raid1 for metadata and raid5 for data, the
           | kernel still shouts at you about it being EXPERIMENTAL every
           | time you mount such a filesystem. I understand that it's best
            | to err on the side of caution, but that notice does a good
            | job of perpetuating the idea that btrfs isn't ready for
            | prime-time use.
           | 
            | I use btrfs on most of my Linux systems now (though only one
            | with raid5), except for backup disks and backup volumes:
            | those I intend to keep on ext4 indefinitely.
        
         | sedatk wrote:
         | > btrfs still doesn't have reliable RAID5
         | 
         | Synology offers btrfs + RAID5 without warning the user. I
         | wonder why they're so confident with it.
        
           | bestham wrote:
            | They are running btrfs on top of DM.
           | https://kb.synology.com/en-
           | nz/DSM/tutorial/What_was_the_RAID...
        
             | sedatk wrote:
             | Thanks for the link!
        
           | sporkle-feet wrote:
           | Synology doesn't use the btrfs raid - AIUI they layer non-
           | raid btrfs over raid LVM
        
       | IAmLiterallyAB wrote:
       | Here's a link to the Bcachefs site https://bcachefs.org/
       | 
       | I think it summarizes its features and strengths pretty well, and
       | it has a lot of good technical information.
        
         | sumtechguy wrote:
          | Does anyone know if there are any good links to current
          | benchmarks between the different types? My google-fu is only
          | finding stuff from 2019.
        
         | anentropic wrote:
         | I can't help reading this name as Bca-chefs
         | 
         | (...I realise it must be B-cache-fs)
        
           | p1mrx wrote:
           | Maybe we could call it b$fs
        
       | baobrien wrote:
       | huh, this is fun:
       | https://lore.kernel.org/lkml/ZFrBEsjrfseCUzqV@moria.home.lan...
       | 
       | There's a little x86-64 code generator in bcachefs to generate
       | some sort of btree unpacking code.
        
         | dathinab wrote:
          | This is also the point most likely to cause problems, both
          | for this patch series (which is only fixes and utilities
          | added to the kernel) and for bcachefs in general.
          | 
          | When the request reads like "bring back a function related to
          | memory management and code execution which could make
          | developing viruses easier (though it's not a vulnerability by
          | itself)", the default answer is nope .. noooope .. never.
          | (Which doesn't mean that it won't come back.)
          | 
          | It seems that while it's not strictly necessary to have this,
          | it does make a non-negligible performance difference.
        
           | viraptor wrote:
           | It would be really nice if he posted the difference
           | with/without the optimisation for context. I hope it's going
           | to be included in the explanation post he's planning.
        
             | kzrdude wrote:
              | It looks like the code generator is only available for x86
              | anyway, so it seems niche in that respect. I'm all for the
              | baseline having good performance, not just the special
              | case.
        
               | BenjiWiebe wrote:
               | He mentions he wants to make the same type of
               | optimization for ARM, so ARM+x86 certainly wouldn't be
               | niche.
               | 
               | I wouldn't even call x86 alone niche...
        
         | Permik wrote:
         | I'll be eagerly waiting for the upcoming optimization writeup
         | mentioned here:
         | https://lore.kernel.org/lkml/ZFyAr%2F9L3neIWpF8@moria.home.l...
        
           | mastax wrote:
           | Please post it on HN because I won't remember to go looking
           | for it.
        
         | dontlaugh wrote:
         | It's bad enough that the kernel includes a JIT for eBPF. Adding
         | more of them without hardware constraints and/or formal
         | verification seems like a bad idea to me.
        
           | baobrien wrote:
           | yeah, most of the kernel maintainers in that thread seem to
           | be against it. bcachefs does seem to also have a non-code-
           | generating implementation of this, as it runs on
           | architectures other than x86-64.
        
       | sporkle-feet wrote:
       | The feature that caught my eye is the concept of having different
       | targets.
       | 
       | A fast SSD can be set as the target for foreground writes, but
       | that data will be transparently copied in the background to a
       | "background" target, i.e. a large/slow disk.
       | 
       | If this works, it will be awesome.
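        | 
        | From skimming the docs, the setup looks roughly like this (the
        | labels and devices are made up; check bcachefs-tools for the
        | exact flags):
        | 
        | bcachefs format \
        |     --label=ssd.ssd1 /dev/nvme0n1 \
        |     --label=hdd.hdd1 /dev/sda \
        |     --foreground_target=ssd \
        |     --background_target=hdd \
        |     --promote_target=ssd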
        
         | viraptor wrote:
         | You can also have that at block level (which is where bcache
         | itself comes from). Facebook used it years ago and I had it on
         | an SSD+HDD laptop... a decade ago at least? Unless you want the
         | filesystem to know about it, it's ready to go now.
        
           | jwilk wrote:
           | Look up --write-mostly and --write-behind options in mdadm(8)
           | man page.
           | 
           | I can't recommend such a setup though. It works very poorly
           | for me.
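              | 
              | For reference, that setup is something like the following
              | (--write-behind needs a write-intent bitmap; devices are
              | placeholders):
              | 
              | mdadm --create /dev/md0 --level=1 --raid-devices=2 \
              |     --bitmap=internal /dev/nvme0n1p1 \
              |     --write-mostly --write-behind=8192 /dev/sda1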
        
             | saltcured wrote:
             | See the lvmcache(7) manpage, which I think may be what the
             | earlier poster was thinking of. It isn't an asymmetric RAID
             | mode, but a tiered caching scheme where you can, for
             | example, put a faster and smaller enterprise SSD in front
             | of a larger and slower bulk store. So you can have a large
             | bulk volume but the recently/frequently used blocks get the
             | performance of the fast cache volume.
             | 
             | I set it up in the past with an mdadm RAID1 array over SSDs
             | as a caching layer in front of another mdadm array over
             | HDDs. It performed quite well in a developer/compute
             | workstation environment.
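                | 
                | The setup was along these lines (VG/LV names are
                | placeholders; see lvmcache(7) for the exact
                | invocations):
                | 
                | lvcreate -n main -L 4T vg /dev/slow_hdd
                | lvcreate -n fast -L 100G vg /dev/fast_ssd
                | lvconvert --type cache --cachevol fast vg/main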
        
               | viraptor wrote:
               | I did mean bcache specifically.
               | https://www.kernel.org/doc/Documentation/bcache.txt
        
         | throw0101a wrote:
         | > _A fast SSD can be set as the target for foreground writes,
         | but that data will be transparently copied in the background to
         | a "background" target, i.e. a large/slow disk._
         | 
         | This is very similar in concept to (or an evolution of?) ZFS's
         | ZIL:
         | 
         | * https://www.servethehome.com/what-is-the-zfs-zil-slog-and-
         | wh...
         | 
         | * https://www.truenas.com/docs/references/zilandslog/
         | 
         | * https://www.45drives.com/community/articles/zfs-caching/
         | 
         | When this feature was first introduced to ZFS in the Solaris 10
         | days there was an interesting demo from a person at Sun that I
         | ran across: he was based in a Sun office on the US East Coast
         | where he did stuff, but had access to Sun lab equipment across
            | the US. He mounted iSCSI drives that were based in (IIRC)
            | Colorado as a ZFS pool, and was using them for Postgres
            | stuff: the performance was unsurprisingly not good. He then
            | added a local ZIL to the ZFS pool and got I/O that was not
            | too far off from some local (near-LAN) disks he was using
            | for another pool.
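            | 
            | (For anyone wanting to try it: adding a separate log device
            | is a one-liner, e.g. "zpool add tank log /dev/nvme0n1", or
            | "zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1" if you
            | want it mirrored. Device names are placeholders.)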
        
           | seized wrote:
            | ZIL is just a fast place to write the data for sync
            | operations. If everything is working, the ZIL is never read
            | from; ZFS uses RAM as that foreground bit.
           | 
           | Async writes on a default configuration don't hit the ZIL,
           | only RAM for a few seconds then disk. Sync writes are RAM to
           | ZIL, confirm write, then RAM to pool.
        
           | ThatPlayer wrote:
           | But ZIL is a cache, and not usable for long-term storage. If
           | I combine a 1TB SSD with a 1TB HDD, I get 1TB of usable
           | space. In bcachefs, that's 2TB of usable space.
           | 
           | Bcache (not bcachefs) is more equivalent to ZIL.
        
       | harvie wrote:
        | What I really miss when compared to ZFS is the ability to
        | create datasets. I really like to use ZFS datasets for LXC
        | containers. That way I can have a separate sub-btree for each
        | container with its own size limit, without having to create
        | partitions or LVs, format the filesystem, and then resize
        | everything when I need to grow the partition, or even
        | defragment the fs before shrinking it. With ZFS I can easily
        | give and take disk capacity to my containers without having to
        | do any multi-step operation that requires close attention to
        | prevent accidental data loss.
        | 
        | Basically I just state what size I want that subtree to be and
        | it happens without having to touch the underlying block
        | devices. Also I can change it anytime during runtime, extremely
        | easily. E.g.:
       | 
       | zfs set quota=42G tank/vps/my_vps
       | 
       | zfs set quota=32G tank/vps/my_vps
       | 
       | zfs set quota=23G tank/vps/my_other_vps
       | 
        | btrfs can kinda do this as well, but the commands are not as
        | straightforward as in ZFS.
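        | 
        | The rough btrfs equivalent, as far as I know, is qgroups on
        | subvolumes (paths here are just examples):
        | 
        | btrfs subvolume create /tank/vps/my_vps
        | 
        | btrfs quota enable /tank
        | 
        | btrfs qgroup limit 42G /tank/vps/my_vps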
       | 
        | Update: My bad. bcachefs seems to have subvolumes now. There is
        | also some quota support, but so far the documentation is a bit
        | lacking, so I'm not yet sure how to use it or whether it can be
        | configured per dataset.
        
       | layer8 wrote:
       | I parsed this as "BCA chefs" at first.
        
       | curt15 wrote:
       | For some reason VM and DB workloads are btrfs's Achilles heel but
       | ZFS seems to handle them pretty well (provided that a suitable
       | recordsize is set). How do they perform on bcachefs?
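        | 
        | (By "suitable recordsize" I mean something like "zfs set
        | recordsize=16K tank/db" for a database dataset, or 64K-128K for
        | VM images, so the record size roughly matches the workload's
        | I/O size; tank/db is just an example dataset.)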
        
         | candiddevmike wrote:
         | I've never had a problem with these on BTRFS with COW disabled
         | on their directories...
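          | 
          | (That's just "chattr +C /path/to/dir" on an empty directory,
          | so files created in it afterwards are nodatacow.)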
        
           | pongo1231 wrote:
           | The issue is that also disables many of the interesting
           | features of BTRFS for those files. No checksumming, no
           | snapshots and no compression. In comparison ZFS handles these
           | features just fine for those kinds of files without the
           | enormous performance / fragmentation issues of BTRFS (without
           | nodatacow).
        
             | [deleted]
        
       | MisterTea wrote:
        | Another filesystem I am interested in is GEFS - "good enough
        | fs" (rather, "great experimental file shredder" until stable
        | ;-). It's based on B-epsilon trees, a data structure which
        | wasn't around when ZFS was designed. The idea is to build a
        | ZFS-like fs without the size and complexity of ZFS. So far it's
        | Plan 9 only and not production ready, though there is a chance
        | it could be ported to OpenBSD, and a talk was given at NYC*BUG:
        | https://www.nycbug.org/index?action=view&id=10688
       | 
       | Code: http://shithub.us/ori/gefs/HEAD/info.html
        
       | voxadam wrote:
        | If you're interested in more detailed information about
        | bcachefs, I highly recommend checking out _bcachefs: Principles
        | of Operation_.[0]
       | 
        | Also, the original developer of bcachefs (as well as bcache),
        | Kent Overstreet, posts status updates from time to time on his
        | Patreon page.[1]
       | 
       | [0] https://bcachefs.org/bcachefs-principles-of-operation.pdf
       | 
       | [1] https://www.patreon.com/bcachefs
        
         | AceJohnny2 wrote:
         | Thanks for the links!
         | 
          | I was wondering if bcachefs is architected with NAND-flash
         | SSD hardware in mind (as recently highlighted on HN in the "Is
         | Sequential IO Dead In The Era Of The NVMe Drive" article [1]
         | [2]), to optimize IO and hardware lifecycle.
         | 
         | Skimming through the "bcachefs: Principles Of Operation" PDF,
         | it appears the answer is no.
         | 
         | [1] https://jack-vanlightly.com/blog/2023/5/9/is-sequential-
         | io-d...
         | 
         | [2] https://news.ycombinator.com/item?id=35878961
        
           | koverstreet wrote:
            | It is. There are also plans for ZNS SSD support.
        
       ___________________________________________________________________
       (page generated 2023-05-11 23:01 UTC)