[HN Gopher] Modern storage is plenty fast, but the APIs are bad ___________________________________________________________________ Modern storage is plenty fast, but the APIs are bad Author : harporoeder Score : 317 points Date : 2020-11-26 15:40 UTC (7 hours ago) (HTM) web link (itnext.io) (TXT) w3m dump (itnext.io) | papi_bichhu wrote: | For a moment, I thought it was my personal list of "I don't know | these" | kibwen wrote: | Prior discussion from /r/rust where the author is present to | answer questions: | https://www.reddit.com/r/rust/comments/k16j6x/modern_storage... | Ericson2314 wrote: | From the author's previous piece: | https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-wi... | | > Our CTO, Avi Kivity, made the case for async at the Core C++ | 2019 event. The bottom line is this; in modern multicore, multi- | CPU devices, the CPU itself is now basically a network, the | intercommunication between all the CPUs is another network, and | calls to disk I/O are effectively another. There are good reasons | why network programming is done asynchronously, and you should | consider that for your own application development too. > > It | fundamentally changes the way Linux applications are to be | designed: Instead of a flow of code that issues syscalls when | needed, that have to think about whether or not a file is ready, | they naturally become an event-loop that constantly add things to | a shared buffer, deals with the previous entries that completed, | rinse, repeat. | | As someone that's been working on FRP related things for a while | now, this feels very vindicating. :) | | I feel like as recently as a few years ago, the systems world was | content with its incremental hacks, but now the gap between the | traditional interfaces and hardware realities has become too | much, and a bigger redesign is afoot. | | Excited for what emerges! | mehrdadn wrote: | > in modern multicore, multi-CPU devices, the CPU itself is now | basically a network, the intercommunication between all the | CPUs is another network, and calls to disk I/O are effectively | another. | | Interesting take, and NUMA CPUs have felt networked to me when | I've used them, but typical multicore UMA CPUs sure haven't... | is there a reason to believe this will change (or already has), | or did the author mean to only talk about NUMA? | ddorian43 wrote: | Both, even multicore. Same ideas are used by Red Panda (open | source kafka++ clone). | jeffbee wrote: | I guess it depends on what you consider to be "typical". If | you look at a many-core chip from AMD you'll see a gradient | of access latency from one core to another. You'll see the | same on any Intel Skylake-X descendant, although the slope of | that gradient is less. Your software will need to be very | highly optimized already before you start to sweat the | difference, though. | masklinn wrote: | Here's what jeffbee is talking about: | https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep- | di... | jeffbee wrote: | Yep. If you are sweating a microsecond, 100 nanoseconds | is a significant chunk of your budget. For this and | possibly other reasons, a many-core CPU isn't always a | great choice for hosted storage. If your goal is to | export NVMe blocks over a network interface, you might be | better off with an easier-to-program 4- or 8-core CPU. I | don't like seeing 128 cores and a bunch of NVMe devices | in the same box because it just causes trouble.
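For readers who haven't seen the submit/reap pattern the ScyllaDB piece quoted above describes -- push requests into a shared submission queue, carry on with other work, then drain the completion queue -- a minimal sketch using liburing looks roughly like this. The file name, buffer size and queue depth are placeholders, error handling is abbreviated, and this illustrates the general pattern rather than any code from the article:

    /* event_loop_sketch.c -- submit a read, then reap its completion. */
    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void) {
        struct io_uring ring;
        if (io_uring_queue_init(64, &ring, 0) < 0) return 1;  /* queue depth 64 */

        int fd = open("data.bin", O_RDONLY);                  /* placeholder file */
        if (fd < 0) return 1;

        char buf[4096];
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);     /* 4 KiB at offset 0 */
        io_uring_submit(&ring);                               /* hand it to the kernel */

        /* ... do other work; a real loop keeps many requests in flight ... */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);                       /* reap one completion */
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }

Build with gcc event_loop_sketch.c -luring (io_uring_prep_read needs a reasonably recent kernel). glommio, the Rust library the article describes, is built on this same interface.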
| magicalhippo wrote: | With p2pdma[1], it seems you don't even need a fancy CPU | to push GB/s. | | [1]: https://www.youtube.com/watch?v=LDOlqgUZtHE (9:40) | hinkley wrote: | > that have to think about whether or not a file is ready, they | naturally become an event-loop that constantly add things to a | shared buffer, deals with the previous entries that completed, | rinse, repeat. | | This is the exception, not the rule, and it bugs me when APIs | default to this. Most consumers of data are not looking at the | stream of data, and in many cases where streaming is what I | want, there are tools and APIs for handling that outside of my | application logic. Much of the time I'm dealing with units of | data only after the entire unit has arrived. Because if the | message is not complete there is no forward progress to be | made. | | My tools should reflect that reality, not what's quickest for | the API writers to create. | | In fact, if I remember my queuing theory properly, | responsiveness is improved if the system prioritizes IO | operations that can be finished (eg, EOS, EOF) over processing | buffers for one that is still in the middle, which can't happen | with an event stream abstraction. | pjmlp wrote: | The problem is forcing everyone to program in an asynchronous way. | | WinRT tried to go down that route (only asynchronous APIs), to | drive developers into that path, but eventually they had to | support synchronous as well due to the resistance they received. | justicezyx wrote: | Before Herb Sutter's "free lunch is over", not many wrote | multi-threading code. Then today, everyone is writing multi- | threading code, one way or another; the majority indirectly | in newer languages like Go and Rust, or better frameworks | like Actors, message passing, coroutines, and some noble | souls who are capable enough, in classic pthread and other | threading APIs. | | Of course everyone is going to program in an async way; | that's how nature works. | | But it certainly will not be in a fashion that is repulsive | to you, just give it some time. Maybe 10 years. | pjmlp wrote: | Agreed, it is also why Microsoft was one of the biggest | contributors to the co-routines support in C++, and has | frameworks like Orleans and Coyote. | amelius wrote: | What I hate about Unix filesystems: the fact that you can't take | a drive, put it in another computer and have permissions | (user/group-ids) working instantly. Same for sharing over nfs. | | Of course, people have tried to solve this, but I think not well | enough. It's a huge amount of technical debt right there in the | systems we use every day. | bufferoverflow wrote: | Because permissions are an OS-enforced feature, not a | filesystem feature. If you have access to the drive, | permissions are meaningless. | amelius wrote: | Permissions (and ownership info) can be useful even if you | have complete access to a filesystem. | | By the way, assume you have root permission. How would you | replace a single file in a random tar-file, without changing | any of the permissions/userids/groupids inside the tar-file? | You can't untar it because the users inside the tar file | don't correspond with the ones on your system. So, you'll | have to use special tools, which is (only) one demonstration | of the inadequacy of the permissions mechanism of our | filesystems. | zackmorris wrote: | I agree with the premise, but disagree with the conclusion.
| | For a little background, my first computer was a Mac Plus around | 1985, and I remember doing file copy tests on my first hard drive | (an 80 MB) at over 1 MB/sec. If I remember correctly, SCSI could | do 5 MB/sec copies clear back in the mid-80s. So until we got | SSD, hard drive speed stayed within the same order of magnitude | for like 30 years (as most of you remember): | | http://chrislawson.net/writing/macdaniel/2k1120cl.shtml | | So the time to take our predictable deterministic synchronous | blocking business logic into the maze of asynchronous promise | spaghetti was a generation ago when hard drive speeds were two | orders of magnitude slower than today. | | In other words, fix the bad APIs. Please don't make us shift | paradigms. | | Now if we want to talk about some kind of compiled or graph- | oriented way of processing large numbers of files performantly | with some kind of async processing internally, then that's fine. | Note that this solution will mirror whatever we come up with for | network processing as well. That was the whole point of UNIX in | the first place, to treat file access and network access as the | same stream-oriented protocol. Which I think is the motive behind | taking file access into the same problematic async domain that | web development is having to deal with now. | | But really we should get the web back to the proven UNIX/Actor | model way of doing things with synchronous blocking I/O. | tobias3 wrote: | Intuitively one should be able to approach the max speed for | sequential reads via some tuning (queue/read_ahead_kb) even with | the traditional, blocking POSIX interface. This would require a | large enough read-ahead and large enough buffer size. Not | poisoning the page cache/manually managing the page cache is an | orthogonal issue and only relevant for some applications (and the | additional memory copy barely makes a difference in the OP's post). | | One advantage of using high level (Linux) kernel interfaces is | that this "automatically" gets faster with newer Linux versions | without a need for large application level changes. Maybe in a few | years we'll have an extra cache layer, or it stores to persistent | memory now. Linux will (slowly) improve and your application with | it. This won't happen if it is specifically tuned for Direct I/O | with Intel Optane in 2020. | | But yeah, random IO is (currently) another issue, and as said the | usual advice is to avoid it. And with the old API this still | holds. If one currently wants fast random IO one needs to use | io_uring/aio (with Direct-IO) or just live with the performance | not being optimal and hope that the page cache does more good | than bad (like PostgreSQL). | jorangreef wrote: | The page cache is not reliable, and actually does more bad than | good, especially in the case of PostgreSQL: | | https://www.usenix.org/conference/atc20/presentation/rebello | [deleted] | Thaxll wrote: | How do those modern APIs run on old HW? | quelsolaar wrote: | This is such a big deal! The assumptions made when IO APIs were | designed are so out-of-step with today's hardware that it really | is time for a big rethink. In graphics, the last 20 years | of API development have very much been focused on harnessing a | GPU that has again and again outgrown the CPU's ability to feed | it. So much has been learned, and we really need to apply this | to both storage and networking. | ivoras wrote: | > ...misconceptions...
Yet if you skim through specs of modern | NVMe devices you see commodity devices with latencies in the | microseconds range and several GB/s of throughput supporting | several hundred thousands random IOPS. So where's the disconnect? | | Whoa there... let's not compare devices with 20+ GB/s and | latencies in nanosecond ranges which translate to half a dozen | giga-ops per second (aka RAM) with any kind of flash-based | storage just yet. | hinkley wrote: | We've been using Ethernet cards for storage because the network | round trip to RAM over TCP/IP on another machine in the same | rack is far cheaper than accessing local storage. Latency | compared to that option is likely the most noteworthy | performance gain. | | My understanding of distributed computing history is that the | last time network>local storage happened was in the 80's, and | most of the rest of the history of computing, moving the data | physically closer to the point of usage has always been faster. | | Just as then, we've taken a pronounced software architecture | detour. This one has lasted much longer, but it can't really | last forever. With this new generation of storage, we'll | probably see a lot of people trotting out 90's era system | designs as if they are new ideas rather than just regression to | the mean. | | Same as it ever was. | wtallis wrote: | The article isn't exactly conflating RAM and flash; if it were, | the conclusions would be very different. A synchronous blocking | IO API is fine if you're working with nanosecond latencies of | RAM, or with storage that's as painfully slow and serial as a | mechanical hard drive. | | Flash is special in that its latency is still considerably | higher than that of DRAM, but its _throughput_ can get | reasonably close once you have more than a handful of SSDs in | your system (or if you 're willing to compare against the DRAM | bandwidth of a decade-old PC). Extracting the full throughput | from a flash SSD despite the higher-than-DRAM latency is what | requires more suitable APIs (if you're doing random IO; | sequential IO performance is easy). | gogopuppygogo wrote: | Sustainable read/write speeds are also different than peak on | SSD vs RAM. | wtallis wrote: | True, but that applies more to writes than reads. Most | real-world workloads do a lot more reads than writes, and | what writes they do perform can usually tolerate a lot of | buffering in the IO stack to further reduce the overall | impact of low write performance on the underlying storage. | jorangreef wrote: | The advertised bandwidth for RAM is not actually what you get | per-core, which is what you care about in practice. | | If you want to know the upper bound on your per-core RAM | bandwidth: | | 64 bytes (the size of a cache line) * 10 slots (in a CPU core's | LFB or line fill buffer) / 100ns (the typical cost of a cache | miss) * 1000000 * 1000 (to convert ns to ms to seconds) = | 6400000000 bytes per second = 5.96 GiB per second RAM bandwidth | per core | | There's no escaping that upper bound per core. | | Nanosecond RAM latencies don't help much when you're capped by | the line fill buffer and queuing delay kicks in spiking your | cache miss latencies. You can only fetch 10 lines at a time per | core and when you exceed your 5.96 GiB per second budget your | access times increase. 
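The arithmetic above can be restated compactly so it is easy to check; the 64-byte cache line, 10 line-fill-buffer slots and 100 ns miss cost are the assumptions stated in the comment, not measured values:

    /* lfb_bound.c -- back-of-envelope per-core bandwidth bound from the figures above. */
    #include <stdio.h>

    int main(void) {
        double line_bytes = 64.0;    /* one cache line */
        double lfb_slots  = 10.0;    /* outstanding misses per core */
        double miss_ns    = 100.0;   /* assumed cost of one cache miss, in ns */

        double bytes_per_s = line_bytes * lfb_slots / (miss_ns * 1e-9);
        printf("%.2f GB/s = %.2f GiB/s per core\n",
               bytes_per_s / 1e9,
               bytes_per_s / (1024.0 * 1024.0 * 1024.0));
        return 0;   /* prints: 6.40 GB/s = 5.96 GiB/s per core */
    }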
| | If you compare with NVMe SSD throughput plus Direct I/O plus | io_uring, around 32 GiB per second, and divide that by 10 | according to the difference in access latencies, then I think | the author is about right on target. The point they are making | is valid: it's the same order of magnitude. | wmf wrote: | What about prefetching? Tiger Lake gets over 20 GB/s per | core. https://www.anandtech.com/show/16084/intel-tiger-lake- | review... | jorangreef wrote: | Beats me! | throwaway_pdp09 wrote: | From your link | | > In the DRAM region we're actually seeing a large change | in behaviour of the new microarchitecture, with vastly | improved load bandwidth from a single core, increasing from | 14.8GB/S to 21GB/s | | Yeah, that's odd. But the article's really about cache, so | maybe it's a mistake. Next para says | | > More importantly, memory copies between cache lines and | memory read-writes within a cache line have respectively | improved from 14.8GB/s and 28GB/s to 20GB/s and 34.5GB/s. | | so it looks like it's talking about cache not RAM but... | _shrug_ | sgtnoodle wrote: | While I was in the hospital ICU earlier this year, I promised | myself I would build a Zen 3 desktop when it came out despite | my 10 year old desktop still working just fine. | | I've since bought all the pieces but the CPU; they are all | sold out. So I got a 6 core 3600XT in the interim. I bought | fairly high binned RAM and overclocked it to 3600MHz, and was | surprised to cap out at about 36GB/s throughput. Your 6GiB/s | per core explanation checks out for me! | jorangreef wrote: | Cool! I had a similar empirical experience working on a | Cauchy Reed-Solomon encoder awhile back, which is | essentially measuring xor speed, but I just couldn't get it | past 6 GiB/s per core either, until I guessed I was hitting | memory bandwidth limits. Only a few weeks ago I stumbled on | the actual formula to work it out! | throwaway_pdp09 wrote: | > capped by the line fill buffer and queuing delay kicks in | spiking your cache miss | | could you point me to a little reading material on this? I | know what an LFB is, more or less, but what is queueing delay, | and how does that relate to cache misses? Thanks. | jorangreef wrote: | Sure, I'm still pretty fuzzy on these things, but queueing | delay is Little's law: | https://en.wikipedia.org/wiki/Little's_law | | It means if a system can only do X of something per second, | then if you push the system past that, new arriving stuff | has to wait on existing work in the queue, and things take | longer than if the queue was empty. You can think of it | like a traffic jam and it applies to most systems. | | For example, our local radio station here in Cape Town | loves to talk about "queuing traffic" when they do the 8am | traffic report, and I always think of Little's law. | | Bufferbloat is another example of queueing delay, e.g. | where you fill the buffer of your network router, say with a | large Gmail attachment upload, and spike the network ping | times for everyone else sharing the same WiFi. | | Here is where I got the per-core bandwidth calculation | from: https://www.eidos.ic.i.u-tokyo.ac.jp/~tau/lecture/par | allel_d... | throwaway_pdp09 wrote: | Appreciated, thanks | gravypod wrote: | Depending on the storage technology the comparison to RAM is | not that far off. Intel is trying to market it that way anyway | [0]. It's obviously not RAM but it's not the <500GB 5200RPM | SATA 3GB/s disk I started programming on.
| | [0] - https://www.intel.com/content/www/us/en/architecture-and- | tec... | smcameron wrote: | Yeah, back in 2014, I worked at HP on storage drivers for | linux, and we got 1 million IOPS (4k random reads) on a | single controller, with SSDs, but we had to do some fairly | hairy stuff. This was back when NVME was new and we were | trying to do SCSI over PCIe. We set up multiple ring buffers | for command submission and command completion, one each per | CPU, and pinned threads to CPUs and were very careful to | avoid locking (e.g. spinlocks, etc.). I think we also had to | pin some userland processes to particular CPUs to avoid NUMA | induced bottlenecks. | | The thing is, up until this point, for the entire history of | computers, storage was so relatively slow compared to memory | and the CPU that drivers could be quite simple, chuck | requests and completions into queues managed by simple | locking, and the fraction of time that requests spent inside | the driver would still be negligible compared to the time | they spent waiting for the disks. If you could theoretically | make your driver infinitely fast, this would only amount to | maybe a 1% speedup. So there was no need to spend a lot of | time thinking about how to make the driver super efficient. | Until suddenly there was. | smcameron wrote: | Oh yeah, iirc, the 1M IOPS driver was a block driver. For | the SCSI over PCIe stuff, there was the big problem at the | time that the entire SCSI layer in the kernel was a | bottleneck, so you could make the driver as fast as you | wanted, but your requests were still coming through a | single queue managed by locks, so you were screwed. There | was a whole ton of work done by Christoph Hellwig, Jens | Axboe and others to make the SCSI layer "multiqueue" around | that time to fix that. | jeroenhd wrote: | I suppose _modern_ storage is fast, but how many servers are | running on storage this modern? None of mine are and my work dev | machine is still rocking a SATA 2.5" SSD. | | We're probably still a few years off from being able to switch to | this fast I/O yet. With the new game consoles switching over to | PCIe SSDs I expect the price of NVMe drives to drop over the next | few years until they're cheap enough that the majority of | computers are running NVMe drives. | | Even with SATA drives like mine though, there's really not that | much performance loss from doing IO operations. I've run my OS | with 8GiB of SSD swap in active use during debugging and while | the stutters are annoying and distracting, the computer didn't | grind to a halt like it would with spinning rust. Storage speed | has increased massively in the last five years, for the love of | god fellow developers, please make use of it when you can! | | That said, deferring IO until you're done still makes sense for | some consumer applications because cheap laptops are still being | sold with hard drives and those devices are probably the minimum | requirement you'll be serving. | wtallis wrote: | > I expect the price of NVMe drives to drop over the next few | years until they're cheap enough that the majority of computers | are running NVMe drives. | | Price no longer has anything to do with it. PC OEMs are simply | not shipping SATA SSDs any more, and major drive vendors have | started to discontinue their client (OEM) SATA SSD product | lines. We're just waiting for the SATA-based PC install base to | be retired. | im3w1l wrote: | My mobo has many more SATA slots than M.2. slots. 
I expect | there will be hybrid systems for quite a while. | wtallis wrote: | One SSD is sufficient for almost all consumer systems. The | only reason to want more than two SSDs is if you're re- | using at least one old tiny SSD in a new machine. SATA | ports will stick around in desktops only for the sake of | hard drives. There may be a few niches left where using | several SATA SSDs in a workstation still makes some kind of | sense, and obviously not all server platforms have migrated | to NVMe yet. But as far as influencing the direction and | design of consumer systems, SATA SSDs have only slightly | more relevance than optical disc drives. | p1necone wrote: | Drive price doesn't scale linearly with capacity, you can | save a fair bit of money sticking with multiple smaller | capacity drives vs one big one. | aidenn0 wrote: | I have 8 SATA SSDs in my workstation; are there motherboards | that could run a similar NVMe setup? | magicalhippo wrote: | You can get NVME PCIe cards which has on-board PCIe switch. | Random example, here's[1] one with 4 M.2 slots sharing an | x8 PCIe slot. | | Obviously sustained bandwidth is limited to that of | effectively two NVME devices, but if you're doing lots of | random I/O I guess it's a win. | | [1]: https://www.aliexpress.com/item/4000034598072.html | bhewes wrote: | Sure you could run 24 NVMes with highpoint pcie4 raids on a | trx40 board. But then most still have like 10 sata ports so | you can run those as well. It will be great when sata is | replaced by U.2 but who knows when that happens. | wtallis wrote: | I wasn't including the workstation market when I referred | to what PC OEMs are doing. | | Are you using 8 _consumer_ SATA SSDs in your workstation? | Is it for the sake of increased capacity, or for the sake | of increased performance? Because it 's pretty easy now to | match the performance of an 8-drive SATA RAID-0 with a | single NVMe drive, but 8TB consumer NVMe SSDs are still 50% | more expensive than 8TB consumer SATA SSDs. | | (Also, even 8 SATA ports is above average for consumer | motherboards; it looks like about 17% of the retail desktop | motherboard models currently on the market have at least 8 | SATA ports.) | aidenn0 wrote: | Increased capacity. I started with 4 spinning disks, | replaced them with SSDs a while ago and then grew it to | 8. | eightysixfour wrote: | Yes, using PCIe expansion cards. I know of an AMD board | that ships with 5 (3 on the board, 2 with a PCIe card). | Could easily add more. | fibers wrote: | is this with threadripper boards? | wtallis wrote: | 3 M.2 slots is common even on AMD's mainstream X570 and | B550 platforms. I don't know if any of those motherboards | also bundle riser cards for further M.2 PCIe SSDs, but | they do support PCIe bifurcation so you can run your GPU | at PCIe 4.0 x8 and use the second x16/x8 slot to run two | more SSDs in a passive riser purchased separately. | eightysixfour wrote: | No, just an x570, the MSI "Godlike". You can also just | buy PCIe cards with M2 slots for drives. | digikata wrote: | Right now it's a bit more specialized to storage oriented | server platforms that can run in the 10-40 NVMe devices. 
| You get this sort of imbalance where any one or two high | performance NVMe devices at full throughput can push more | I/O than a single high end network link | rcxdude wrote: | Well, if the whole async 'I/O is the bottleneck' principle | which was the refrain from a few years ago is actually true, | then servers running databases should be focusing on upgrading | their storage to these levels, since that's where the most bang | for buck comes from in terms of performance gain is (of course | now the big thing is running everything on clouds like AWS | where everything is dog slow and really expensive, so perhaps | it doesn't actually matter). (In fact the main reason for the | API changes covered in the article is because the CPU and RAM | can no longer run laps around storage). | scribu wrote: | > how many servers are running on storage this modern? | | AWS allows you to provision EC2 instances with NVMe, without | much fanfare. Cost is only ~20% more than SATA. | kazinator wrote: | This is a really poor article. Only in very rare circumstances | can developers change the API's. API's are not "bad"; they are | built to various important requirements. Only some of those | requirements have to do with performance. | | > _"Well, it is fine to copy memory here and perform this | expensive computation because it saves us one I /O operation, | which is even more expensive"._ | | "I/O operation" in fact refers to the API call, not to the raw | hardware operation. If the developer measured this and found it | true, how can it be a misconception? It may be caused by a "bad" | I/O API, but so what? The API is what it is. | | API's provide one requirement which is stability: keeping | applications working. That is king. You can't throw out API's | every two years due to hardware advancements. | | > _"If we split this into multiple files it will be slow because | it will generate random I /O patterns. We need to optimize this | for sequential access and read from a single file"_ | | Though solid state storage doesn't have track-to-track seek | times, the sequential-access-fast rule of thumb has not become | false. | | Random access may have to wastefully read larger blocks of the | data than are actually requested by the application. The unused | data gets cached, but if it's not going to be accessed any time | soon, it means that something else got wastefully bumped out of | the cache. Sequential access is likely to make use of an entire | block. | | Secondly, there is that API again. The underlying operating | system may provide a read-ahead mechanism which reduces its own | overheads, benefiting the application which structures its data | for sequential access, even if there is no inherent hardware- | level benefit. | | If there is any latency at all between the application and the | hardware, and if you can guess what the application is going to | read next, that's an opportunity to improve performance. You can | correctly guess what the application will read if you guess that | it is doing a sequential read, and the application makes that | come true. | cahooon wrote: | I didn't get the impression that the author was suggesting to | throw out the old APIs. It seems to me like the article is a | proof of concept of new approaches that could be added as new | APIs, only expected to be used by people who need them, using | an approach that takes advantage of modern storage technology. 
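The read-ahead point above is something applications can already act on with the existing blocking API: posix_fadvise(2) lets a program declare whether its access pattern is sequential or random so the kernel can size read-ahead accordingly. A minimal sketch, with a placeholder path and abbreviated error handling:

    /* fadvise_sketch.c -- hint the access pattern so the kernel can read ahead. */
    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("input.log", O_RDONLY);            /* placeholder path */
        if (fd < 0) return 1;

        /* We intend to scan the whole file front to back, so ask for
           aggressive readahead; POSIX_FADV_RANDOM would instead disable it. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        char buf[1 << 16];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0) {
            /* process n bytes */
        }
        close(fd);
        return 0;
    }

POSIX_FADV_DONTNEED can likewise be used to drop cached pages the program knows it will not revisit.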
| | > "Random access may have to wastefully read larger blocks of | the data than are actually requested by the application. The | unused data gets cached, but if it's not going to be accessed | any time soon, it means that something else got wastefully | bumped out of the cache. Sequential access is likely to make | use of an entire block." | | I may have misread it, but I thought he addressed this in the | article. | | > "Random access files take a position as an argument, meaning | there is no need to maintain a seek cursor. But more | importantly: they don't take a buffer as a parameter. Instead, | they use io_uring's pre-registered buffer area to allocate a | buffer and return to the user. That means no memory mapping, no | copying to the user buffer -- there is only a copy from the | device to the glommio buffer and the user get a reference | counted pointer to that. And because we know this is random | I/O, there is no need to read more data than what was | requested." | Cojen wrote: | I found this to be a good read, but I wish the author discussed | the pros/cons of bypassing the file system and using a block | device with direct I/O. I've found that with Optane drives the | performance is high enough that the extra load from the file | system (in terms of CPU) is significant. If the author was using | a file system (which I assume is the case) which was it? | bob1029 wrote: | One thing I have started to realize is that best case latency of | an NVMe storage device is starting to overlap with areas where | SpinWait could be more ideal than an async/await API. I am mostly | advocating for this from a mass parallel throughput perspective, | especially if batching is possible. | | I have started to play around with using LMAX Disruptor for | aggregating a program's disk I/O requests and executing them in | batches. This is getting into levels of throughput that are | incompatible with something like what the Task abstractions in | .NET enable. The public API of such an approach is synchronous as | a result of this design constraint. | | Software should always try to work with the physical hardware | capabilities. Modern SSDs are most ideally suited to arrangements | where all data is contained in an append-only log with each batch | written to disk representing a consistent snapshot. If you are | able to batch thousands of requests into a single byte array of | serialized modified nodes, you can append this onto disk so much | faster than if you force the SSD to make individual writes per | new/modified entity. | wtallis wrote: | On Linux, it's already a NVMe driver option to enable polling | for (high priority) IO completion rather than sleeping until an | interrupt. The latency of handling an interrupt and doing a | couple of context switches is higher than the best-case latency | for fast SSDs. The io_uring userspace API also has a polling | mode. | mehrdadn wrote: | I liked most of the piece, but some bits rubbed me the wrong way: | | > I was taken by surprise by the fact that although every one of | my peers is certainly extremely bright, most of them carried | misconceptions about how to best exploit the performance of | modern storage technology leading to suboptimal designs, even if | they were aware of the increasing improvements in storage | technology. | | > In the process of writing this piece I had the immense pleasure | of getting early access to one of the next generation Optane | devices, from Intel. 
| | The entire blog post is complaining about how great engineers | have misconception about modern storage technology and yet to | prove it the author had to obtain benchmarks from _early_ access | to _next-generation_ devices...?! And to top it off, from this we | conclude "the disconnect" is due to the _APIs_? Not, say, from | the possibility that such blazing-fast components may very well | not even _exist_ in users ' devices? I'm not saying the | conclusions are wrong, but the logic surely doesn't follow... and | honestly it's a little tasteless to criticize people's | understanding if you're going to base the criticism on things | they in all likelihood don't even have access to. | arka2147483647 wrote: | I read that as that it was what he had at hand when running the | tests. | | Also, if the speed and features are available as professional | grade devices today, it will be available everywhere in a few | years. | Miraste wrote: | Optane has been commercially available for five years already | and it's not used in any device I'm aware of. Assuming it | will find broad adoption at this point seems like a bad bet. | stingraycharles wrote: | I know a few optane deployments in finance, but other than | that, it seems incredibly difficult to justify the steep | price. | mehrdadn wrote: | 3 years I think: https://en.wikipedia.org/wiki/3D_XPoint | | > It was announced in July 2015 and is available on the | open market under brand names Optane (Intel) and | subsequently QuantX (Micron) since April 2017. | | For comparison, look at how many decades it took SSDs to | become commonplace: https://en.wikipedia.org/wiki/Solid- | state_drive#Flash-based_... | echlebek wrote: | Consumer NVMe devices can deliver GB/s I/O and hundreds of | thousands of iops. The article's point doesn't hinge on | Optane at all. | mistrial9 wrote: | please ask my several NVMe devices to take notice ! | actual performance under Linux OS is far less than that, | here | snovv_crash wrote: | I just tested my laptop with the Ubuntu benchmark tool on | the partition editor. 3.5GB/s read on 100MB chunks. | magicalhippo wrote: | Single-thread, single-queue performance is much lower | than the max with good NVMe devices. | | With increased concurrency and deeper queues, my Samsung | 960 Pro which has been running my Windows 10 desktop for | several years still can do 294k random 4k reads IOPS, and | 2.5GB/s sequential read. | rtkwe wrote: | Yeah how many people are running apps on servers served at all | or even partially by NVMe SSDs? Where I work for our on prem | stuff it's basically all network storage. | sleepydog wrote: | For network storage some of his points are even stronger. | Sure, the page cache becomes more useful as latency goes up | but it also becomes more important to send more I/O at once, | something that is hard to do with blocking APIs like read(2) | and write(2). The page cache is pretty good at optimizing | sequential I/O to do this, but not random I/O or workloads | where you need to sync(). | danuker wrote: | > network storage | | Do you mean cloud storage? as in, other people's computers? | karamanolev wrote: | I conjecture he means SANs, iSCSI, NFS, Fibre Channel and | other on-prem, but still not local to the server where the | compute is running. | aden1ne wrote: | It probably means NFS. | mikepurvis wrote: | It's a pretty common pattern to have a fleet of big beefy | VM hosts all backed by a single giant SAN on a 10gbe | switch. 
This lets you do things like seamlessly migrate a | VM from one host to another, or do a high availability | thing with multiple synchronized instances and automatic | failover (VMWare called this all "vMotion"). In any case, | lots of bandwidth to the storage, but also high latency, at | least relative to a locally-connected SATA or PCIe SSD. | | So yeah, if that's your setup, you don't have much of an | option in between your SAN and allocating an in-machine | ramdisk, which will be super fast and low latency, but also | extremely high cost. | tpurves wrote: | Why not consider nVME in this case then as cheaper than | RAM, slower than RAM, but faster than network storage? I | don't know how you handle concurrency btwn VMs or | virtualize that storage, but there must be some standard | for that? | mikepurvis wrote: | I think a lot of it depends what the machines are used | for. I'm not actually the IT department, but I believe in | my org, we started out with a SAN-backed high | availability cluster, because the immediate initial need | was getting basic infrastructure (wiki, source control, | etc) off of a dedicated machine that was a single point | of failure. | | But then down the road a different set of hosts were | brought online that had fast local storage, and those | were used for short term, throwaway environments like | Jenkins builders, where performance was far more | important than redundancy. | tonyarkles wrote: | I'm laughing a little bit because an old place I used to | work had a similar setup. The SAN/NAS/whatever it was was | pretty slow, provisioning VMs was slow, and as much as we | argued that we didn't need redundancy for a lot of our | VMs (they were semi-disposable), the IT department | refused to give us a fast non-redundant machine. | | And then one day the SAN blew up. Some kind of highly | unlikely situation where more disks failed in a 24h | period than it could handle, and we lost the entire | array. Most of the stuff was available on tapes, but | rebuilding the whole thing resulted in a significant | period of downtime for everyone. | | It ended up being a huge win for my team, since we had | been in the process of setting up Ansible scripts to | provision our whole system. We grabbed an old machine and | had our stuff back up and running in about 20 minutes, | while everyone else was manually reinstalling and | reconfiguring their stuff for days. | mikepurvis wrote: | Ha, that's awesome. Yeah, for the limited amount of stuff | I maintain, I really like the simple, single-file Ansible | script-- install these handful of packages, insert this | config file, set up this systemd service, and you're | done. I know it's a lot harder for larger, more | complicated systems where there's a lot of configuration | state that they're maintaining internal to themselves and | they want you to be setting up in-band using a web gui. | | I experienced this recently trying to get a Sentry 10 | cluster going-- it's now this giant docker-compose setup | with like 20 containers, and so I'm like "perfect, I'll | insert all the config in my container deployment setup | and then this will be trivially reproducible." Nope, | turns out the particular microservice that I was trying | to configure only uses its dedicated config file for | standalone/debugging purposes; when it's being launched | as part of the master system, everything is passed in at | runtime and can only be set up from within the main app. | Le sigh. 
| [deleted] | digikata wrote: | There are a series of hardware updates and software | bottlenecks to work through before access becomes more | common, the performance will bubble up from below starting | with more widespread NVMe devices, then faster NVMe over | fabrics hardware will become more common, and drivers, | hypervisors, filesystems, and storage apps will likely have | to rethink things to re-optimize. That means different times | in terms of showing up in the cloud/on-prem/etc. | smcleod wrote: | You can do pretty amazing things with well designed (onprem) | networked storage with NVMe drives/arrays. | | I replaced the company I was working with at the time's | traditional "enterprise" HPE SANs with standard linux servers | running a mix of NVMe and SATA SSDs that provided highly | available, low latency and decent throughput iSCSI via | network. | | Gen 1 back in 2014/2015 did something like 70K random 4k | read/write IOP/s per VM (running on Xen back then) and would | just keep scaling till you hit the clusters 4M~ IOP/s limit | (minus some overhead obviously). | | Gen 2 provided between 100K and 200K random 4k to each VM to | a limit of about 8M~ on the underlying units (which again | were very affordable and low maintenance). | | This provided very good storage performance (latency, | throughput and fast / minimally if at all disruptive fail- | over and recovery) for our apps, some of them were written in | highly blocking Python code and needed to be rewritten async | to get the most out of it, but it made a _huge_ (business | changing) difference and saved us an insane amount of money. | | These days I've moved into consulting and all the work I do | is on GCP and AWS but I do miss the hands on high performing | gear like that. | | Old stuff now but the links are https://www.dropbox.com/s/rdo | jhb399639e4k/lightning_san.pdf?... and | https://smcleod.net/tech/2015/07/24/scsi-benchmarking/ and | there's a few other now quite dated posts on there. | charrondev wrote: | We did a big upgrade upgrade a couple years ago moving all of | our DBs onto NVMe SSDs. We get significant improvements to | our query times. | | Fast SSDs are pretty cheap nowadays. | PaulDavisThe1st wrote: | There's more to life than "apps on servers". | | People doing audio/video work with lots of input streams can | also max out disk I/O throughput (quite easily before NVMe | SSDs; not so much anymore). | simcop2387 wrote: | This is starting to change a bit because of things like the | DPUs that companies are making. Basically it's an intelligent | PCI-e <=> network bridge that lets you emulate/share PCI-e | devices on the host while the actual hardware (NVMe storage, | GPU, etc.) is located elsewhere. This lets you reconfigure | the host in software without having to physically change the | hardware in the servers itself. It also lets you change the | way you have things in the server rack since everything | doesn't need to be able to physically fit into every other | server case. | | EDIT: informative article about DPUs, | https://www.servethehome.com/what-is-a-dpu-a-data- | processing... | blindm wrote: | Also: Modern storage is plenty fast, but also not reliable for | long term use. | | That is why I buy a new SSD every year and clone my current (worn | out) SSD to the new one. I have several old SSDs that started to | get unhealthy, well, according to my S.M.A.R.T utility that I | used to check them. I could probably get away with using an SSD | for another year, but will not risk the data loss. 
Anyone else do | this? | tjoff wrote: | The solution for this problem is RAID. Way cheaper and far | superior in terms of reliability to your solution. | | Or if that isn't an option (laptop?) a good backup solution | that runs daily or more often is also a better and cheaper | alternative. | | Drives may fail at any time, but they don't age the way your | post would suggest. | blindm wrote: | I've looked into RAID. It seems a bit complicated to use. Is | it trivial to create a RAID array in Linux with zero fuss and | the whole thing 'just working' with very little knowledge of | the filesystem itself other than it keeps your data 'safe' | and redundancy baked in? | dfinninger wrote: | Do you mean "use" or "setup"? RAID is trivial to use. Mount | the volume to a directory and use it like normal. | | The setup is a bit more involved, but really not that bad. | It's a couple commands to join a few disks in an array and | then you make a file system and mount it. | | https://www.digitalocean.com/community/tutorials/how-to- | crea... | marcolussetti wrote: | Every year seems like a very short lifespan, but I guess every | usecase is different. I definitely replace a drive when SMART is | starting to look bleak, but that is far more infrequent in my | usecase I guess. | blindm wrote: | > Every year seems like a very short lifespan | | Yes but I forgot to mention I do a lot of heavy writes to it. | It is common to see me creating a huge 20GB virtual machine | disk image, using it for a few hours, then deleting it, | before creating a new one in its place. I'm a huge | virtualization freak. | thekrendal wrote: | That's still nothing even if you do that 4x/day. | | Also just because you create a 20GB virtual disk does not | necessarily mean you're actually writing out 20GB to the | disk. | | Many SSDs and NVMEs are designed with total drive writes | per day in their specs. | | What is the wear method you're measuring by and what's the | threshold where you're replacing your drives? | blindm wrote: | > does not necessarily mean you're actually writing out | 20GB to the disk. | | You mean like preallocation? I think Virtualbox now does | that. In the past it didn't though, it just kept writing | a bunch of zeroes to the drive until it reached 20GB. | mmis1000 wrote: | Or probably the filesystem decides to do it? ReFS | will just eat adjacent 0s and assume you want a | fallocate here. | magicalhippo wrote: | > It is common to see me creating a huge 20GB virtual | machine disk image, using it for a few hours, then deleting | it | | The SSD in my current desktop, a Samsung 960 Pro 1TB, has a | warranty for 800 TBW or 5 years. So that's | 800/5/365.25*1000 ~= 438 GB per day, every single day. | | And it's been documented the Samsung drives can do a lot | more than the warranty is good for. | | Either you're doing something else weird, or you're not | really wearing them out. | | [1]: https://www.samsung.com/semiconductor/minisite/ssd/pro | duct/c... | NikolaeVarius wrote: | That is absolutely nothing in terms of the write endurance | for modern drives | Felk wrote: | I think having a backup solution is the better choice here. You | can use your SSDs until they die or become too slow, and you | won't lose your data if it breaks before you replace it after a | year | blindm wrote: | > I think having a backup solution is the better choice here | | Any particular provider you would recommend? I've looked into | backblaze but it seems a bit pricey.
Also: I am aware that | cloud based backup solutions have a very low failure rate in | terms of drives since they're probably using RAID | selectodude wrote: | $6/mo? | manigandham wrote: | Consumer SSDs have endurance ratings in TBW, which is terabytes | written over the lifespan. They're often in the 100s with some | drives over 1000. The faster drives also use MLC or TLC which | has lower latency, better endurance, and higher performance | than the higher-capacity QLC. | | For example the Samsung 1TB 970 PRO (not the 980 PRO) has a | 1200TBW rating with a 5 year warranty. That's 1.2M gigabytes | written or more than 600GB every day, and will usually handle | far more. | wtallis wrote: | No. Hardly anyone does this, because it's just conspicuous | consumption, not actually sensible. Have any of your SSDs ever | used even _half_ of their warrantied write endurance in a | single year? | foolmeonce wrote: | I would add a new drive with zfs mirroring and enable simple | compression. For most use cases it gets better read | performance, ok write performance, and can tolerate both of the | drives being a bit flaky so you can run it for a lot longer | than the new drive alone. | rcxdude wrote: | I've had one (very early and cheap) SSD fail on me. Other than | that I don't think I've seen or heard of any issues across a | large range of more modern SSDs. The reliability and endurance | issues which occurred on earlier SSDs no longer seem to be a | problem (this is in part because flash density has skyrocketed: | because each flash chip can operate more or less independently, | the more storage an SSD has the faster it can run and the more | write endurance it has). | zbrozek wrote: | What do you do that wears them out so fast? I've been running | the same NVMe disk as my daily driver since 2015 and it's not | showing any signs of degradation. | blindm wrote: | > What do you do that wears them out so fast? | | I forgot to mention I do a lot of heavy writes to it. It is | common to see me creating a huge 20GB virtual machine disk | image, using it for a few hours, then deleting it, before | creating a new one in its place. I'm a huge virtualization | freak. | ahupp wrote: | In a lot of these systems (at least VMWare back when I used | that, and Docker) you can clone an existing image with | copy-on-write. This is a lot faster and would avoid 20GB of | writes to spin up a new VM. | Blahah wrote: | I work with bioinformatics data and tend to switch out an | NVMe within 3-4 months. I'm usually maxing out read or write | for 12 out of 24 hours a day. The slowdown is rapid and very | noticeable. | wtallis wrote: | > The slowdown is rapid and very noticeable. | | That probably doesn't have anything to do with write | endurance of the flash memory. When your drive's flash is | mostly worn out, you will see latency affected as the drive | has to retry reads and use more complex error correction | schemes to recover your data. But there are several other | mechanisms by which an SSD's performance will degrade early | in its lifetime depending on the workload. Those | performance degradations are avoidable to some extent, and | are not permanent. | Blahah wrote: | So I can potentially recycle my used SSDs? | mehrdadn wrote: | I think you could give it a shot with ATA Secure Erasing | one of them and seeing if it performs faster. Although 4 | months at 50% utilization at (say) 2GB/s is some ~10PB of | I/O, so I'm not sure if I would expect what you're seeing | to be a temporary slowdown...
| wtallis wrote: | Almost certainly. | | Assuming these are consumer SSD, the most important way | to maintain good performance is to ensure that it gets | some idle time. Consumer SSDs are optimized for burst | performance rather than sustained performance, and almost | all use SLC write caching. Depending on the drive and how | full it is, the SLC cache will be somewhere between a few | GB up to about a fourth of the advertised capacity. You | may be filling up the cache if you write 20GB in one | shot, but the drive will flush that cache in the | background over the span of a minute or two at most if | you don't keep it too busy. | | The other good strategy to maintain SSD performance in | the face of a heavy write workload is to not let the | drive get full. Reserving an extra 10-15% of the drive's | capacity and simply not touching it will significantly | improve sustained write speeds. (Most enterprise SSD | product lines have versions that already do this; a 3.2TB | drive and a 3.84TB drive are usually identical hardware | but configured with different amounts of spare area.) | | If a drive has already been pushed into a degraded | performance state, then you can either erase the whole | drive or, if your OS makes proper use of TRIM commands, | you can simply delete files to free up space. Then let | the drive have a few minutes to clean things up behind | the scenes. | [deleted] | porpoise wrote: | On the one hand, a new SSD a year sounds extreme. | | On the other hand, how many years does each of us have left? | Ten? twenty? Thirty? Forty? Few of us can easily imagine | ourselves still alive and productive in forty years. So much of | what we do rests on an implicit assumption that we are going to | live for eternity, and starts to seem pointless when we | consider how short our existence is. | aszen wrote: | Very well said, there are times we lose the bigger picture of | our lives and instead start wasting times with pointless | stuff just to escape the reality of our lives. | CyberRabbi wrote: | Materialism (the dominant underlying philosophy of our | culture) keeps us away from that higher level consciousness. | It poisons our mental models and worldview. | gravypod wrote: | It will highly vary depending on use case. I have been using | the same SSD (Samsung 850 evo) since 2015. First used on my | gaming desktop, then on my college laptop, now in my gaming | desktop again. I just make sure to keep it at ~25% to ~50% | capacity to give the controller an easy time and I try to stick | to mostly read only workloads (gaming). SMART report from that | drive: https://pastebin.com/raw/HyPE6aHm | | For my disk for my exact use case: ~4 years of operation. 88% | of lifespan remaining. | | Your mileage will almost definitely vary. | LandR wrote: | I'm still on the SSD I bought 6 or 7 years ago as my OS drive. | | Haven't noticed a single issue on it. | ggm wrote: | Can somebody please write up modern SSD and state of the world | regarding data retention, modes, applicability for SSD | replacing spinning rust "on the shelf" offline... | wtallis wrote: | SSDs make no sense for offline archival. They're more | expensive than hard drives and will be for the foreseeable | future. You don't need the improved random IO performance or | power efficiency for a drive that's mostly sitting on a | shelf. | CyberRabbi wrote: | When ssds fail they don't lose your data, they just become | unwritable. What you're doing is unnecessary and wasteful. 
| magicalhippo wrote: | I've only had two SSDs fail on me, and in both cases they | died without any warning. Didn't get discovered during boot | or anything. Two different brands, very different uses. | | So while they _can_ fail in a graceful way, that's not been | my experience. ___________________________________________________________________ (page generated 2020-11-26 23:00 UTC)