[HN Gopher] Modern storage is plenty fast, but the APIs are bad ___________________________________________________________________ Modern storage is plenty fast, but the APIs are bad Author : harporoeder Score : 317 points Date : 2020-11-26 15:40 UTC (7 hours ago) (HTM) web link (itnext.io) (TXT) w3m dump (itnext.io) | papi_bichhu wrote: | For a moment, I thought it was my personal list of "I don't know | these" | kibwen wrote: | Prior discussion from /r/rust where the author is present to | answer questions: | https://www.reddit.com/r/rust/comments/k16j6x/modern_storage... | Ericson2314 wrote: | From the author's previous piece: | https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-wi... | | > Our CTO, Avi Kivity, made the case for async at the Core C++ | 2019 event. The bottom line is this; in modern multicore, multi- | CPU devices, the CPU itself is now basically a network, the | intercommunication between all the CPUs is another network, and | calls to disk I/O are effectively another. There are good reasons | why network programming is done asynchronously, and you should | consider that for your own application development too. > > It | fundamentally changes the way Linux applications are to be | designed: Instead of a flow of code that issues syscalls when | needed, that have to think about whether or not a file is ready, | they naturally become an event-loop that constantly add things to | a shared buffer, deals with the previous entries that completed, | rinse, repeat. | | As someone that's been working on FRP related things for a while | now, this feels very vindicating. :) | | I feel like as recently as a few years ago, the systems world was | content with its incremental hacks, but now the gap between the | traditional interfaces and hardware realities has become too | much, and a bigger redesign is afoot. | | Excited for what emerges! | mehrdadn wrote: | > in modern multicore, multi-CPU devices, the CPU itself is now | basically a network, the intercommunication between all the | CPUs is another network, and calls to disk I/O are effectively | another. | | Interesting take, and NUMA CPUs have felt networked to me when | I've used them, but typical multicore UMA CPUs sure haven't... | is there a reason to believe this will change (or already has), | or did the author mean to only talk about NUMA? | ddorian43 wrote: | Both, even multicore. Same ideas are used by Red Panda (open | source kafka++ clone). | jeffbee wrote: | I guess it depends on what you consider to be "typical". If | you look at a many-core chip from AMD you'll see a gradient | of access latency from one core to another. You'll see the | same on any Intel Skylake-X descendant, although the slope of | that gradient is less. Your software will need to be very | highly optimized already before you start to sweat the | difference, though. | masklinn wrote: | Here's what jeffbee is talking about: | https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep- | di... | jeffbee wrote: | Yep. If you are sweating a microsecond, 100 nanoseconds | is a significant chunk of your budget. For this and | possibly other reasons, a many-core CPU isn't always a | great choice for hosted storage. If your goal is to | export NVMe blocks over a network interface, you might be | better off with an easier-to-program 4- or 8-core CPU. I | don't like seeing 128 cores and a bunch of NVMe devices | in the same box because it just causes trouble.
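For readers who haven't seen the submit/reap pattern the ScyllaDB piece quoted above describes -- push requests into a shared submission queue, carry on with other work, then drain the completion queue -- a minimal sketch using liburing looks roughly like this. The file name, buffer size and queue depth are placeholders, error handling is abbreviated, and this illustrates the general pattern rather than any code from the article:

    /* event_loop_sketch.c -- submit a read, then reap its completion. */
    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void) {
        struct io_uring ring;
        if (io_uring_queue_init(64, &ring, 0) < 0) return 1;  /* queue depth 64 */

        int fd = open("data.bin", O_RDONLY);                  /* placeholder file */
        if (fd < 0) return 1;

        char buf[4096];
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);     /* 4 KiB at offset 0 */
        io_uring_submit(&ring);                               /* hand it to the kernel */

        /* ... do other work; a real loop keeps many requests in flight ... */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);                       /* reap one completion */
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }

Build with gcc event_loop_sketch.c -luring (io_uring_prep_read needs a reasonably recent kernel). glommio, the Rust library the article describes, is built on this same interface.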
| magicalhippo wrote: | With p2pdma[1], it seems you don't even need a fancy CPU | to push GB/s. | | [1]: https://www.youtube.com/watch?v=LDOlqgUZtHE (9:40) | hinkley wrote: | > that have to think about whether or not a file is ready, they | naturally become an event-loop that constantly add things to a | shared buffer, deals with the previous entries that completed, | rinse, repeat. | | This is the exception, not the rule, and it bugs me when APIs | default to this. Most consumers of data are not looking at the | stream of data, and in many cases where streaming is what I | want, there are tools and APIs for handling that outside of my | application logic. Much of the time I'm dealing with units of | data only after the entire unit has arrived. Because if the | message is not complete there is no forward progress to be | made. | | My tools should reflect that reality, not what's quickest for | the API writers to create. | | In fact, if I remember my queuing theory properly, | responsiveness is improved if the system prioritizes IO | operations that can be finished (eg, EOS, EOF) over processing | buffers for one that is still in the middle, which can't happen | with an event stream abstraction. | pjmlp wrote: | The problem is forcing everyone to program in an asynchronous way. | | WinRT tried to go down that route (only asynchronous APIs), to | drive developers into that path, but eventually they had to | support synchronous as well due to the resistance they received. | justicezyx wrote: | Before Herb Sutter's "free lunch is over", not many wrote | multi-threading code. Then today, everyone is writing multi- | threading code, one way or another; the majority indirectly | in newer languages like Go and Rust, or better frameworks | like Actors, message passing, coroutines, and some noble | souls who are capable enough, in classic pthread and other | threading APIs. | | Of course everyone is going to program in an async way; | that's how nature works. | | But it certainly will not be in a fashion that is repulsive | to you, just give it some time. Maybe 10 years. | pjmlp wrote: | Agreed, it is also why Microsoft was one of the biggest | contributors to the co-routines support in C++, and has | frameworks like Orleans and Coyote. | amelius wrote: | What I hate about Unix filesystems: the fact that you can't take | a drive, put it in another computer and have permissions | (user/group-ids) working instantly. Same for sharing over nfs. | | Of course, people have tried to solve this, but I think not well | enough. It's a huge amount of technical debt right there in the | systems we use every day. | bufferoverflow wrote: | Because permissions are an OS-enforced feature, not a | filesystem feature. If you have access to the drive, | permissions are meaningless. | amelius wrote: | Permissions (and ownership info) can be useful even if you | have complete access to a filesystem. | | By the way, assume you have root permission. How would you | replace a single file in a random tar-file, without changing | any of the permissions/userids/groupids inside the tar-file? | You can't untar it because the users inside the tar file | don't correspond with the ones on your system. So, you'll | have to use special tools, which is (only) one demonstration | of the inadequacy of the permissions mechanism of our | filesystems. | zackmorris wrote: | I agree with the premise, but disagree with the conclusion.
| | For a little background, my first computer was a Mac Plus around | 1985, and I remember doing file copy tests on my first hard drive | (an 80 MB) at over 1 MB/sec. If I remember correctly, SCSI could | do 5 MB/sec copies clear back in the mid-80s. So until we got | SSD, hard drive speed stayed within the same order of magnitude | for like 30 years (as most of you remember): | | http://chrislawson.net/writing/macdaniel/2k1120cl.shtml | | So the time to take our predictable deterministic synchronous | blocking business logic into the maze of asynchronous promise | spaghetti was a generation ago when hard drive speeds were two | orders of magnitude slower than today. | | In other words, fix the bad APIs. Please don't make us shift | paradigms. | | Now if we want to talk about some kind of compiled or graph- | oriented way of processing large numbers of files performantly | with some kind of async processing internally, then that's fine. | Note that this solution will mirror whatever we come up with for | network processing as well. That was the whole point of UNIX in | the first place, to treat file access and network access as the | same stream-oriented protocol. Which I think is the motive behind | taking file access into the same problematic async domain that | web development is having to deal with now. | | But really we should get the web back to the proven UNIX/Actor | model way of doing things with synchronous blocking I/O. | tobias3 wrote: | Intuitively one should be able to approach the max speed for | sequential reads via some tuning (queue/read_ahead_kb) even with | the traditional, blocking POSIX interface. This would require a | large enough read-ahead and large enough buffer size. Not | poisoning the page cache/manually managing the page cache is an | orthogonal issue and only relevant for some applications (and the | additional memory copy barely makes a difference in the OP's post). | | One advantage of using high level (Linux) kernel interfaces is | that this "automatically" gets faster with newer Linux versions | without a need for large application level changes. Maybe in a few | years we'll have an extra cache layer, or it stores to persistent | memory now. Linux will (slowly) improve and your application with | it. This won't happen if it is specifically tuned for Direct I/O | with Intel Optane in 2020. | | But yeah, random IO is (currently) another issue, and as said the | usual advice is to avoid it. And with the old API this still | holds. If one currently wants fast random IO one needs to use | io_uring/aio (with Direct-IO) or just live with the performance | not being optimal and hope that the page cache does more good | than bad (like PostgreSQL). | jorangreef wrote: | The page cache is not reliable, and actually does more bad than | good, especially in the case of PostgreSQL: | | https://www.usenix.org/conference/atc20/presentation/rebello | [deleted] | Thaxll wrote: | How do those modern APIs run on old HW? | quelsolaar wrote: | This is such a big deal! The assumptions made when IO APIs were | designed are so out-of-step with today's hardware that it really | is time for a big rethink. In graphics, the last 20 years | of API development have very much been focused on harnessing a | GPU that has again and again outgrown the CPU's ability to feed | it. So much has been learned, and we really need to apply this | to both storage and networking. | ivoras wrote: | > ...misconceptions...
Yet if you skim through specs of modern | NVMe devices you see commodity devices with latencies in the | microseconds range and several GB/s of throughput supporting | several hundred thousands random IOPS. So where's the disconnect? | | Whoa there... let's not compare devices with 20+ GB/s and | latencies in nanosecond ranges which translate to half a dozen | giga-ops per second (aka RAM) with any kind of flash-based | storage just yet. | hinkley wrote: | We've been using Ethernet cards for storage because the network | round trip to RAM over TCP/IP on another machine in the same | rack is far cheaper than accessing local storage. Latency | compared to that option is likely the most noteworthy | performance gain. | | My understanding of distributed computing history is that the | last time network>local storage happened was in the 80's, and | most of the rest of the history of computing, moving the data | physically closer to the point of usage has always been faster. | | Just as then, we've taken a pronounced software architecture | detour. This one has lasted much longer, but it can't really | last forever. With this new generation of storage, we'll | probably see a lot of people trotting out 90's era system | designs as if they are new ideas rather than just regression to | the mean. | | Same as it ever was. | wtallis wrote: | The article isn't exactly conflating RAM and flash; if it were, | the conclusions would be very different. A synchronous blocking | IO API is fine if you're working with nanosecond latencies of | RAM, or with storage that's as painfully slow and serial as a | mechanical hard drive. | | Flash is special in that its latency is still considerably | higher than that of DRAM, but its _throughput_ can get | reasonably close once you have more than a handful of SSDs in | your system (or if you 're willing to compare against the DRAM | bandwidth of a decade-old PC). Extracting the full throughput | from a flash SSD despite the higher-than-DRAM latency is what | requires more suitable APIs (if you're doing random IO; | sequential IO performance is easy). | gogopuppygogo wrote: | Sustainable read/write speeds are also different than peak on | SSD vs RAM. | wtallis wrote: | True, but that applies more to writes than reads. Most | real-world workloads do a lot more reads than writes, and | what writes they do perform can usually tolerate a lot of | buffering in the IO stack to further reduce the overall | impact of low write performance on the underlying storage. | jorangreef wrote: | The advertised bandwidth for RAM is not actually what you get | per-core, which is what you care about in practice. | | If you want to know the upper bound on your per-core RAM | bandwidth: | | 64 bytes (the size of a cache line) * 10 slots (in a CPU core's | LFB or line fill buffer) / 100ns (the typical cost of a cache | miss) * 1000000 * 1000 (to convert ns to ms to seconds) = | 6400000000 bytes per second = 5.96 GiB per second RAM bandwidth | per core | | There's no escaping that upper bound per core. | | Nanosecond RAM latencies don't help much when you're capped by | the line fill buffer and queuing delay kicks in spiking your | cache miss latencies. You can only fetch 10 lines at a time per | core and when you exceed your 5.96 GiB per second budget your | access times increase. 
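The arithmetic above can be restated compactly so it is easy to check; the 64-byte cache line, 10 line-fill-buffer slots and 100 ns miss cost are the assumptions stated in the comment, not measured values:

    /* lfb_bound.c -- back-of-envelope per-core bandwidth bound from the figures above. */
    #include <stdio.h>

    int main(void) {
        double line_bytes = 64.0;    /* one cache line */
        double lfb_slots  = 10.0;    /* outstanding misses per core */
        double miss_ns    = 100.0;   /* assumed cost of one cache miss, in ns */

        double bytes_per_s = line_bytes * lfb_slots / (miss_ns * 1e-9);
        printf("%.2f GB/s = %.2f GiB/s per core\n",
               bytes_per_s / 1e9,
               bytes_per_s / (1024.0 * 1024.0 * 1024.0));
        return 0;   /* prints: 6.40 GB/s = 5.96 GiB/s per core */
    }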
| | If you compare with NVMe SSD throughput plus Direct I/O plus | io_uring, around 32 GiB per second, and divide that by 10 | according to the difference in access latencies, then I think | the author is about right on target. The point they are making | is valid: it's the same order of magnitude. | wmf wrote: | What about prefetching? Tiger Lake gets over 20 GB/s per | core. https://www.anandtech.com/show/16084/intel-tiger-lake- | review... | jorangreef wrote: | Beats me! | throwaway_pdp09 wrote: | From your link | | > In the DRAM region we're actually seeing a large change | in behaviour of the new microarchitecture, with vastly | improved load bandwidth from a single core, increasing from | 14.8GB/S to 21GB/s | | Yeah, that's odd. But the article's really about cache, so | maybe it's a mistake. Next para says | | > More importantly, memory copies between cache lines and | memory read-writes within a cache line have respectively | improved from 14.8GB/s and 28GB/s to 20GB/s and 34.5GB/s. | | so it looks like it's talking about cache not RAM but... | _shrug_ | sgtnoodle wrote: | While I was in the hospital ICU earlier this year, I promised | myself I would build a Zen 3 desktop when it came out despite | my 10 year old desktop still working just fine. | | I've since bought all the pieces but the CPU; they are all | sold out. So I got a 6 core 3600XT in the interim. I bought | fairly high binned RAM and overclocked it to 3600MHz, and was | surprised to cap out at about 36GB/s throughput. Your 6GiB/s | per core explanation checks out for me! | jorangreef wrote: | Cool! I had a similar empirical experience working on a | Cauchy Reed-Solomon encoder awhile back, which is | essentially measuring xor speed, but I just couldn't get it | past 6 GiB/s per core either, until I guessed I was hitting | memory bandwidth limits. Only a few weeks ago I stumbled on | the actual formula to work it out! | throwaway_pdp09 wrote: | > capped by the line fill buffer and queuing delay kicks in | spiking your cache miss | | could you point me to a little reading material on this? I | know what an LFB is, more or less, but what is queueing delay, | and how does that relate to cache misses? Thanks. | jorangreef wrote: | Sure, I'm still pretty fuzzy on these things, but queueing | delay is Little's law: | https://en.wikipedia.org/wiki/Little's_law | | It means if a system can only do X of something per second, | then if you push the system past that, new arriving stuff | has to wait on existing work in the queue, and things take | longer than if the queue was empty. You can think of it | like a traffic jam and it applies to most systems. | | For example, our local radio station here in Cape Town | loves to talk about "queuing traffic" when they do the 8am | traffic report, and I always think of Little's law. | | Bufferbloat is another example of queueing delay, e.g. | where you fill the buffer of your network router, say with a | large Gmail attachment upload, and spike the network ping | times for everyone else sharing the same WiFi. | | Here is where I got the per-core bandwidth calculation | from: https://www.eidos.ic.i.u-tokyo.ac.jp/~tau/lecture/par | allel_d... | throwaway_pdp09 wrote: | Appreciated, thanks | gravypod wrote: | Depending on the storage technology the comparison to RAM is | not that far off. Intel is trying to market it that way anyway | [0]. It's obviously not RAM but it's not the <500GB 5200RPM | SATA 3GB/s disk I started programming on.
| | [0] - https://www.intel.com/content/www/us/en/architecture-and- | tec... | smcameron wrote: | Yeah, back in 2014, I worked at HP on storage drivers for | linux, and we got 1 million IOPS (4k random reads) on a | single controller, with SSDs, but we had to do some fairly | hairy stuff. This was back when NVME was new and we were | trying to do SCSI over PCIe. We set up multiple ring buffers | for command submission and command completion, one each per | CPU, and pinned threads to CPUs and were very careful to | avoid locking (e.g. spinlocks, etc.). I think we also had to | pin some userland processes to particular CPUs to avoid NUMA | induced bottlenecks. | | The thing is, up until this point, for the entire history of | computers, storage was so relatively slow compared to memory | and the CPU that drivers could be quite simple, chuck | requests and completions into queues managed by simple | locking, and the fraction of time that requests spent inside | the driver would still be negligible compared to the time | they spent waiting for the disks. If you could theoretically | make your driver infinitely fast, this would only amount to | maybe a 1% speedup. So there was no need to spend a lot of | time thinking about how to make the driver super efficient. | Until suddenly there was. | smcameron wrote: | Oh yeah, iirc, the 1M IOPS driver was a block driver. For | the SCSI over PCIe stuff, there was the big problem at the | time that the entire SCSI layer in the kernel was a | bottleneck, so you could make the driver as fast as you | wanted, but your requests were still coming through a | single queue managed by locks, so you were screwed. There | was a whole ton of work done by Christoph Hellwig, Jens | Axboe and others to make the SCSI layer "multiqueue" around | that time to fix that. | jeroenhd wrote: | I suppose _modern_ storage is fast, but how many servers are | running on storage this modern? None of mine are and my work dev | machine is still rocking a SATA 2.5" SSD. | | We're probably still a few years off from being able to switch to | this fast I/O yet. With the new game consoles switching over to | PCIe SSDs I expect the price of NVMe drives to drop over the next | few years until they're cheap enough that the majority of | computers are running NVMe drives. | | Even with SATA drives like mine though, there's really not that | much performance loss from doing IO operations. I've run my OS | with 8GiB of SSD swap in active use during debugging and while | the stutters are annoying and distracting, the computer didn't | grind to a halt like it would with spinning rust. Storage speed | has increased massively in the last five years, for the love of | god fellow developers, please make use of it when you can! | | That said, deferring IO until you're done still makes sense for | some consumer applications because cheap laptops are still being | sold with hard drives and those devices are probably the minimum | requirement you'll be serving. | wtallis wrote: | > I expect the price of NVMe drives to drop over the next few | years until they're cheap enough that the majority of computers | are running NVMe drives. | | Price no longer has anything to do with it. PC OEMs are simply | not shipping SATA SSDs any more, and major drive vendors have | started to discontinue their client (OEM) SATA SSD product | lines. We're just waiting for the SATA-based PC install base to | be retired. | im3w1l wrote: | My mobo has many more SATA slots than M.2. slots. 
I expect | there will be hybrid systems for quite a while. | wtallis wrote: | One SSD is sufficient for almost all consumer systems. The | only reason to want more than two SSDs is if you're re- | using at least one old tiny SSD in a new machine. SATA | ports will stick around in desktops only for the sake of | hard drives. There may be a few niches left where using | several SATA SSDs in a workstation still makes some kind of | sense, and obviously not all server platforms have migrated | to NVMe yet. But as far as influencing the direction and | design of consumer systems, SATA SSDs have only slightly | more relevance than optical disc drives. | p1necone wrote: | Drive price doesn't scale linearly with capacity, you can | save a fair bit of money sticking with multiple smaller | capacity drives vs one big one. | aidenn0 wrote: | I have 8 SATA SSDs in my workstation; are there motherboards | that could run a similar NVMe setup? | magicalhippo wrote: | You can get NVME PCIe cards which has on-board PCIe switch. | Random example, here's[1] one with 4 M.2 slots sharing an | x8 PCIe slot. | | Obviously sustained bandwidth is limited to that of | effectively two NVME devices, but if you're doing lots of | random I/O I guess it's a win. | | [1]: https://www.aliexpress.com/item/4000034598072.html | bhewes wrote: | Sure you could run 24 NVMes with highpoint pcie4 raids on a | trx40 board. But then most still have like 10 sata ports so | you can run those as well. It will be great when sata is | replaced by U.2 but who knows when that happens. | wtallis wrote: | I wasn't including the workstation market when I referred | to what PC OEMs are doing. | | Are you using 8 _consumer_ SATA SSDs in your workstation? | Is it for the sake of increased capacity, or for the sake | of increased performance? Because it 's pretty easy now to | match the performance of an 8-drive SATA RAID-0 with a | single NVMe drive, but 8TB consumer NVMe SSDs are still 50% | more expensive than 8TB consumer SATA SSDs. | | (Also, even 8 SATA ports is above average for consumer | motherboards; it looks like about 17% of the retail desktop | motherboard models currently on the market have at least 8 | SATA ports.) | aidenn0 wrote: | Increased capacity. I started with 4 spinning disks, | replaced them with SSDs a while ago and then grew it to | 8. | eightysixfour wrote: | Yes, using PCIe expansion cards. I know of an AMD board | that ships with 5 (3 on the board, 2 with a PCIe card). | Could easily add more. | fibers wrote: | is this with threadripper boards? | wtallis wrote: | 3 M.2 slots is common even on AMD's mainstream X570 and | B550 platforms. I don't know if any of those motherboards | also bundle riser cards for further M.2 PCIe SSDs, but | they do support PCIe bifurcation so you can run your GPU | at PCIe 4.0 x8 and use the second x16/x8 slot to run two | more SSDs in a passive riser purchased separately. | eightysixfour wrote: | No, just an x570, the MSI "Godlike". You can also just | buy PCIe cards with M2 slots for drives. | digikata wrote: | Right now it's a bit more specialized to storage oriented | server platforms that can run in the 10-40 NVMe devices. 
| You get this sort of imbalance where any one or two high | performance NVMe devices at full throughput can push more | I/O than a single high end network link | rcxdude wrote: | Well, if the whole async 'I/O is the bottleneck' principle | which was the refrain from a few years ago is actually true, | then servers running databases should be focusing on upgrading | their storage to these levels, since that's where the most bang | for buck comes from in terms of performance gain is (of course | now the big thing is running everything on clouds like AWS | where everything is dog slow and really expensive, so perhaps | it doesn't actually matter). (In fact the main reason for the | API changes covered in the article is because the CPU and RAM | can no longer run laps around storage). | scribu wrote: | > how many servers are running on storage this modern? | | AWS allows you to provision EC2 instances with NVMe, without | much fanfare. Cost is only ~20% more than SATA. | kazinator wrote: | This is a really poor article. Only in very rare circumstances | can developers change the API's. API's are not "bad"; they are | built to various important requirements. Only some of those | requirements have to do with performance. | | > _"Well, it is fine to copy memory here and perform this | expensive computation because it saves us one I /O operation, | which is even more expensive"._ | | "I/O operation" in fact refers to the API call, not to the raw | hardware operation. If the developer measured this and found it | true, how can it be a misconception? It may be caused by a "bad" | I/O API, but so what? The API is what it is. | | API's provide one requirement which is stability: keeping | applications working. That is king. You can't throw out API's | every two years due to hardware advancements. | | > _"If we split this into multiple files it will be slow because | it will generate random I /O patterns. We need to optimize this | for sequential access and read from a single file"_ | | Though solid state storage doesn't have track-to-track seek | times, the sequential-access-fast rule of thumb has not become | false. | | Random access may have to wastefully read larger blocks of the | data than are actually requested by the application. The unused | data gets cached, but if it's not going to be accessed any time | soon, it means that something else got wastefully bumped out of | the cache. Sequential access is likely to make use of an entire | block. | | Secondly, there is that API again. The underlying operating | system may provide a read-ahead mechanism which reduces its own | overheads, benefiting the application which structures its data | for sequential access, even if there is no inherent hardware- | level benefit. | | If there is any latency at all between the application and the | hardware, and if you can guess what the application is going to | read next, that's an opportunity to improve performance. You can | correctly guess what the application will read if you guess that | it is doing a sequential read, and the application makes that | come true. | cahooon wrote: | I didn't get the impression that the author was suggesting to | throw out the old APIs. It seems to me like the article is a | proof of concept of new approaches that could be added as new | APIs, only expected to be used by people who need them, using | an approach that takes advantage of modern storage technology. 
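The read-ahead point above is something applications can already act on with the existing blocking API: posix_fadvise(2) lets a program declare whether its access pattern is sequential or random so the kernel can size read-ahead accordingly. A minimal sketch, with a placeholder path and abbreviated error handling:

    /* fadvise_sketch.c -- hint the access pattern so the kernel can read ahead. */
    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("input.log", O_RDONLY);            /* placeholder path */
        if (fd < 0) return 1;

        /* We intend to scan the whole file front to back, so ask for
           aggressive readahead; POSIX_FADV_RANDOM would instead disable it. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        char buf[1 << 16];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0) {
            /* process n bytes */
        }
        close(fd);
        return 0;
    }

POSIX_FADV_DONTNEED can likewise be used to drop cached pages the program knows it will not revisit.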
| | > "Random access may have to wastefully read larger blocks of | the data than are actually requested by the application. The | unused data gets cached, but if it's not going to be accessed | any time soon, it means that something else got wastefully | bumped out of the cache. Sequential access is likely to make | use of an entire block." | | I may have misread it, but I thought he addressed this in the | article. | | > "Random access files take a position as an argument, meaning | there is no need to maintain a seek cursor. But more | importantly: they don't take a buffer as a parameter. Instead, | they use io_uring's pre-registered buffer area to allocate a | buffer and return to the user. That means no memory mapping, no | copying to the user buffer -- there is only a copy from the | device to the glommio buffer and the user get a reference | counted pointer to that. And because we know this is random | I/O, there is no need to read more data than what was | requested." | Cojen wrote: | I found this to be a good read, but I wish the author discussed | the pros/cons of bypassing the file system and using a block | device with direct I/O. I've found that with Optane drives the | performance is high enough that the extra load from the file | system (in terms of CPU) is significant. If the author was using | a file system (which I assume is the case) which was it? | bob1029 wrote: | One thing I have started to realize is that best case latency of | an NVMe storage device is starting to overlap with areas where | SpinWait could be more ideal than an async/await API. I am mostly | advocating for this from a mass parallel throughput perspective, | especially if batching is possible. | | I have started to play around with using LMAX Disruptor for | aggregating a program's disk I/O requests and executing them in | batches. This is getting into levels of throughput that are | incompatible with something like what the Task abstractions in | .NET enable. The public API of such an approach is synchronous as | a result of this design constraint. | | Software should always try to work with the physical hardware | capabilities. Modern SSDs are most ideally suited to arrangements | where all data is contained in an append-only log with each batch | written to disk representing a consistent snapshot. If you are | able to batch thousands of requests into a single byte array of | serialized modified nodes, you can append this onto disk so much | faster than if you force the SSD to make individual writes per | new/modified entity. | wtallis wrote: | On Linux, it's already a NVMe driver option to enable polling | for (high priority) IO completion rather than sleeping until an | interrupt. The latency of handling an interrupt and doing a | couple of context switches is higher than the best-case latency | for fast SSDs. The io_uring userspace API also has a polling | mode. | mehrdadn wrote: | I liked most of the piece, but some bits rubbed me the wrong way: | | > I was taken by surprise by the fact that although every one of | my peers is certainly extremely bright, most of them carried | misconceptions about how to best exploit the performance of | modern storage technology leading to suboptimal designs, even if | they were aware of the increasing improvements in storage | technology. | | > In the process of writing this piece I had the immense pleasure | of getting early access to one of the next generation Optane | devices, from Intel. 
| | The entire blog post is complaining about how great engineers | have misconception about modern storage technology and yet to | prove it the author had to obtain benchmarks from _early_ access | to _next-generation_ devices...?! And to top it off, from this we | conclude "the disconnect" is due to the _APIs_? Not, say, from | the possibility that such blazing-fast components may very well | not even _exist_ in users ' devices? I'm not saying the | conclusions are wrong, but the logic surely doesn't follow... and | honestly it's a little tasteless to criticize people's | understanding if you're going to base the criticism on things | they in all likelihood don't even have access to. | arka2147483647 wrote: | I read that as that it was what he had at hand when running the | tests. | | Also, if the speed and features are available as professional | grade devices today, it will be available everywhere in a few | years. | Miraste wrote: | Optane has been commercially available for five years already | and it's not used in any device I'm aware of. Assuming it | will find broad adoption at this point seems like a bad bet. | stingraycharles wrote: | I know a few optane deployments in finance, but other than | that, it seems incredibly difficult to justify the steep | price. | mehrdadn wrote: | 3 years I think: https://en.wikipedia.org/wiki/3D_XPoint | | > It was announced in July 2015 and is available on the | open market under brand names Optane (Intel) and | subsequently QuantX (Micron) since April 2017. | | For comparison, look at how many decades it took SSDs to | become commonplace: https://en.wikipedia.org/wiki/Solid- | state_drive#Flash-based_... | echlebek wrote: | Consumer NVMe devices can deliver GB/s I/O and hundreds of | thousands of iops. The article's point doesn't hinge on | Optane at all. | mistrial9 wrote: | please ask my several NVMe devices to take notice ! | actual performance under Linux OS is far less than that, | here | snovv_crash wrote: | I just tested my laptop with the Ubuntu benchmark tool on | the partition editor. 3.5GB/s read on 100MB chunks. | magicalhippo wrote: | Single-thread, single-queue performance is much lower | than the max with good NVMe devices. | | With increased concurrency and deeper queues, my Samsung | 960 Pro which has been running my Windows 10 desktop for | several years still can do 294k random 4k reads IOPS, and | 2.5GB/s sequential read. | rtkwe wrote: | Yeah how many people are running apps on servers served at all | or even partially by NVMe SSDs? Where I work for our on prem | stuff it's basically all network storage. | sleepydog wrote: | For network storage some of his points are even stronger. | Sure, the page cache becomes more useful as latency goes up | but it also becomes more important to send more I/O at once, | something that is hard to do with blocking APIs like read(2) | and write(2). The page cache is pretty good at optimizing | sequential I/O to do this, but not random I/O or workloads | where you need to sync(). | danuker wrote: | > network storage | | Do you mean cloud storage? as in, other people's computers? | karamanolev wrote: | I conjecture he means SANs, iSCSI, NFS, Fibre Channel and | other on-prem, but still not local to the server where the | compute is running. | aden1ne wrote: | It probably means NFS. | mikepurvis wrote: | It's a pretty common pattern to have a fleet of big beefy | VM hosts all backed by a single giant SAN on a 10gbe | switch. 
This lets you do things like seamlessly migrate a | VM from one host to another, or do a high availability | thing with multiple synchronized instances and automatic | failover (VMWare called this all "vMotion"). In any case, | lots of bandwidth to the storage, but also high latency, at | least relative to a locally-connected SATA or PCIe SSD. | | So yeah, if that's your setup, you don't have much of an | option in between your SAN and allocating an in-machine | ramdisk, which will be super fast and low latency, but also | extremely high cost. | tpurves wrote: | Why not consider nVME in this case then as cheaper than | RAM, slower than RAM, but faster than network storage? I | don't know how you handle concurrency btwn VMs or | virtualize that storage, but there must be some standard | for that? | mikepurvis wrote: | I think a lot of it depends what the machines are used | for. I'm not actually the IT department, but I believe in | my org, we started out with a SAN-backed high | availability cluster, because the immediate initial need | was getting basic infrastructure (wiki, source control, | etc) off of a dedicated machine that was a single point | of failure. | | But then down the road a different set of hosts were | brought online that had fast local storage, and those | were used for short term, throwaway environments like | Jenkins builders, where performance was far more | important than redundancy. | tonyarkles wrote: | I'm laughing a little bit because an old place I used to | work had a similar setup. The SAN/NAS/whatever it was was | pretty slow, provisioning VMs was slow, and as much as we | argued that we didn't need redundancy for a lot of our | VMs (they were semi-disposable), the IT department | refused to give us a fast non-redundant machine. | | And then one day the SAN blew up. Some kind of highly | unlikely situation where more disks failed in a 24h | period than it could handle, and we lost the entire | array. Most of the stuff was available on tapes, but | rebuilding the whole thing resulted in a significant | period of downtime for everyone. | | It ended up being a huge win for my team, since we had | been in the process of setting up Ansible scripts to | provision our whole system. We grabbed an old machine and | had our stuff back up and running in about 20 minutes, | while everyone else was manually reinstalling and | reconfiguring their stuff for days. | mikepurvis wrote: | Ha, that's awesome. Yeah, for the limited amount of stuff | I maintain, I really like the simple, single-file Ansible | script-- install these handful of packages, insert this | config file, set up this systemd service, and you're | done. I know it's a lot harder for larger, more | complicated systems where there's a lot of configuration | state that they're maintaining internal to themselves and | they want you to be setting up in-band using a web gui. | | I experienced this recently trying to get a Sentry 10 | cluster going-- it's now this giant docker-compose setup | with like 20 containers, and so I'm like "perfect, I'll | insert all the config in my container deployment setup | and then this will be trivially reproducible." Nope, | turns out the particular microservice that I was trying | to configure only uses its dedicated config file for | standalone/debugging purposes; when it's being launched | as part of the master system, everything is passed in at | runtime and can only be set up from within the main app. | Le sigh. 
| [deleted] | digikata wrote: | There are a series of hardware updates and software | bottlenecks to work through before access becomes more | common, the performance will bubble up from below starting | with more widespread NVMe devices, then faster NVMe over | fabrics hardware will become more common, and drivers, | hypervisors, filesystems, and storage apps will likely have | to rethink things to re-optimize. That means different times | in terms of showing up in the cloud/on-prem/etc. | smcleod wrote: | You can do pretty amazing things with well designed (onprem) | networked storage with NVMe drives/arrays. | | I replaced the company I was working with at the time's | traditional "enterprise" HPE SANs with standard linux servers | running a mix of NVMe and SATA SSDs that provided highly | available, low latency and decent throughput iSCSI via | network. | | Gen 1 back in 2014/2015 did something like 70K random 4k | read/write IOP/s per VM (running on Xen back then) and would | just keep scaling till you hit the clusters 4M~ IOP/s limit | (minus some overhead obviously). | | Gen 2 provided between 100K and 200K random 4k to each VM to | a limit of about 8M~ on the underlying units (which again | were very affordable and low maintenance). | | This provided very good storage performance (latency, | throughput and fast / minimally if at all disruptive fail- | over and recovery) for our apps, some of them were written in | highly blocking Python code and needed to be rewritten async | to get the most out of it, but it made a _huge_ (business | changing) difference and saved us an insane amount of money. | | These days I've moved into consulting and all the work I do | is on GCP and AWS but I do miss the hands on high performing | gear like that. | | Old stuff now but the links are https://www.dropbox.com/s/rdo | jhb399639e4k/lightning_san.pdf?... and | https://smcleod.net/tech/2015/07/24/scsi-benchmarking/ and | there's a few other now quite dated posts on there. | charrondev wrote: | We did a big upgrade upgrade a couple years ago moving all of | our DBs onto NVMe SSDs. We get significant improvements to | our query times. | | Fast SSDs are pretty cheap nowadays. | PaulDavisThe1st wrote: | There's more to life than "apps on servers". | | People doing audio/video work with lots of input streams can | also max out disk I/O throughput (quite easily before NVMe | SSDs; not so much anymore). | simcop2387 wrote: | This is starting to change a bit because of things like the | DPUs that companies are making. Basically it's an intelligent | PCI-e <=> network bridge that lets you emulate/share PCI-e | devices on the host while the actual hardware (NVMe storage, | GPU, etc.) is located elsewhere. This lets you reconfigure | the host in software without having to physically change the | hardware in the servers itself. It also lets you change the | way you have things in the server rack since everything | doesn't need to be able to physically fit into every other | server case. | | EDIT: informative article about DPUs, | https://www.servethehome.com/what-is-a-dpu-a-data- | processing... | blindm wrote: | Also: Modern storage is plenty fast, but also not reliable for | long term use. | | That is why I buy a new SSD every year and clone my current (worn | out) SSD to the new one. I have several old SSDs that started to | get unhealthy, well, according to my S.M.A.R.T utility that I | used to check them. I could probably get away with using an SSD | for another year, but will not risk the data loss. 
Anyone else do | this? | tjoff wrote: | The solution for this problem is RAID. Way cheaper and far | superior in terms of reliability to your solution. | | Or if that isn't an option (laptop?) a good backup solution | that runs daily or more often is also a better and cheaper | alternative. | | Drives may fail at any time, but they don't age the way your | post would suggest. | blindm wrote: | I've looked into RAID. It seems a bit complicated to use. Is | it trivial to create a RAID array in Linux with zero fuss and | the whole thing 'just working' with very little knowledge of | the filesystem itself other than it keeps your data 'safe' | and redundancy baked in? | dfinninger wrote: | Do you mean "use" or "setup"? RAID is trivial to use. Mount | the volume to a directory and use it like normal. | | The setup is a bit more involved, but really not that bad. | It's a couple commands to join a few disks in an array and | then you make a file system and mount it. | | https://www.digitalocean.com/community/tutorials/how-to- | crea... | marcolussetti wrote: | Every year seems like a very short lifespan, but I guess every | usecase is different. I definitely replace a drive when SMART is | starting to look bleak, but that is far more infrequent in my | usecase I guess. | blindm wrote: | > Every year seems like a very short lifespan | | Yes but I forgot to mention I do a lot of heavy writes to it. | It is common to see me creating a huge 20GB virtual machine | disk image, using it for a few hours, then deleting it, | before creating a new one in its place. I'm a huge | virtualization freak. | thekrendal wrote: | That's still nothing even if you do that 4x/day. | | Also just because you create a 20GB virtual disk does not | necessarily mean you're actually writing out 20GB to the | disk. | | Many SSDs and NVMEs are designed with total drive writes | per day in their specs. | | What is the wear method you're measuring by and what's the | threshold where you're replacing your drives? | blindm wrote: | > does not necessarily mean you're actually writing out | 20GB to the disk. | | You mean like preallocation? I think Virtualbox now does | that. In the past it didn't though, it just kept writing | a bunch of zeroes to the drive until it reached 20GB. | mmis1000 wrote: | Or probably the filesystem decides to do it? ReFS | will just eat adjacent 0s and assume you want a | fallocate here. | magicalhippo wrote: | > It is common to see me creating a huge 20GB virtual | machine disk image, using it for a few hours, then deleting | it | | The SSD in my current desktop, a Samsung 960 Pro 1TB, has a | warranty for 800 TBW or 5 years. So that's | 800/5/365.25*1000 ~= 438 GB per day, every single day. | | And it's been documented the Samsung drives can do a lot | more than the warranty is good for. | | Either you're doing something else weird, or you're not | really wearing them out. | | [1]: https://www.samsung.com/semiconductor/minisite/ssd/pro | duct/c... | NikolaeVarius wrote: | That is absolutely nothing in terms of the write endurance | for modern drives | Felk wrote: | I think having a backup solution is the better choice here. You | can use your SSDs until they die or become too slow, and you | won't lose your data if it breaks before you replace it after a | year | blindm wrote: | > I think having a backup solution is the better choice here | | Any particular provider you would recommend? I've looked into | backblaze but it seems a bit pricey.
Also: I am aware that | cloud based backup solutions have a very low failure rate in | terms of drives since they're probably using RAID | selectodude wrote: | $6/mo? | manigandham wrote: | Consumer SSDs have endurance ratings in TBW, which is terabytes | written over the lifespan. They're often in the 100s with some | drives over 1000. The faster drives also use MLC or TLC which | has lower latency, better endurance, and higher performance | than the higher-capacity QLC. | | For example the Samsung 1TB 970 PRO (not the 980 PRO) has a | 1200TBW rating with a 5 year warranty. That's 1.2M gigabytes | written or more than 600GB every day, and will usually handle | far more. | wtallis wrote: | No. Hardly anyone does this, because it's just conspicuous | consumption, not actually sensible. Have any of your SSDs ever | used even _half_ of their warrantied write endurance in a | single year? | foolmeonce wrote: | I would add a new drive with zfs mirroring and enable simple | compression. For most use cases it gets better read | performance, ok write performance, and can tolerate both of the | drives being a bit flaky so you can run it for a lot longer | than the new drive alone. | rcxdude wrote: | I've had one (very early and cheap) SSD fail on me. Other than | that I don't think I've seen or heard of any issues across a | large range of more modern SSDs. The reliability and endurance | issues which occurred on earlier SSDs no longer seem to be a | problem (this is in part because flash density has skyrocketed: | because each flash chip can operate more or less independently, | the more storage an SSD has the faster it can run and the more | write endurance it has). | zbrozek wrote: | What do you do that wears them out so fast? I've been running | the same NVMe disk as my daily driver since 2015 and it's not | showing any signs of degradation. | blindm wrote: | > What do you do that wears them out so fast? | | I forgot to mention I do a lot of heavy writes to it. It is | common to see me creating a huge 20GB virtual machine disk | image, using it for a few hours, then deleting it, before | creating a new one in its place. I'm a huge virtualization | freak. | ahupp wrote: | In a lot of these systems (at least VMWare back when I used | that, and Docker) you can clone an existing image with | copy-on-write. This is a lot faster and would avoid 20GB of | writes to spin up a new VM. | Blahah wrote: | I work with bioinformatics data and tend to switch out an | NVMe within 3-4 months. I'm usually maxing out read or write | for 12 out of 24 hours a day. The slowdown is rapid and very | noticeable. | wtallis wrote: | > The slowdown is rapid and very noticeable. | | That probably doesn't have anything to do with write | endurance of the flash memory. When your drive's flash is | mostly worn out, you will see latency affected as the drive | has to retry reads and use more complex error correction | schemes to recover your data. But there are several other | mechanisms by which an SSD's performance will degrade early | in its lifetime depending on the workload. Those | performance degradations are avoidable to some extent, and | are not permanent. | Blahah wrote: | So I can potentially recycle my used SSDs? | mehrdadn wrote: | I think you could give it a shot with ATA Secure Erasing | one of them and seeing if it performs faster. Although 4 | months at 50% utilization at (say) 2GB/s is some ~10PB of | I/O, so I'm not sure if I would expect what you're seeing | to be a temporary slowdown...
| wtallis wrote: | Almost certainly. | | Assuming these are consumer SSD, the most important way | to maintain good performance is to ensure that it gets | some idle time. Consumer SSDs are optimized for burst | performance rather than sustained performance, and almost | all use SLC write caching. Depending on the drive and how | full it is, the SLC cache will be somewhere between a few | GB up to about a fourth of the advertised capacity. You | may be filling up the cache if you write 20GB in one | shot, but the drive will flush that cache in the | background over the span of a minute or two at most if | you don't keep it too busy. | | The other good strategy to maintain SSD performance in | the face of a heavy write workload is to not let the | drive get full. Reserving an extra 10-15% of the drive's | capacity and simply not touching it will significantly | improve sustained write speeds. (Most enterprise SSD | product lines have versions that already do this; a 3.2TB | drive and a 3.84TB drive are usually identical hardware | but configured with different amounts of spare area.) | | If a drive has already been pushed into a degraded | performance state, then you can either erase the whole | drive or, if your OS makes proper use of TRIM commands, | you can simply delete files to free up space. Then let | the drive have a few minutes to clean things up behind | the scenes. | [deleted] | porpoise wrote: | On the one hand, a new SSD a year sounds extreme. | | On the other hand, how many years does each of us have left? | Ten? twenty? Thirty? Forty? Few of us can easily imagine | ourselves still alive and productive in forty years. So much of | what we do rests on an implicit assumption that we are going to | live for eternity, and starts to seem pointless when we | consider how short our existence is. | aszen wrote: | Very well said, there are times we lose the bigger picture of | our lives and instead start wasting times with pointless | stuff just to escape the reality of our lives. | CyberRabbi wrote: | Materialism (the dominant underlying philosophy of our | culture) keeps us away from that higher level consciousness. | It poisons our mental models and worldview. | gravypod wrote: | It will highly vary depending on use case. I have been using | the same SSD (Samsung 850 evo) since 2015. First used on my | gaming desktop, then on my college laptop, now in my gaming | desktop again. I just make sure to keep it at ~25% to ~50% | capacity to give the controller an easy time and I try to stick | to mostly read only workloads (gaming). SMART report from that | drive: https://pastebin.com/raw/HyPE6aHm | | For my disk for my exact use case: ~4 years of operation. 88% | of lifespan remaining. | | Your mileage will almost definitely vary. | LandR wrote: | I'm still on the SSD I bought 6 or 7 years ago as my OS drive. | | Haven't noticed a single issue on it. | ggm wrote: | Can somebody please write up modern SSD and state of the world | regarding data retention, modes, applicability for SSD | replacing spinning rust "on the shelf" offline... | wtallis wrote: | SSDs make no sense for offline archival. They're more | expensive than hard drives and will be for the foreseeable | future. You don't need the improved random IO performance or | power efficiency for a drive that's mostly sitting on a | shelf. | CyberRabbi wrote: | When ssds fail they don't lose your data, they just become | unwritable. What you're doing is unnecessary and wasteful. 
| magicalhippo wrote: | I've only had two SSDs fail on me, and in both cases they | died without any warning. Didn't get discovered during boot | or anything. Two different brands, very different uses. | | So while they _can_ fail in a graceful way, that's not been | my experience. ___________________________________________________________________ (page generated 2020-11-26 23:00 UTC)