[HN Gopher] Achieving 11M IOPS and 66 GB/S IO on a Single Thread... ___________________________________________________________________ Achieving 11M IOPS and 66 GB/S IO on a Single ThreadRipper Workstation Author : tanelpoder Score : 231 points Date : 2021-01-29 12:45 UTC (10 hours ago) (HTM) web link (tanelpoder.com) (TXT) w3m dump (tanelpoder.com) | secondcoming wrote: | > For final tests, I even disabled the frequent gettimeofday | system calls that are used for I/O latency measurement | | I was knocking up some profiling code and measured the | performance of gettimeofday as a proof-of-concept test. | | The performance difference between running the test on my | personal desktop Linux VM versus running it on a cloud instance | Linux VM was quite interesting (cloud was worse) | | I think I read somewhere that cloud instances cannot use the VDSO | code path because your app may be moved to a different machine. | My recollection of the reason is somewhat cloudy. | ashkankiani wrote: | When I bought a bunch of NVME drives, I was disappointed with how | slow the maximum speed I could achieve with them was given my | knowledge and available time at the time. Thanks for making this | post to give me more points of insight into the problem. | | I'm on the same page with your thesis that "hardware is fast and | clusters are usually overkill," and disk I/O was a piece that I | hadn't really figured out yet despite making great strides in the | software engineering side of things. I'm trying to make a startup | this year and disk I/O will actually be a huge factor in how far | I can scale without bursting costs for my application. Good | stuff! | whalesalad wrote: | This post is fantastic. I wish there was more workstation porn | like this for those of us who are not into the RGB light show | ripjaw hacksaw aorus elite novelty stuff that gamers are so into. | Benchmarks in the community are almost universally focused on | gaming performance and FPS. | | I want to build an epic rig that will last a long time with | professional grade hardware (with ECC memory for instance) and | would love to get a lot of the bleeding-edge stuff without | compromising on durability. Where do these people hang out | online? | piinbinary wrote: | The level1techs forums seems to have a lot of people with | similar interests | greggyb wrote: | STH: https://www.youtube.com/user/ServeTheHomeVideo | https://www.servethehome.com/ | | GamersNexus (despite the name, they include a good amount of | non-gaming benchmarks, and they have great content on cases and | cooling): https://www.youtube.com/user/GamersNexus | https://www.gamersnexus.net/ | | Level1Techs (mentioned in another reply): | https://www.youtube.com/c/Level1Techs | https://www.level1techs.com/ | | r/homelab (and all the subreddits listed in its sidebar): | https://www.reddit.com/r/homelab/ | | Even LinusTechTips has some decent content for server hardware, | though they stay fairly superficial. And the forum definitely | has people who can help out: https://linustechtips.com/ | | And the thing is, depending on what metric you judge | performance by, the enthusiast hardware may very well | outperform the server hardware. For something that is sensitive | to memory, e.g., you can get much faster RAM in enthusiast SKUs | (https://www.crucial.com/memory/ddr4/BLM2K8G51C19U4B) than | you'll find in server hardware. Similarly, the HEDT SKUs out- | clock the server SKUs for both Intel and AMD. 
| | I have a Threadripper system that outperforms most servers I | work with on a daily basis, because most of my workloads, | despite being multi-threaded, are sensitive to clockspeed. | 1996 wrote: | Indeed, serious people now use gamer computer parts because | it's just faster! | greggyb wrote: | It's not "just faster". | | No one's using "gamer NICs" for high speed networking. Top | of the line "gaming" networking is 802.11ax or 10GbE. | 2x200Gb/s NICs are available now. | | Gaming parts are strictly single socket - software that can | take advantage of >64 cores will need server hardware - | either one of the giant Ampere ARM CPUs or a 2+ socket | system. | | If something must run in RAM and needs TB of RAM, well then | it's not even a question of faster or slower. The | capability only exists on server platforms. | | _Some_ workloads will benefit from the performance | characteristics of consumer hardware. | vmception wrote: | > RGB light show ripjaw hacksaw aorus elite novelty stuff | | haha yeah I bought a whole computer from someone and was | wondering why the RAM looked like rupies from Zelda | | apparently that is common now | | but at least I'm not cosplaying as a karate day trader for my | Wall Street Journal expose' | philsnow wrote: | I'm with you on this, I just built a (much more modest than the | article's) workstation/homelab machine a few months ago, to | replace my previous one which was going on 10 years old and | showing its age. | | There's some folks in /r/homelab who are into this kind of | thing, and I used their advice a fair bit in my build. While it | is kind of mixed (there's a lot of people who build pi clusters | as their homelab), there's still plenty of people who buy | decommissioned "enterprise" hardware and make monstrous-for- | home-use things. | deagle50 wrote: | Happy to help if you want feedback. Servethehome forums are | also a great resource of info and used hardware, probably the | best community for your needs. | arminiusreturns wrote: | Check out HardForum. Lots of very knowledgable people on there | helped me mature my hardware level knowledge. Back when I was | building 4 cpu, 64 core opteron systems. Also decent banter. | tanelpoder wrote: | Thanks! In case you're interested in building a ThreadRipper | Pro WX-based system like mine, then AMD apparently starts | selling the CPUs independently from March 2021 onwards: | | https://www.anandtech.com/show/16396/the-amd-wrx80-chipset-a... | | Previously you could only get this CPU when buying the Lenovo | ThinkStation P620 machine. I'm pretty happy with Lenovo | Thinkstations though (I bought a P920 with dual Xeons 2.5 years | ago) | ksec wrote: | And just in time article | | https://www.anandtech.com/show/16462/hands-on-with-the- | asus-... | | I guess I should submit this on HN as well. | | Edit: I was getting too ahead of myself I thought these are | for TR Pro with Zen 3. Turns out they are not out yet. | zhdc1 wrote: | Look at purchasing used enterprise hardware. You can buy a | reliable x9 or X10 generation supermicro server (rack or tower) | for around a couple of hundred. | ashkankiani wrote: | I've been planning to do this, but enterprise hardware seems | like it requires a completely different set of knowledge on | how to purchase it and maintain it, and especially as a | consumer. | | It's not quite as trivial of a barrier to entry as consumer | desktops, but I suppose that's the point. Still, it would be | nice if there was a guide that could help me make good | decisions to start. 
| jqcoffey wrote:
| Also, purpose-built data center chassis are designed for
| high airflow and are thus really quite loud.
| modoc wrote:
| Very true. I have a single rack mount server in my HVAC
| room, and it's still so loud I had to glue soundproofing
| foam on the nearby walls :)
| benlwalker wrote:
| Plug for a post I wrote a few years ago demonstrating nearly the
| same result but using only a single CPU core:
| https://spdk.io/news/2019/05/06/nvme/
|
| This is using SPDK to eliminate all of the overhead the author
| identified. The hardware is far more capable than most people
| expect, if the software would just get out of the way.
| tanelpoder wrote:
| Yes, I had seen that one (even more impressive!)
|
| When I have more time again, I'll run fio with the SPDK plugin
| on my kit too. And I'd be interested in seeing what happens
| when doing 512B random I/Os.
| benlwalker wrote:
| The system that was tested there was PCIe bandwidth
| constrained because this was a few years ago. With your
| system, it'll get a bigger number - probably 14 or 15 million
| 4KiB IO per second per core.
|
| But while SPDK does have an fio plug-in, unfortunately you
| won't see numbers like that with fio. There's way too much
| overhead in the tool itself. We can't get beyond 3 to 4
| million with that. We rolled our own benchmarking tool in
| SPDK so we can actually measure the software we produce.
|
| Since the core is CPU bound, 512B IO are going to net the
| same IO per second as 4k. The software overhead in SPDK is
| fixed per IO, regardless of size. You can also run more
| threads with SPDK than just one - it has no locks or
| cross-thread communication, so it scales linearly with
| additional threads. You can push systems to 80-100M IO per
| second if you have disks and bandwidth that can handle it.
| StillBored wrote:
| Yah, this has been going on for a while. Before SPDK it was
| done with custom kernel bypasses and fast InfiniBand/FC
| arrays. I was involved with a similar project in the early
| 2000s, where at the time the bottleneck was the shared Xeon
| bus, and then it moved to the PCIe bus with Opterons/Nehalem+.
| In our case we ended up spending a lot of time tuning the
| application to avoid cross-socket communication as well,
| since that could become a big deal (of course after careful
| card placement).
|
| But SPDK has a problem you don't have with bypasses and
| io_uring, in that it needs the IOMMU enabled, and that can
| itself become a bottleneck. There are also issues for some
| applications that want to use interrupts rather than poll
| everything.
|
| What's really nice about io_uring is that it sort of
| standardizes a large part of what people were doing with
| bypasses.
| tanelpoder wrote:
| Yeah, that's what I wondered - I'm ok with using multiple
| cores; would I get even more IOPS when doing smaller I/Os?
| Is the benchmark suite you used part of the SPDK toolkit
| (and easy enough to run)?
| benlwalker wrote:
| Whether you get more IOPS with smaller I/Os depends on a
| number of things. Most drives these days are natively 4KiB
| blocks and are emulating 512B sectors for backward
| compatibility. This emulation means that 512B writes are
| often quite slow - probably slower than writing 4KiB
| (with 4KiB alignment). But 512B reads are typically very
| fast. On Optane drives this may not be true because the
| media works entirely differently - those may be able to
| do native 512B writes. Talk to the device vendor to get
| the real answer.
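|
| (A rough way to check this yourself, assuming nvme-cli is
| installed and /dev/nvme0n1 is the drive in question, is to ask
| the drive which LBA formats it supports and which one is in
| use:
|
| nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"
|
| If a native 4KiB format is listed, nvme format --lbaf=<n> can
| switch to it - note that a low-level format wipes the drive.)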
|
| For at least reads, if you don't hit a CPU limit you'll
| get 8x more IOPS with 512B than you will with 4KiB with
| SPDK. It's more or less perfect scaling. There are some
| additional hardware overheads in the MMU and PCIe
| subsystems with 512B because you're sending more messages
| for the same bandwidth, but my experience has been that
| it is mostly negligible.
|
| The benchmark builds to build/examples/perf and you can
| just run it with -h to get the help output. Random 4KiB
| reads at 32 QD to all available NVMe devices (all devices
| unbound from the kernel and rebound to vfio-pci) for 60
| seconds would be something like:
|
| perf -q 32 -o 4096 -w randread -t 60
|
| You can restrict the test to specific devices with the -r
| parameter (by BUS:DEVICE:FUNCTION essentially). The tool
| can also benchmark kernel devices. Using -R will turn on
| io_uring (otherwise it uses libaio), and you simply list
| the block devices on the command line after the base
| options like this:
|
| perf -q 32 -o 4096 -w randread -t 60 -R /dev/nvme0n1
|
| You can get hold of help from the SPDK community at
| https://spdk.io/community. There will be lots of people
| willing to help.
|
| Excellent post by the way. I really enjoyed it.
| tanelpoder wrote:
| Thanks! Will add this to my TODO list too.
| rektide wrote:
| Nice follow-up @tanelpoder to "RAM is the new disk" (2015)[1],
| which we talked about not even two weeks ago!
|
| I was quite surprised to hear in that thread that AMD's
| Infinity Fabric was so oversubscribed. There's 256 GB/s of PCIe
| on a 1P system, but it seems like this 66 GB/s is all the
| fabric can do. A little under a 4:1 oversubscription!
|
| [1] https://news.ycombinator.com/item?id=25863093
| electricshampo1 wrote:
| 66 GB/s is from each of 10 drives doing ~6.6 GB/s; I don't
| think the Infinity Fabric is the limiter here
| [deleted]
| muro wrote:
| This article was great, thanks for sharing!
|
| Does anyone have advice on optimizing a Windows 10 system? I
| have a Haswell workstation (E5-1680 v3) that I find reasonably
| fast and that works very well under Linux. In Windows, I get
| lost. I tried to run the UserBenchmark suite, which told me I'm
| below median for most of my components. Is there any good
| advice on how to improve that? Which tools give good insight
| into what the machine is doing under Windows? I'd like to first
| try to optimize what I have before upgrading to the new shiny
| :).
| RobLach wrote:
| Excellent article. Worth a read even if you're not maxing IO.
| wiradikusuma wrote:
| I've been thinking about this. Would traditional co-location
| (e.g. 2x 2U from Dell) in a local data center be cheaper if,
| e.g., you're serving a local (country-wise) market?
| derefr wrote:
| Depends on how long you need the server, and the ownership
| model you've chosen to pursue for it.
|
| If you _purchase_ a server and stick it in a co-lo somewhere,
| and your business plans to exist for 10+ years -- well, is that
| server still going to be powering your business 10 years from
| now? Or will you have moved its workloads to something newer?
| If so, you'll probably want to decommission and sell the
| server at some point. The time required to deal with that might
| not be worth the labor costs of your highly-paid engineers.
| Which means you might not actually end up re-capturing the
| depreciated value of the server, but instead will just let it
| rot on the shelf, or dispose of it as e-waste.
|
| Hardware _leasing_ is a lot simpler.
Whether you lease servers | from an OEM like Dell, there 's a quick, well-known path to | getting the EOLed hardware shipped back to Dell and the | depreciated value paid back out to you. | | And, of course, hardware _renting_ is simpler still. Renting | the hardware of the co-lo (i.e. "bare-metal unmanaged server" | hosting plans) means never having to worry about the CapEx of | the hardware in the first place. You just walk away at the end | of your term. But, of course, that's when you start paying | premiums on top of the hardware. | | Renting VMs, then, is like renting hardware on a micro-scale; | you never have to think about what you're running on, as -- | presuming your workload isn't welded to particular machine | features like GPUs or local SSDs -- you'll tend to | automatically get migrated to newer hypervisor hardware | generations as they become available. | | When you work it out in terms of "ten years of ops-staff labor | costs of dealing with generational migrations and sell-offs" | vs. "ten years of premiums charged by hosting rentiers", the | pricing is surprisingly comparable. (In fact, this is basically | the math hosting providers use to figure out what they _can_ | charge without scaring away their large enterprise customers, | who are fully capable of taking a better deal if there is one.) | rodgerd wrote: | > If you purchase a server and stick it in a co-lo somewhere, | and your business plans to exist for 10+ years -- well, is | that server still going to be powering your business 10 years | from now? Or will you have moved its workloads to something | newer? | | Which, if you have even the remotest fiscal competence, | you'll have funded by using the depreciation of the book | value of the asset after 3 years. | 37ef_ced3 wrote: | Somebody please tell me how many ResNet50 inferences you can do | per second on one of these chips | | Here is the standalone AVX-512 ResNet50 code (C99 .h and .c | files): | | https://nn-512.com/browse/ResNet50 | | Oops, AMD doesn't support AVX-512 yet. Even Zen 3? Incredible | wyldfire wrote: | Whoa, this code looks interesting. Must've been emitted by | something higher-level? Something like PyTorch/TF/MLIR/TVM/Glow | maybe? | | If that is the case, then maybe it could be emitted again while | masking the instruction sets Ryzen doesn't support yet. | tanelpoder wrote: | You mean on the CPU, right? This CPU doesn't support AVX-512: | $ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ | /\n/g' | egrep "avx|sse|popcnt" | sort | uniq avx | avx2 misalignsse popcnt sse sse2 | sse4_1 sse4_2 sse4a ssse3 | | What compile/build options should I use? | 37ef_ced3 wrote: | No AVX-512, forget it then | xxpor wrote: | They don't have avx512 instructions. | qaq wrote: | Now honestly say for how long two boxes like this behind a load | balancer would be more than enough for your startup. | pbalcer wrote: | What I find interesting about the performance of this type of | hardware is how it affects the software we are using for storage. | The article talked about how the Linux kernel just can't keep up, | but what about databases or kv stores. Are the trade-offs those | types of solutions make still valid for this type of hardware? | | RocksDB, and LSM algorithms in general, seem to be designed with | the assumption that random block I/O is slow. It appears that, | for modern hardware, that assumption no longer holds, and the | software only slows things down [0]. | | [0] - | https://github.com/BLepers/KVell/blob/master/sosp19-final40.... 
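|
| (A rough way to test that assumption on your own hardware is to
| compare sequential and random 4k reads with fio - a sketch
| only, with the device path and queue depth as placeholders,
| and note that it reads the raw block device:
|
| fio --name=seq --filename=/dev/nvme0n1 --direct=1
| --ioengine=io_uring --rw=read --bs=4k --iodepth=32
| --runtime=30 --time_based
|
| fio --name=rand --filename=/dev/nvme0n1 --direct=1
| --ioengine=io_uring --rw=randread --bs=4k --iodepth=32
| --runtime=30 --time_based
|
| If the two results come out close, the "random block I/O is
| slow" assumption doesn't hold for that device.)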
| ddorian43 wrote:
| Disappointed there was no lmdb comparison in there.
| tyingq wrote:
| A paper on making LSM more SSD friendly:
| https://users.cs.duke.edu/~rvt/ICDE_2017_CameraReady_427.pdf
| pbalcer wrote:
| Thanks for sharing this article - I found it very insightful.
| I've seen similar ideas being floated around before, and they
| often seem to focus on what software can be added on top of
| an already fairly complex solution (while LSM can appear to
| be conceptually simple, its implementations are anything
| but).
|
| To me, what the original article shows is an opportunity to
| remove - not add.
| jeffbee wrote:
| If you think about it from the perspective of the authors of
| large-scale databases, linear access is still a lot cheaper
| than random access in a datacenter filesystem.
| AtlasBarfed wrote:
| ScyllaDB had a blog post once about how surprisingly little
| CPU time is available to process packets on modern top-speed
| networks such as 40 Gbit/s.
|
| I can't find it now. I think they were trying to say that
| Cassandra can't keep up because of the JVM overhead and you
| need to be close to the metal for extreme performance.
|
| This is similar. Huge amounts of I/O flooding in from modern
| PCIe SSDs really close the traditional gap between CPU and
| "disk".
|
| The biggest limiter in the cloud right now is EBS/SAN. Sure,
| you can use local storage in AWS if you don't mind it
| disappearing, but while gp3 is an improvement, it pales in
| comparison to stuff like this.
|
| Also, this is fascinating:
|
| "Take the write speeds with a grain of salt, as TLC & QLC cards
| have slower multi-bit writes into the main NAND area, but may
| have some DIMM memory for buffering writes and/or a "TurboWrite
| buffer" (as Samsung calls it) that uses part of the SSDs NAND
| as faster SLC storage. It's done by issuing single-bit
| "SLC-like" writes into TLC area. So, once you've filled up the
| "SLC" TurboWrite buffer at 5000 MB/s, you'll be bottlenecked by
| the TLC "main area" at 2000 MB/s (on the 1 TB disks)."
|
| I didn't know controllers could swap between TLC/QLC and SLC.
| tanelpoder wrote:
| I learned the last bit from here (Samsung Solid State Drive
| TurboWrite Technology pdf):
|
| https://images-eu.ssl-images-amazon.com/images/I/914ckzwNMpS...
| StillBored wrote:
| Yes, a number of articles about these newer TLC drives talk
| about it. The end result is that an empty drive is going to
| benchmark considerably differently from one 99% full of
| incompressible files.
|
| for example:
|
| https://www.tomshardware.com/uk/reviews/intel-ssd-660p-qlc-n...
| 1996 wrote:
| > I didn't know controllers could swap between TLC/QLC and
| SLC.
|
| I wish I could control the % of SLC. Even dividing a QLC
| space by 16 makes it cheaper than buying a similarly sized
| SLC
| 1MachineElf wrote:
| Reminds me of the Solid-State Drive checkbox that VirtualBox
| has for any VM disks. Checking it will make sure that the VM
| hardware emulation doesn't wait for the filesystem journal to
| be written, which would normally be advisable with spinning
| disks.
| digikata wrote:
| Not only the assumptions at the application layer, but
| potentially the filesystem too.
| [deleted]
| bob1029 wrote:
| I have personally found that making even the most primitive
| efforts at the single-writer principle and batching IO in your
| software can make many orders of magnitude difference.
|
| Saturating an NVMe drive with a single x86 thread is trivial if
| you change how you play the game.
Using async/await and | yielding to the OS is not going to cut it anymore. Latency with | these drives is measured in microseconds. You are better off | doing microbatches of writes (10-1000 uS wide) and pushing | these to disk with a single thread that monitors a queue in a | busy wait loop (sort of like LMAX Disruptor but even more | aggressive). | | Thinking about high core count parts, sacrificing an entire | thread to busy waiting so you can write your transactions to | disk very quickly is not a terrible prospect anymore. This same | ideology is also really useful for ultra-precise execution of | future timed actions. Approaches in managed lanaguages like | Task.Delay or even Thread.Sleep are insanely inaccurate by | comparison. The humble while(true) loop is certainly not energy | efficient, but it is very responsive and predictable as long as | you dont ever yield. What's one core when you have 63 more to | go around? | pbalcer wrote: | The authors of the article I linked to earlier came to the | same conclusions. And so did the SPDK folks. And the kernel | community (or axboe :)) when coming up with io_uring. I'm | just hoping that we will see software catching up. | mikepurvis wrote: | Isn't the use or non-use of async/await a bit orthogonal to | the rest of this? | | I'm not an expert in this area, but wouldn't it be just as | lightweight to have your async workers pushing onto a queue, | and then have your async writer only wake up when the queue | is at a certain level to create the batched write? Either | way, you won't be paying the OS context switching costs | associated with blocking a write thread, which I think is | most of what you're trying to get out of here. | pbalcer wrote: | Right, I agree. I'd go even further and say that | async/await is a great fit for a modern _asynchronous_ I /O | stack (not read()/write()). Especially with io_uring using | polled I/O (the worker thread is in the kernel, all the | async runtime has to do is check for completion | periodically), or with SPDK if you spin up your own I/O | worker thread(s) like @benlwalker explained elsewhere in | the thread. | tyingq wrote: | I wonder if "huge pages" would make a difference, since some of | the bottlenecks seemed to be lock contention on memory pages. | tanelpoder wrote: | Linux pagecache doesn't use hugepages, but definitely when | doing direct I/O into application buffers, would make sense to | use hugepages for that. I plan to run tests on various database | engines next - and many of them support using hugepages (for | shared memory areas at least). | guerby wrote: | In the networking world (DPDK) huge pages and static pinning | everything is a huge deal as you have very few cpu cycles per | network packet. | tanelpoder wrote: | Yep - and there's SPDK for direct NVMe storage access | without going through the Linux block layer: | https://spdk.io | | (it's in my TODO list too) | tyingq wrote: | Thanks! Apparently, they did add it for tmpfs, and discussed | it for ext4. https://lwn.net/Articles/718102/ | tanelpoder wrote: | Good point - something to test, once I get to the | filesystem benchmarks! | tyingq wrote: | I'm somewhat curious what happens to the long standing 4P/4U | servers from companies like Dell and HP. The Ryzen/EPYC has | really made going past 2P/2U a more rare need. | thinkingkong wrote: | You might be able to buy a smaller server but the rack density | doesnt necessarily change. You still have to worry about | cooling and power so lots of DCs would have 1/4 or 1/2 racks. 
| tyingq wrote: | Sure. I wasn't really thinking of density, just the | interesting start of the "death" of 4 socket servers. Being | an old-timer, it's interesting to me because "typical | database server" has been synonymous with 4P/4U for a long, | long time. | vinay_ys wrote: | I haven't seen a 4 socket machine in a long time. | wtallis wrote: | I think at this point the only reasons to go beyond 2U are to | make room for either 3.5" hard drives, or GPUs. | rektide wrote: | Would love to see some very dense blade style ryzen | offerings. The 4 2P nodes in 2U is great. Good way to share | some power supies, fan, chassis, ideally multi-home nic too. | | Turn those sleds into blades though, put em on their side, & | go even denser. It should be a way to save costs, but density | alas is a huge upsell, even though it should be a way to | scale costs down. | tanelpoder wrote: | Indeed, 128 EPYC cores in 2 sockets (with total 16 memory | channels) will give a lot of power. I guess it's worth | mentioning that the 64-core chips have much lower clock rate | than 16/32 core ones though. And with some expensive software | that's licensed by CPU core (Oracle), you'd want faster cores, | but possibly pay a higher NUMA price when going with a single 4 | or 8 sockets machine for your "sacred monolith". | StillBored wrote: | There always seems to be buyers for more exotic high end | hardware. That market has been shrinking and expanding, well | since the first computer, as mainstream machines become more | capable and people discover more uses for large coherent | machines. | | But users of 16 socket machines, will just step down to 4 | socket epyc machines with 512 cores (or whatever). And someone | else will realize that moving their "web scale" cluster from 5k | machines, down to a single machine with 16 sockets results in | lower latency and less cost. (or whatever). | anarazel wrote: | Have you checked if using the fio options (--iodepth_batch_*) to | batch submissions helps? Fio doesn't do that by default, and I | found that that can be a significant benefit. | | Particularly submitting multiple up requests can amortize the | cost of setting the nvme doorbell (the expensive part as far as I | understand it) across multiple requests. | tanelpoder wrote: | I tested various fio options, but didn't notice this one - I'll | check it out! It might explain why I still kept seeing lots of | interrupts raised even though I had enabled the I/O completion | polling instead, with io_uring's --hipri option. | | edit: I ran a quick test with various IO batch sizes and it | didn't make a difference - I guess because thanks to using | io_uring, my bottleneck is not in IO submission, but deeper in | the block IO stack... | wtallis wrote: | I think on recent kernels, using the hipri option doesn't get | you interrupt-free polled IO unless you've configured the | nvme driver to allocate some queues specifically for polled | IO. Since these Samsung drives support 128 queues and you're | only using a 16C/32T processor, you have more than enough for | each drive to have one poll queue and one regular IO queue | allocated to each (virtual) CPU core. | tanelpoder wrote: | That would explain it. Do you recommend any docs/links I | should read about allocating queues for polled IO? | anarazel wrote: | It's terribly documented :(. You need to set the | nvme.poll_queues to the number of queues you want, before | the disks are attached. I.e. 
either at boot, or you need | to set the parameter and then cause the NVMe to be | rescanned (you can do that in sysfs, but I can't | immediately recall the steps with high confidence). | anarazel wrote: | Ah, yes, shell history ftw. Of course you should ensure | no filesystem is mounted or such: | root@awork3:~# echo 4 > | /sys/module/nvme/parameters/poll_queues | root@awork3:~# echo 1 > | /sys/block/nvme1n1/device/reset_controller | root@awork3:~# dmesg -c [749717.253101] nvme | nvme1: 12/0/4 default/read/poll queues | root@awork3:~# echo 8 > | /sys/module/nvme/parameters/poll_queues | root@awork3:~# dmesg -c root@awork3:~# echo 1 > | /sys/block/nvme1n1/device/reset_controller | root@awork3:~# dmesg -c [749736.513102] nvme | nvme1: 8/0/8 default/read/poll queues | tanelpoder wrote: | Thanks for the pointers, I'll bookmark this and will try | it out someday. | anarazel wrote: | > I tested various fio options, but didn't notice this one - | I'll check it out! It might explain why I still kept seeing | lots of interrupts raised even though I had enabled the I/O | completion polling instead, with io_uring's --hipri option. | | I think that should be independent. | | > edit: I ran a quick test with various IO batch sizes and it | didn't make a difference - I guess because thanks to using | io_uring, my bottleneck is not in IO submission, but deeper | in the block IO stack... | | It probably won't get you drastically higher speeds in an | isolated test - but it should help reduce CPU overhead. E.g. | on one of my SSDs fio --ioengine io_uring --rw randread | --filesize 50GB --invalidate=0 --name=test --direct=1 --bs=4k | --numjobs=1 --registerfiles --fixedbufs --gtod_reduce=1 | --iodepth 48 uses about 25% more CPU than when I add | --iodepth_batch_submit=0 --iodepth_batch_complete_max=0. But | the resulting iops are nearly the same as long as there are | enough cycles available. | | This is via filesystem, so ymmv, but the mechanism should be | mostly independent. | tanelpoder wrote: | Author here: This article was intended to explain some modern | hardware bottlenecks (and non-bottlenecks), but unexpectedly | ended up covering a bunch of Linux kernel I/O stack issues as | well :-) AMA | jeffbee wrote: | Great article, I learned! Can you tell me if you looked into | aspects of the NVMe device itself, such as whether it supports | 4K logical blocks instead of 512B? Use `nvme id-ns` to read out | the supported logical block formats. | tanelpoder wrote: | Doesn't seem to support 4k out of the box? Some drives - like | Intel Optane SSDs allow changing this in firmware (and | reformatting) with a manufacturer's utility... | $ lsblk -t /dev/nvme0n1 NAME ALIGNMENT MIN-IO OPT-IO | PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE RA WSAME nvme0n1 | 0 512 0 512 512 0 none 1023 128 0B | $ sudo nvme id-ns -H /dev/nvme0n1 | grep Size LBA | Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - | Relative Performance: 0 Best (in use) | jeffbee wrote: | Thanks for checking. SSD review sites never mention this | important detail. For some reason the Samsung datacenter | SSDs support 4K LBA format, and they are very similar to | the retail SSDs which don't seem to. I have the a retail | 970 Evo that only provides 512. | wtallis wrote: | I just checked my logs, and none of Samsung's consumer | NVMe drives have ever supported sector sizes other than | 512B. They seem to view this feature as part of their | product segmentation strategy. | | Some consumer SSD vendors do enable 4kB LBA support. 
I've | seen it supported on consumer drives from WD, SK hynix | and a variety of brands using Phison or SMI SSD | controllers (including Kingston, Seagate, Corsair, | Sabrent). But I haven't systematically checked to see | which brands consistently support it. | 1996 wrote: | Is it genuine 512? | | As in, what ashift value do you use with zfs? | wtallis wrote: | Regardless of what sector size you configure the SSD to | expose, the drive's flash translation layer still manages | logical to physical mappings at a 4kB granularity, the | underlying media page size is usually on the order of | 16kB, and the erase block size is several MB. So what | ashift value you want to use depends very much on what | kind of tradeoffs you're okay with in terms of different | aspects of performance and write endurance/write | amplification. But for most flash-based SSDs, there's no | reason to set ashift to anything less than 12 | (corresponding to 4kB blocks). | guerby wrote: | Here is an article about nvme-cli tool : | | https://nvmexpress.org/open-source-nvme-management- | utility-n... | | On Samsung SSD 970 EVO 1TB it seems only 512 bytes LBA are | supported: # nvme id-ns /dev/nvme0n1 -n 1 | -H|grep "^LBA Format" LBA Format 0 : Metadata Size: | 0 bytes - Data Size: 512 bytes - Relative Performance: 0 | Best (in use) | rafaelturk wrote: | Thanks for well written article, makes me think about | inefficiencies in our over-hyped cloud environment. | tanelpoder wrote: | Oh yes - and incorrectly configured on-premises systems too! | sitkack wrote: | Could you explain some of your thought processes and | methodologies when approaching problems like this? | | What is your mental model like? How much experimentation do you | do verses reading kernel code? How do you know what questions | to start asking? | | *edit, btw I understand that a response to these questions | could be an entire book, you get the question-space. | tanelpoder wrote: | Good question. I don't ever read kernel code as a starting | point, only if some profiling or tracing tool points me | towards an interesting function or codepath. And interesting | usually is something that takes most CPU in perf output or | some function call with an unusually high latency in ftrace, | bcc/bpftrace script output. Or just a stack trace in a core- | or crashdump. | | As far as mindset goes - I try to apply the developer mindset | to system performance. In other words, I don't use much of | what I call the "old school sysadmin mindset", from a time | where better tooling was not available. I don't use | systemwide utilization or various get/hit ratios for doing | "metric voodoo" of Unix wizards. | | The developer mindset dictates that everything you run is an | application. JVM is an application. Kernel is an application. | Postgres, Oracle are applications. All applications execute | one or more threads that run on CPU or do not run on CPU. | There are only two categories of reasons why a thread does | not run on CPU (is sleeping): The OS put the thread to sleep | (involuntary blocking) or the thread voluntarily wanted to go | to go to sleep (for example, it realized it can't get some | application level lock). | | And you drill down from there. Your OS/system is just a bunch | of threads running on CPU, sleeping and sometimes | communicating with each other. You can _directly_ measure all | of these things easily nowadays with profilers, no need for | metric voodoo. 
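|
| (As a trivial illustration - with <pid> as a placeholder - even
| a one-liner over /proc shows which state every thread of a
| process is in right now:
|
| cat /proc/<pid>/task/*/status | grep '^State:' | sort | uniq -c
|
| R means running or runnable on CPU; S and D are the two kinds
| of sleep, interruptible and uninterruptible.)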
| | I have written my own tools to complement things like perf, | ftrace and BPF stuff - as a consultant I regularly see 10+ | year old Linux versions, etc - and I find sampling thread | states from /proc file system is a really good (and flexible) | starting point for system performance analysis and even some | drilldown - all this without having to install new software | or upgrading to latest kernels. Some of the tools I showed in | my article too: | | https://tanelpoder.com/psnapper & https://0x.tools | | In the end of my post I mentioned that I'll do a webinar | "hacking session" next Thursday, I'll show more how I work | there :-) | vinay_ys wrote: | Very cool rig and benchmark. Kudos. Request: add network io | load to your benchmarking load while nvme io load is running. | tanelpoder wrote: | Thanks, will do in a future article! I could share the disks | out via NFS or iSCSI or something and hammer them from a | remote machine... | PragmaticPulp wrote: | This is a great article. Thanks for writing it up and sharing. | guerby wrote: | 71 GB/s is 568 Gbit/s so you'll need about 3 dual 100 Gbit/s | cards to pump data out at the rate you can read it from the | NVMe drives. | | And ethernet (unless LAN jumbo frames) is about 1.5kByte per | frame (not 4kB). | | One such PC should be able to do 100k simultaneous 5 Mbps HD | streams. | | Testing this would be fun :) | zamadatix wrote: | Mellanox has a 2x200 Gbps NIC these days. Haven't gotten to | play with it yet though. | tanelpoder wrote: | Which NICs would you recommend for me to buy for testing at | least 1x100 Gbps (ideally 200 Gbps?) networking between | this machine (PCIe 4.0) and an Intel Xeon one that I have | with PCIe 3.0. Don't want to spend much money, so the cards | don't need to be too enterprisey, just fast. | | And - do such cards even allow direct "cross" connection | without a switch in between? | drewg123 wrote: | All 100G is enterprisy. | | For a cheap solution, I'd get a pair of used Mellanox | ConnectX4 or Chelsio T6, and a QSFP28 direct attach | copper cable. | zamadatix wrote: | +1 on what the sibling comment said. | | As for directly connecting them absolutely, works great. | Id recommend a cheap DAC off fs.com to connect them in | that case. | drewg123 wrote: | At Netflix, I'm playing with an EPYC 7502P with 16 NVME and | dual 2x100 Mellanox ConnectX6-DX NICs. With hardware kTLS | offload, we're able to serve about 350Gb/s of real customer | traffic. This goes down to about 240Gb/s when using software | kTLS, due to memory bandwidth limits. | | This is all FreeBSD, and is the evolution of the work | described in my talk at the last EuroBSDCon in 2019: | https://papers.freebsd.org/2019/eurobsdcon/gallatin- | numa_opt... | ksec wrote: | >we're able to serve about 350Gb/s of real customer | traffic. | | I still remember the post about breaking 100Gbps barrier, | that was may be in 2016 or 17 ? And wasn't that long ago it | was 200Gbps and if I remember correct it was hitting memory | bandwidth barrier as well. | | And now 350Gbps?! | | So what's next? Wait for DDR5? Or moving to some memory | controller black magic like POWER10? | drewg123 wrote: | Yes, before hardware inline kTLS offload, we were limited | to 200Gb/s or so with Naples. With Rome, its a bit | higher. But hardware inline kTLS with the Mellanox CX6-DX | eliminates memory bandwidth as a bottleneck. | | The current bottleneck is IO related, and its unclear | what the issue is. We're working with the hardware | vendors to try to figure it out. 
We should be getting | about 390Gb/s | ksec wrote: | Oh wow! Cant wait to hear more about it. | tanelpoder wrote: | I should (finally) receive my RTX 3090 card today (PCIe 4.0 | too!), I guess here goes my weekend (and the following | weekends over a couple of years)! | tarasglek wrote: | You should look at cpu usage. There is a good chance all your | interrupts are hitting cpu-0. you can run hwloc to see what | chiplet the pci cards are on and handle interrupts on those | cores. | jeffbee wrote: | Why would that happen with the linux nvme stack that puts a | completion queue on each CPU? | wtallis wrote: | I think that in addition to allocating a queue per CPU, you | need to be able to allocate a MSI(-X) vector per CPU. That | shouldn't be a problem for the Samsung 980 PRO, since it | supports 128 queues and 130 interrupt vectors. | tanelpoder wrote: | Thanks for the "hwloc" tip. I hadn't thought about that. | | I was thinking of doing something like that. Weirdly I got | sustained throughput differences when I killed & restarted | fio. So, if I got 11M IOPS, it stayed at that level until I | killed fio & restarted. If I got 10.8M next, it stayed like | it until I killed & restarted it. | | This makes me think that I'm hitting some PCIe/memory | bottleneck, dependent on process placement (which process | happens to need to move data across infinity fabric due to | accessing data through a "remote" PCIe root complex or | something like that). But then I realized that Zen 2 has a | central IO hub again, so there shouldn't be a "far edge of | I/O" like on current gen Intel CPUs (?) | | But there's definitely some workload placement and | I/O-memory-interrupt affinity that I've wanted to look into. | I could even enable the NUMA-like-mode from BIOS, but again | with Zen 2, the memory access goes through the central | infinity-fabric chip too, I understand, so not sure if | there's any value in trying to achieve memory locality for | individual chiplets on this platform (?) | wtallis wrote: | The PCIe is all on a single IO die, but internally it is | organized into quadrants that can produce some NUMA | effects. So it is probably worth trying out the motherboard | firmware settings to expose your CPU as multiple NUMA | nodes, and using the FIO options to allocate memory only on | the local node, and restricting execution to the right | cores. | tanelpoder wrote: | Yep, I enabled the "numa-like-awareness" in BIOS and ran | a few quick tests to see whether the NUMA-aware | scheduler/NUMA balancing would do the right thing and | migrate processes closer to their memory over time, but | didn't notice any benefit. But yep I haven't manually | locked down the execution and memory placement yet. This | placement may well explain why I saw some ~5% throughput | fluctuations _only if killing & restarting fio_ and not | while the same test was running. | syoc wrote: | I have done some tests on AMD servers and I the Linux | scheduler does a pretty good job. I do however get | noticeable (a couple percent) better performance by | forcing the process to run on the correct numa node. | | Make sure you get as many numa domains as possible in | your BIOS settings. | | I recommend using numactl with the cpu-exclusive and mem- | exclusive flags. I have noticed a slight perfomance drop | when RAM cache fills beyond the sticks local to the cpus | doing work. | | One last comment is that you mentioned interrupts being | "stiped" among CPUs. 
I would recommend pinning the | interrupts from one disk to one numa-local CPU and using | numactl to run fio for that disk on the same CPU. An | additional experiment is to, if you have enough cores, | pin interrupts to CPUs local to disk, but use other cores | on the same numa node for fio. That has been my most | successful setup so far. | ksec wrote: | I just love this article. Especially when the norm is always | about scaling out instead of scaling up. We can have 128 Core | CPU, 2TB Memory, PCI-E 4.0 SSD, ( and soon PCI-E 5.0 ). We | could even fit a _Petabyte_ in 1U for SSD Storage. | | I remember WhatsApp used to operate its _500M_ user with only a | dozen of large FreeBSD boxes. ( Only to be taken apart by | Facebook ) | | So Thank you for raising awareness. Hopefully the pendulum is | swinging back to conceptually simple design. | | >I also have a 380 GB Intel Optane 905P SSD for low latency | writes | | I would love to see that. Although I am waiting for someone to | do a review on the Optane SSD P5800X [1]. Random 4K IOPS up to | 1.5M with lower than 6 _us_ Latency. | | [1] https://www.servethehome.com/new-intel- | optane-p5800x-100-dwp... | texasbigdata wrote: | Second on Optane. | phkahler wrote: | >> I remember WhatsApp used to operate its 500M user with | only a dozen of large FreeBSD boxes. | | With 1TB of RAM you can have 256 bytes for every person on | earth live in memory. With SSD either as virtual memory or | keeping an index in RAM, you can do meaningful work in real | time, probably as fast as the network will allow. | rektide wrote: | Intel killing off prosumer optane 2 weeks ago[1] made me so | so so sad. | | The new P5800X should be sick. | | [1] https://news.ycombinator.com/item?id=25805779 | KaiserPro wrote: | Excellent write up. | | I used to work for a VFX company in 2008. At that point we used | lustre to get high throughput file storage. | | From memory we had something like 20 racks of server/disks to | get a 3-6 gigabyte/s (sustained) throughput on a 300tb | filesystem. | | It is hilarious to think that a 2u box can now theoretically | saturate 2x100gig nics. | qaq wrote: | Would be cool to see pgbench score for this setup | namero999 wrote: | You should be farming Chia on that thing [0] | | Amazing, congrats! | | [0] https://github.com/Chia-Network/chia-blockchain/wiki/FAQ | jayonsoftware1 wrote: | https://www.asus.com/us/Motherboard-Accessories/HYPER-M-2-X1... | vs https://highpoint-tech.com/USA_new/nvme_raid_controllers.htm . | One card is about x10 expensive, but looks like performance is | same. Am I missing some thing. | tanelpoder wrote: | The ASUS one doesn't have its own RAID controller nor PCIe | switch onboard. It relies on the motherboard-provided PCIe | bifurcation and if using hardware RAID, it'd use AMD's built-in | RAID solution (but I'll use software RAID via Linux dm/md). The | HighPoint SSD7500 seems to have a proprietary RAID controller | built in to it and some management/monitoring features too | (it's the "somewhat enterprisey" version) | wtallis wrote: | The HighPoint card doesn't have a hardware RAID controller, | just a PCIe switch and an option ROM providing boot support | for their software RAID. | | PCIe switch chips were affordable in the PCIe 2.0 era when | multi-GPU gaming setups were popular, but Broadcom decided to | price them out of the consumer market for PCIe 3 and later. | tanelpoder wrote: | Ok, thanks, good to know. I misunderstood from their | website. 
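|
| (For reference, the Linux md software RAID mentioned above
| needs no special hardware at all - a sketch only, with the
| device names and count as placeholders:
|
| mdadm --create /dev/md0 --level=0 --raid-devices=10 /dev/nvme[0-9]n1
|
| A RAID-0 stripe like this adds very little CPU overhead; parity
| levels such as RAID-5/6 would cost CPU and write bandwidth.)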
| rektide wrote:
| PCIe switches getting expensive is so the suck.
| qaq wrote:
| Now price this in terms of AWS and marvel at the markup
| speedgoose wrote:
| I'm afraid Jeff Bezos himself couldn't afford such IOs on AWS.
| nwmcsween wrote:
| So Linus was wrong in his rant to Dave about the page cache
| being detrimental on fast devices.
| ogrisel wrote:
| As a nitpicking person, I really like to read a post that does
| not confuse GB/s with GiB/s :)
|
| https://en.wikipedia.org/wiki/Byte#Multiple-byte_units
| ogrisel wrote:
| Actually now I realize that the title and the intro paragraph
| are contradicting each other...
| tanelpoder wrote:
| Yeah, I used the formally incorrect GB in the title when I
| tried to make it look as simple as possible... GiB just
| didn't look as nice in the "marketing copy" :-)
|
| I may have missed using the right unit in some other sections
| too. At least I hope that I've conveyed that there's a
| difference!
___________________________________________________________________
(page generated 2021-01-29 23:00 UTC)