[HN Gopher] Achieving 11M IOPS and 66 GB/S IO on a Single Thread...
       ___________________________________________________________________
        
       Achieving 11M IOPS and 66 GB/S IO on a Single ThreadRipper
       Workstation
        
       Author : tanelpoder
       Score  : 231 points
       Date   : 2021-01-29 12:45 UTC (10 hours ago)
        
 (HTM) web link (tanelpoder.com)
 (TXT) w3m dump (tanelpoder.com)
        
       | secondcoming wrote:
       | > For final tests, I even disabled the frequent gettimeofday
       | system calls that are used for I/O latency measurement
       | 
       | I was knocking up some profiling code and measured the
       | performance of gettimeofday as a proof-of-concept test.
       | 
        | The performance difference between running the test on my
        | personal desktop Linux VM versus running it on a cloud instance
        | Linux VM was quite interesting (the cloud was worse).
       | 
       | I think I read somewhere that cloud instances cannot use the VDSO
       | code path because your app may be moved to a different machine.
       | My recollection of the reason is somewhat cloudy.
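        | 
        | A rough way to check this on a Linux guest (a sketch - the
        | clocksource names vary by hypervisor, and ./my_bench is just a
        | placeholder for whatever program you're profiling):
        | 
        | # a vDSO-capable clocksource (e.g. tsc, kvm-clock) avoids real
        | # syscalls; some paravirt clocksources like xen force a fallback
        | cat /sys/devices/system/clocksource/clocksource0/current_clocksource
        | 
        | # if gettimeofday/clock_gettime show up in the syscall counts at
        | # all, the vDSO fast path is not being used
        | # (./my_bench is a placeholder for your test program)
        | strace -c -e trace=gettimeofday,clock_gettime ./my_bench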
        
       | ashkankiani wrote:
        | When I bought a bunch of NVMe drives, I was disappointed with
        | the maximum speed I could achieve with them, given my knowledge
        | and the time I had available. Thanks for making this post to
        | give me more points of insight into the problem.
       | 
       | I'm on the same page with your thesis that "hardware is fast and
       | clusters are usually overkill," and disk I/O was a piece that I
       | hadn't really figured out yet despite making great strides in the
       | software engineering side of things. I'm trying to make a startup
       | this year and disk I/O will actually be a huge factor in how far
       | I can scale without bursting costs for my application. Good
       | stuff!
        
       | whalesalad wrote:
       | This post is fantastic. I wish there was more workstation porn
       | like this for those of us who are not into the RGB light show
       | ripjaw hacksaw aorus elite novelty stuff that gamers are so into.
       | Benchmarks in the community are almost universally focused on
       | gaming performance and FPS.
       | 
       | I want to build an epic rig that will last a long time with
       | professional grade hardware (with ECC memory for instance) and
       | would love to get a lot of the bleeding-edge stuff without
       | compromising on durability. Where do these people hang out
       | online?
        
         | piinbinary wrote:
          | The Level1Techs forums seem to have a lot of people with
          | similar interests.
        
         | greggyb wrote:
         | STH: https://www.youtube.com/user/ServeTheHomeVideo
         | https://www.servethehome.com/
         | 
         | GamersNexus (despite the name, they include a good amount of
         | non-gaming benchmarks, and they have great content on cases and
         | cooling): https://www.youtube.com/user/GamersNexus
         | https://www.gamersnexus.net/
         | 
         | Level1Techs (mentioned in another reply):
         | https://www.youtube.com/c/Level1Techs
         | https://www.level1techs.com/
         | 
         | r/homelab (and all the subreddits listed in its sidebar):
         | https://www.reddit.com/r/homelab/
         | 
         | Even LinusTechTips has some decent content for server hardware,
         | though they stay fairly superficial. And the forum definitely
         | has people who can help out: https://linustechtips.com/
         | 
         | And the thing is, depending on what metric you judge
         | performance by, the enthusiast hardware may very well
         | outperform the server hardware. For something that is sensitive
         | to memory, e.g., you can get much faster RAM in enthusiast SKUs
         | (https://www.crucial.com/memory/ddr4/BLM2K8G51C19U4B) than
         | you'll find in server hardware. Similarly, the HEDT SKUs out-
         | clock the server SKUs for both Intel and AMD.
         | 
         | I have a Threadripper system that outperforms most servers I
         | work with on a daily basis, because most of my workloads,
         | despite being multi-threaded, are sensitive to clockspeed.
        
           | 1996 wrote:
            | Indeed, serious people now use gamer computer parts because
            | they're just faster!
        
             | greggyb wrote:
             | It's not "just faster".
             | 
             | No one's using "gamer NICs" for high speed networking. Top
             | of the line "gaming" networking is 802.11ax or 10GbE.
             | 2x200Gb/s NICs are available now.
             | 
             | Gaming parts are strictly single socket - software that can
             | take advantage of >64 cores will need server hardware -
             | either one of the giant Ampere ARM CPUs or a 2+ socket
             | system.
             | 
             | If something must run in RAM and needs TB of RAM, well then
             | it's not even a question of faster or slower. The
             | capability only exists on server platforms.
             | 
             |  _Some_ workloads will benefit from the performance
             | characteristics of consumer hardware.
        
         | vmception wrote:
         | > RGB light show ripjaw hacksaw aorus elite novelty stuff
         | 
          | haha yeah I bought a whole computer from someone and was
          | wondering why the RAM looked like rupees from Zelda
          | 
          | apparently that is common now
          | 
          | but at least I'm not cosplaying as a karate day trader for my
          | Wall Street Journal exposé
        
         | philsnow wrote:
         | I'm with you on this, I just built a (much more modest than the
         | article's) workstation/homelab machine a few months ago, to
         | replace my previous one which was going on 10 years old and
         | showing its age.
         | 
          | There are some folks in /r/homelab who are into this kind of
         | thing, and I used their advice a fair bit in my build. While it
         | is kind of mixed (there's a lot of people who build pi clusters
         | as their homelab), there's still plenty of people who buy
         | decommissioned "enterprise" hardware and make monstrous-for-
         | home-use things.
        
         | deagle50 wrote:
         | Happy to help if you want feedback. Servethehome forums are
          | also a great source of info and used hardware, probably the
         | best community for your needs.
        
         | arminiusreturns wrote:
          | Check out HardForum. Lots of very knowledgeable people on
          | there helped me mature my hardware-level knowledge back when
          | I was building 4-CPU, 64-core Opteron systems. Also decent
          | banter.
        
         | tanelpoder wrote:
         | Thanks! In case you're interested in building a ThreadRipper
         | Pro WX-based system like mine, then AMD apparently starts
         | selling the CPUs independently from March 2021 onwards:
         | 
         | https://www.anandtech.com/show/16396/the-amd-wrx80-chipset-a...
         | 
         | Previously you could only get this CPU when buying the Lenovo
          | ThinkStation P620 machine. I'm pretty happy with Lenovo
          | ThinkStations though (I bought a P920 with dual Xeons 2.5
          | years ago).
        
           | ksec wrote:
            | And a just-in-time article:
            | 
            | https://www.anandtech.com/show/16462/hands-on-with-the-
            | asus-...
            | 
            | I guess I should submit this on HN as well.
            | 
            | Edit: I was getting ahead of myself - I thought these were
            | for TR Pro with Zen 3. Turns out they are not out yet.
        
         | zhdc1 wrote:
          | Look at purchasing used enterprise hardware. You can buy a
          | reliable X9 or X10 generation Supermicro server (rack or
          | tower) for a couple hundred dollars.
        
           | ashkankiani wrote:
            | I've been planning to do this, but enterprise hardware
            | seems to require a completely different set of knowledge on
            | how to purchase and maintain it, especially as a consumer.
            | 
            | The barrier to entry isn't quite as low as with consumer
            | desktops, but I suppose that's the point. Still, it would
            | be nice if there were a guide that could help me make good
            | decisions to start.
        
             | jqcoffey wrote:
             | Also, purpose built data center chassis are designed for
             | high airflow and are thus really quite loud.
        
               | modoc wrote:
               | Very true. I have a single rack mount server in my HVAC
               | room, and it's still so loud I had to glue soundproofing
               | foam on the nearby walls:)
        
       | benlwalker wrote:
       | Plug for a post I wrote a few years ago demonstrating nearly the
       | same result but using only a single CPU core:
       | https://spdk.io/news/2019/05/06/nvme/
       | 
       | This is using SPDK to eliminate all of the overhead the author
       | identified. The hardware is far more capable than most people
       | expect, if the software would just get out of the way.
        
         | tanelpoder wrote:
          | Yes, I had seen that one (even more impressive!)
          | 
          | When I have more time again, I'll run fio with the SPDK
          | plugin on my kit too. I'd also be interested in seeing what
          | happens when doing 512B random I/Os.
        
           | benlwalker wrote:
           | The system that was tested there was PCIe bandwidth
           | constrained because this was a few years ago. With your
           | system, it'll get a bigger number - probably 14 or 15 million
           | 4KiB IO per second per core.
           | 
           | But while SPDK does have an fio plug-in, unfortunately you
           | won't see numbers like that with fio. There's way too much
           | overhead in the tool itself. We can't get beyond 3 to 4
           | million with that. We rolled our own benchmarking tool in
           | SPDK so we can actually measure the software we produce.
           | 
           | Since the core is CPU bound, 512B IO are going to net the
           | same IO per second as 4k. The software overhead in SPDK is
           | fixed per IO, regardless of size. You can also run more
           | threads with SPDK than just one - it has no locks or cross
           | thread communication so it scales linearly with additional
           | threads. You can push systems to 80-100M IO per second if you
           | have disks and bandwidth that can handle it.
        
             | StillBored wrote:
              | Yah, this has been going on for a while. Before SPDK it
              | was done with custom kernel bypasses and fast
              | infiniband/FC arrays. I was involved with a similar
              | project in the early 2000s, where at the time the
              | bottleneck was the shared Xeon bus, and then it moved to
              | the PCIe bus with Opterons/Nehalem+. In our case we ended
              | up spending a lot of time tuning the application to avoid
              | cross-socket communication as well, since that could
              | become a big deal (of course after careful card
              | placement).
              | 
              | But SPDK has a problem you don't have with bypasses and
              | io_uring, in that it needs the IOMMU enabled, and that
              | can itself become a bottleneck. There are also issues for
              | some applications that want to use interrupts rather than
              | poll everything.
              | 
              | What's really nice about io_uring is that it sort of
              | standardizes a large part of what people were doing with
              | bypasses.
        
             | tanelpoder wrote:
              | Yeah, that's what I wondered - I'm OK with using multiple
              | cores, so would I get even more IOPS when doing smaller
              | I/Os? Is the benchmark suite you used part of the SPDK
              | toolkit (and easy enough to run)?
        
               | benlwalker wrote:
               | Whether you get more IOPs with smaller I/Os depends on a
               | number of things. Most drives these days are natively
               | 4KiB blocks and are emulating 512B sectors for backward
               | compatibility. This emulation means that 512B writes are
               | often quite slow - probably slower than writing 4KiB
               | (with 4KiB alignment). But 512B reads are typically very
               | fast. On Optane drives this may not be true because the
               | media works entirely differently - those may be able to
               | do native 512B writes. Talk to the device vendor to get
               | the real answer.
               | 
               | For at least reads, if you don't hit a CPU limit you'll
               | get 8x more IOPS with 512B than you will with 4KiB with
               | SPDK. It's more or less perfect scaling. There's some
               | additional hardware overheads in the MMU and PCIe
               | subsystems with 512B because you're sending more messages
               | for the same bandwidth, but my experience has been that
               | it is mostly negligible.
               | 
               | The benchmark builds to build/examples/perf and you can
               | just run it with -h to get the help output. Random 4KiB
               | reads at 32 QD to all available NVMe devices (all devices
               | unbound from the kernel and rebound to vfio-pci) for 60
               | seconds would be something like:
               | 
               | perf -q 32 -o 4096 -w randread -t 60
               | 
                | You can restrict the test to specific devices with the
                | -r parameter (by BUS:DEVICE:FUNCTION essentially). The
                | tool can also benchmark kernel devices. Using -R will
                | turn on
               | io_uring (otherwise it uses libaio), and you simply list
               | the block devices on the command line after the base
               | options like this:
               | 
               | perf -q 32 -o 4096 -w randread -t 60 -R /dev/nvme0n1
               | 
                | You can get help from the SPDK community at
                | https://spdk.io/community - there will be lots of
                | people willing to help.
               | 
               | Excellent post by the way. I really enjoyed it.
        
               | tanelpoder wrote:
               | Thanks! Will add this to TODO list too.
        
       | rektide wrote:
        | Nice follow-up @tanelpoder to "RAM is the new disk" (2015)[1],
        | which we talked about not even two weeks ago!
        | 
        | I was quite surprised to hear in that thread that AMD's
        | Infinity Fabric was so oversubscribed. There's 256GBps of PCIe
        | on a 1P system, but it seems like this 66GBps is all the
        | fabric can do. A little under a 4:1 oversubscription!
       | 
       | [1] https://news.ycombinator.com/item?id=25863093
        
         | electricshampo1 wrote:
          | 66GBps is from each of 10 drives doing ~6.6 GBps; I don't
          | think the Infinity Fabric is the limiter here.
        
       | [deleted]
        
       | muro wrote:
       | This article was great, thanks for sharing!
       | 
        | Does anyone have advice on optimizing a Windows 10 system? I
        | have a Haswell workstation (E5-1680 v3) that I find reasonably
        | fast and that works very well under Linux. In Windows, I get
        | lost. I tried to run the UserBenchmark suite, which told me I'm
        | below median for most of my components. Is there any good
        | advice on how to improve that? Which tools give good insight
        | into what the machine is doing under Windows? I'd like to first
        | try to optimize what I have before upgrading to the new shiny
        | :).
        
       | RobLach wrote:
       | Excellent article. Worth a read even if you're not maxing IO.
        
       | wiradikusuma wrote:
        | I've been thinking about this. Would traditional co-location
        | (e.g. 2x 2U from Dell) in a local data center be cheaper if,
        | e.g., you're serving a local (country-level) market?
        
         | derefr wrote:
         | Depends on how long you need the server, and the ownership
         | model you've chosen to pursue for it.
         | 
         | If you _purchase_ a server and stick it in a co-lo somewhere,
         | and your business plans to exist for 10+ years -- well, is that
         | server still going to be powering your business 10 years from
         | now? Or will you have moved its workloads to something newer?
          | If so, you'll probably want to decommission and sell the
         | server at some point. The time required to deal with that might
         | not be worth the labor costs of your highly-paid engineers.
         | Which means you might not actually end up re-capturing the
         | depreciated value of the server, but instead will just let it
         | rot on the shelf, or dispose of it as e-waste.
         | 
          | Hardware _leasing_ is a lot simpler. When you lease servers
          | from an OEM like Dell, there's a quick, well-known path to
          | getting the EOLed hardware shipped back to Dell and the
          | depreciated value paid back out to you.
         | 
         | And, of course, hardware _renting_ is simpler still. Renting
         | the hardware of the co-lo (i.e.  "bare-metal unmanaged server"
         | hosting plans) means never having to worry about the CapEx of
         | the hardware in the first place. You just walk away at the end
         | of your term. But, of course, that's when you start paying
         | premiums on top of the hardware.
         | 
         | Renting VMs, then, is like renting hardware on a micro-scale;
         | you never have to think about what you're running on, as --
         | presuming your workload isn't welded to particular machine
         | features like GPUs or local SSDs -- you'll tend to
         | automatically get migrated to newer hypervisor hardware
         | generations as they become available.
         | 
         | When you work it out in terms of "ten years of ops-staff labor
         | costs of dealing with generational migrations and sell-offs"
         | vs. "ten years of premiums charged by hosting rentiers", the
         | pricing is surprisingly comparable. (In fact, this is basically
         | the math hosting providers use to figure out what they _can_
         | charge without scaring away their large enterprise customers,
         | who are fully capable of taking a better deal if there is one.)
        
           | rodgerd wrote:
           | > If you purchase a server and stick it in a co-lo somewhere,
           | and your business plans to exist for 10+ years -- well, is
           | that server still going to be powering your business 10 years
           | from now? Or will you have moved its workloads to something
           | newer?
           | 
           | Which, if you have even the remotest fiscal competence,
           | you'll have funded by using the depreciation of the book
           | value of the asset after 3 years.
        
       | 37ef_ced3 wrote:
       | Somebody please tell me how many ResNet50 inferences you can do
       | per second on one of these chips
       | 
       | Here is the standalone AVX-512 ResNet50 code (C99 .h and .c
       | files):
       | 
       | https://nn-512.com/browse/ResNet50
       | 
       | Oops, AMD doesn't support AVX-512 yet. Even Zen 3? Incredible
        
         | wyldfire wrote:
         | Whoa, this code looks interesting. Must've been emitted by
         | something higher-level? Something like PyTorch/TF/MLIR/TVM/Glow
         | maybe?
         | 
         | If that is the case, then maybe it could be emitted again while
         | masking the instruction sets Ryzen doesn't support yet.
        
         | tanelpoder wrote:
         | You mean on the CPU, right? This CPU doesn't support AVX-512:
          | $ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
          | avx
          | avx2
          | misalignsse
          | popcnt
          | sse
          | sse2
          | sse4_1
          | sse4_2
          | sse4a
          | ssse3
         | 
         | What compile/build options should I use?
        
           | 37ef_ced3 wrote:
           | No AVX-512, forget it then
        
         | xxpor wrote:
         | They don't have avx512 instructions.
        
       | qaq wrote:
       | Now honestly say for how long two boxes like this behind a load
       | balancer would be more than enough for your startup.
        
       | pbalcer wrote:
       | What I find interesting about the performance of this type of
       | hardware is how it affects the software we are using for storage.
       | The article talked about how the Linux kernel just can't keep up,
        | but what about databases or KV stores? Are the trade-offs those
       | types of solutions make still valid for this type of hardware?
       | 
       | RocksDB, and LSM algorithms in general, seem to be designed with
       | the assumption that random block I/O is slow. It appears that,
       | for modern hardware, that assumption no longer holds, and the
       | software only slows things down [0].
       | 
       | [0] -
       | https://github.com/BLepers/KVell/blob/master/sosp19-final40....
        
         | ddorian43 wrote:
         | Disappointed there was no lmdb comparison in there.
        
         | tyingq wrote:
         | A paper on making LSM more SSD friendly:
         | https://users.cs.duke.edu/~rvt/ICDE_2017_CameraReady_427.pdf
        
           | pbalcer wrote:
           | Thanks for sharing this article - I found it very insightful.
           | I've seen similar ideas being floated around before, and they
           | often seem to focus on what software can be added on top of
           | an already fairly complex solution (while LSM can appear to
           | be conceptually simple, its implementations are anything
           | but).
           | 
           | To me, what the original article shows is an opportunity to
           | remove - not add.
        
         | jeffbee wrote:
         | If you think about it from the perspective of the authors of
         | large-scale databases, linear access is still a lot cheaper
         | than random access in a datacenter filesystem.
        
         | AtlasBarfed wrote:
          | ScyllaDB had a blog post once about how surprisingly little
          | CPU time is available to process packets on the fastest
          | modern networks, like 40Gbit and beyond.
          | 
          | I can't find it now. I think they were trying to say that
          | Cassandra can't keep up because of the JVM overhead and you
          | need to be close to the metal for extreme performance.
          | 
          | This is similar. Huge amounts of flooding I/O from modern
          | PCIe SSDs really closes the traditional gap between CPU and
          | "disk".
         | 
         | The biggest limiter in cloud right now is the EBS/SAN. Sure you
         | can use local storage in AWS if you don't mind it disappearing,
         | but while gp3 is an improvement, it pales to stuff like this.
         | 
         | Also, this is fascinating:
         | 
         | "Take the write speeds with a grain of salt, as TLC & QLC cards
         | have slower multi-bit writes into the main NAND area, but may
         | have some DIMM memory for buffering writes and/or a "TurboWrite
         | buffer" (as Samsung calls it) that uses part of the SSDs NAND
         | as faster SLC storage. It's done by issuing single-bit "SLC-
         | like" writes into TLC area. So, once you've filled up the "SLC"
         | TurboWrite buffer at 5000 MB/s, you'll be bottlenecked by the
         | TLC "main area" at 2000 MB/s (on the 1 TB disks)."
         | 
         | I didn't know controllers could swap between TLC/QLC and SLC.
        
           | tanelpoder wrote:
           | I learned the last bit from here (Samsung Solid State Drive
           | TurboWrite Technology pdf):
           | 
           | https://images-eu.ssl-images-
           | amazon.com/images/I/914ckzwNMpS...
        
           | StillBored wrote:
            | Yes, a number of articles about these newer TLC drives talk
            | about it. The end result is that an empty drive is going to
            | benchmark considerably differently from one that is 99%
            | full of incompressible files.
           | 
           | for example:
           | 
           | https://www.tomshardware.com/uk/reviews/intel-
           | ssd-660p-qlc-n...
        
           | 1996 wrote:
           | > I didn't know controllers could swap between TLC/QLC and
           | SLC.
           | 
           | I wish I could control the % of SLC. Even dividing a QLC
           | space by 16 makes it cheaper than buying a similarly sized
           | SLC
        
         | 1MachineElf wrote:
         | Reminds me of the Solid-State Drive checkbox that VirtualBox
         | has for any VM disks. Checking it will make sure that the VM
         | hardware emulation doesn't wait for the filesystem journal to
         | be written, which would normally be advisable with spinning
         | disks.
        
         | digikata wrote:
         | Not only the assumptions at the application layer, but
         | potentially the filesystem too.
        
         | [deleted]
        
         | bob1029 wrote:
         | I have personally found that making even the most primitive
         | efforts at single-writer principle and batching IO in your
         | software can make many orders of magnitude difference.
         | 
         | Saturating an NVMe drive with a single x86 thread is trivial if
         | you change how you play the game. Using async/await and
         | yielding to the OS is not going to cut it anymore. Latency with
         | these drives is measured in microseconds. You are better off
         | doing microbatches of writes (10-1000 uS wide) and pushing
         | these to disk with a single thread that monitors a queue in a
         | busy wait loop (sort of like LMAX Disruptor but even more
         | aggressive).
         | 
         | Thinking about high core count parts, sacrificing an entire
         | thread to busy waiting so you can write your transactions to
         | disk very quickly is not a terrible prospect anymore. This same
         | ideology is also really useful for ultra-precise execution of
         | future timed actions. Approaches in managed lanaguages like
         | Task.Delay or even Thread.Sleep are insanely inaccurate by
         | comparison. The humble while(true) loop is certainly not energy
         | efficient, but it is very responsive and predictable as long as
          | you don't ever yield. What's one core when you have 63 more to
         | go around?
        
           | pbalcer wrote:
           | The authors of the article I linked to earlier came to the
           | same conclusions. And so did the SPDK folks. And the kernel
           | community (or axboe :)) when coming up with io_uring. I'm
           | just hoping that we will see software catching up.
        
           | mikepurvis wrote:
           | Isn't the use or non-use of async/await a bit orthogonal to
           | the rest of this?
           | 
           | I'm not an expert in this area, but wouldn't it be just as
           | lightweight to have your async workers pushing onto a queue,
           | and then have your async writer only wake up when the queue
           | is at a certain level to create the batched write? Either
           | way, you won't be paying the OS context switching costs
           | associated with blocking a write thread, which I think is
           | most of what you're trying to get out of here.
        
             | pbalcer wrote:
             | Right, I agree. I'd go even further and say that
              | async/await is a great fit for a modern _asynchronous_ I/O
             | stack (not read()/write()). Especially with io_uring using
             | polled I/O (the worker thread is in the kernel, all the
             | async runtime has to do is check for completion
             | periodically), or with SPDK if you spin up your own I/O
             | worker thread(s) like @benlwalker explained elsewhere in
             | the thread.
        
       | tyingq wrote:
       | I wonder if "huge pages" would make a difference, since some of
       | the bottlenecks seemed to be lock contention on memory pages.
        
         | tanelpoder wrote:
          | Linux pagecache doesn't use hugepages, but when doing direct
          | I/O into application buffers, it definitely makes sense to
          | use hugepages for those. I plan to run tests on various
          | database engines next - and many of them support using
          | hugepages (for shared memory areas at least).
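          | 
          | A minimal sketch of what that could look like with fio
          | (assuming fio's mem=mmaphuge option; depending on the fio
          | build it may also want an explicit hugetlbfs path, e.g.
          | mem=mmaphuge:/mnt/huge/file):
          | 
          | # reserve 2MB hugepages for the I/O buffers
          | echo 1024 | sudo tee /proc/sys/vm/nr_hugepages
          | 
          | # allocate fio's I/O buffers from hugepages
          | # (the job name "hugetest" is arbitrary)
          | fio --name=hugetest --filename=/dev/nvme0n1 --direct=1 \
          |     --ioengine=io_uring --rw=randread --bs=4k --iodepth=32 \
          |     --mem=mmaphuge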
        
           | guerby wrote:
            | In the networking world (DPDK), huge pages and statically
            | pinning everything are a huge deal, as you have very few
            | CPU cycles per network packet.
        
             | tanelpoder wrote:
             | Yep - and there's SPDK for direct NVMe storage access
             | without going through the Linux block layer:
             | https://spdk.io
             | 
             | (it's in my TODO list too)
        
           | tyingq wrote:
           | Thanks! Apparently, they did add it for tmpfs, and discussed
           | it for ext4. https://lwn.net/Articles/718102/
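            | 
            | For tmpfs it's a mount option - a quick sketch (the mount
            | point and size here are arbitrary):
            | 
            | # back this tmpfs with transparent huge pages where possible
            | sudo mkdir -p /mnt/hugetmp
            | sudo mount -t tmpfs -o huge=always,size=8G tmpfs /mnt/hugetmp
            | 
            | # check shmem THP usage after writing some files there
            | grep ShmemHugePages /proc/meminfo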
        
             | tanelpoder wrote:
             | Good point - something to test, once I get to the
             | filesystem benchmarks!
        
       | tyingq wrote:
       | I'm somewhat curious what happens to the long standing 4P/4U
        | servers from companies like Dell and HP. Ryzen/EPYC has really
        | made going past 2P/2U a rarer need.
        
         | thinkingkong wrote:
          | You might be able to buy a smaller server, but the rack
          | density doesn't necessarily change. You still have to worry
          | about cooling and power, so lots of DCs would have 1/4 or 1/2
          | racks.
        
           | tyingq wrote:
           | Sure. I wasn't really thinking of density, just the
           | interesting start of the "death" of 4 socket servers. Being
           | an old-timer, it's interesting to me because "typical
           | database server" has been synonymous with 4P/4U for a long,
           | long time.
        
             | vinay_ys wrote:
             | I haven't seen a 4 socket machine in a long time.
        
         | wtallis wrote:
         | I think at this point the only reasons to go beyond 2U are to
         | make room for either 3.5" hard drives, or GPUs.
        
           | rektide wrote:
            | Would love to see some very dense blade-style Ryzen
            | offerings. The 4x 2P nodes in 2U format is great - a good
            | way to share power supplies, fans, chassis, and ideally a
            | multi-host NIC too.
            | 
            | Turn those sleds into blades though, put 'em on their side,
            | and go even denser. It should be a way to save costs, but
            | density alas is a huge upsell, even though it should be a
            | way to scale costs down.
        
         | tanelpoder wrote:
         | Indeed, 128 EPYC cores in 2 sockets (with total 16 memory
         | channels) will give a lot of power. I guess it's worth
         | mentioning that the 64-core chips have much lower clock rate
         | than 16/32 core ones though. And with some expensive software
         | that's licensed by CPU core (Oracle), you'd want faster cores,
          | but possibly pay a higher NUMA price when going with a single
          | 4- or 8-socket machine for your "sacred monolith".
        
         | StillBored wrote:
          | There always seem to be buyers for more exotic high-end
          | hardware. That market has been shrinking and expanding, well,
         | since the first computer, as mainstream machines become more
         | capable and people discover more uses for large coherent
         | machines.
         | 
         | But users of 16 socket machines, will just step down to 4
         | socket epyc machines with 512 cores (or whatever). And someone
         | else will realize that moving their "web scale" cluster from 5k
         | machines, down to a single machine with 16 sockets results in
         | lower latency and less cost. (or whatever).
        
       | anarazel wrote:
       | Have you checked if using the fio options (--iodepth_batch_*) to
       | batch submissions helps? Fio doesn't do that by default, and I
       | found that that can be a significant benefit.
       | 
        | In particular, submitting multiple requests at once can
        | amortize the cost of ringing the NVMe doorbell (the expensive
        | part, as far as I understand it) across multiple requests.
        
         | tanelpoder wrote:
         | I tested various fio options, but didn't notice this one - I'll
         | check it out! It might explain why I still kept seeing lots of
         | interrupts raised even though I had enabled the I/O completion
         | polling instead, with io_uring's --hipri option.
         | 
         | edit: I ran a quick test with various IO batch sizes and it
         | didn't make a difference - I guess because thanks to using
         | io_uring, my bottleneck is not in IO submission, but deeper in
         | the block IO stack...
        
           | wtallis wrote:
           | I think on recent kernels, using the hipri option doesn't get
           | you interrupt-free polled IO unless you've configured the
           | nvme driver to allocate some queues specifically for polled
           | IO. Since these Samsung drives support 128 queues and you're
           | only using a 16C/32T processor, you have more than enough for
           | each drive to have one poll queue and one regular IO queue
           | allocated to each (virtual) CPU core.
        
             | tanelpoder wrote:
             | That would explain it. Do you recommend any docs/links I
             | should read about allocating queues for polled IO?
        
               | anarazel wrote:
               | It's terribly documented :(. You need to set the
               | nvme.poll_queues to the number of queues you want, before
               | the disks are attached. I.e. either at boot, or you need
               | to set the parameter and then cause the NVMe to be
               | rescanned (you can do that in sysfs, but I can't
               | immediately recall the steps with high confidence).
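                | 
                | For the boot-time route, a minimal sketch (assuming the
                | nvme driver is built as a module; the conf file name is
                | arbitrary, and if the driver is built into the kernel
                | the equivalent is nvme.poll_queues=N on the kernel
                | command line):
                | 
                | # set the module parameter at load time
                | echo "options nvme poll_queues=8" | \
                |     sudo tee /etc/modprobe.d/nvme-poll.conf
                | # regenerate the initramfs so it applies at boot, then
                | # reboot and verify:
                | cat /sys/module/nvme/parameters/poll_queues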
        
               | anarazel wrote:
                | Ah, yes, shell history ftw. Of course you should ensure
                | no filesystem is mounted or such:
                | 
                | root@awork3:~# echo 4 > /sys/module/nvme/parameters/poll_queues
                | root@awork3:~# echo 1 > /sys/block/nvme1n1/device/reset_controller
                | root@awork3:~# dmesg -c
                | [749717.253101] nvme nvme1: 12/0/4 default/read/poll queues
                | root@awork3:~# echo 8 > /sys/module/nvme/parameters/poll_queues
                | root@awork3:~# dmesg -c
                | root@awork3:~# echo 1 > /sys/block/nvme1n1/device/reset_controller
                | root@awork3:~# dmesg -c
                | [749736.513102] nvme nvme1: 8/0/8 default/read/poll queues
        
               | tanelpoder wrote:
               | Thanks for the pointers, I'll bookmark this and will try
               | it out someday.
        
           | anarazel wrote:
           | > I tested various fio options, but didn't notice this one -
           | I'll check it out! It might explain why I still kept seeing
           | lots of interrupts raised even though I had enabled the I/O
           | completion polling instead, with io_uring's --hipri option.
           | 
           | I think that should be independent.
           | 
           | > edit: I ran a quick test with various IO batch sizes and it
           | didn't make a difference - I guess because thanks to using
           | io_uring, my bottleneck is not in IO submission, but deeper
           | in the block IO stack...
           | 
            | It probably won't get you drastically higher speeds in an
            | isolated test - but it should help reduce CPU overhead.
            | E.g. on one of my SSDs,
            | 
            | fio --ioengine io_uring --rw randread --filesize 50GB
            | --invalidate=0 --name=test --direct=1 --bs=4k --numjobs=1
            | --registerfiles --fixedbufs --gtod_reduce=1 --iodepth 48
            | 
            | uses about 25% more CPU than when I add
            | --iodepth_batch_submit=0 --iodepth_batch_complete_max=0.
            | But the resulting IOPS are nearly the same as long as there
            | are enough cycles available.
           | 
           | This is via filesystem, so ymmv, but the mechanism should be
           | mostly independent.
        
       | tanelpoder wrote:
       | Author here: This article was intended to explain some modern
       | hardware bottlenecks (and non-bottlenecks), but unexpectedly
       | ended up covering a bunch of Linux kernel I/O stack issues as
       | well :-) AMA
        
         | jeffbee wrote:
         | Great article, I learned! Can you tell me if you looked into
         | aspects of the NVMe device itself, such as whether it supports
         | 4K logical blocks instead of 512B? Use `nvme id-ns` to read out
         | the supported logical block formats.
        
           | tanelpoder wrote:
            | Doesn't seem to support 4k out of the box? Some drives -
            | like Intel Optane SSDs - allow changing this in firmware
            | (and reformatting) with a manufacturer's utility...
            | 
            | $ lsblk -t /dev/nvme0n1
            | NAME    ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
            | nvme0n1         0    512      0     512     512    0 none     1023 128    0B
            | 
            | $ sudo nvme id-ns -H /dev/nvme0n1 | grep Size
            | LBA Format  0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best (in use)
        
             | jeffbee wrote:
             | Thanks for checking. SSD review sites never mention this
             | important detail. For some reason the Samsung datacenter
             | SSDs support 4K LBA format, and they are very similar to
              | the retail SSDs, which don't seem to. I have a retail
              | 970 Evo that only provides 512B.
        
               | wtallis wrote:
               | I just checked my logs, and none of Samsung's consumer
               | NVMe drives have ever supported sector sizes other than
               | 512B. They seem to view this feature as part of their
               | product segmentation strategy.
               | 
               | Some consumer SSD vendors do enable 4kB LBA support. I've
               | seen it supported on consumer drives from WD, SK hynix
               | and a variety of brands using Phison or SMI SSD
               | controllers (including Kingston, Seagate, Corsair,
               | Sabrent). But I haven't systematically checked to see
               | which brands consistently support it.
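                | 
                | On drives that do list a 4kB format, switching is a
                | destructive nvme-cli format operation - a sketch (the
                | LBA format index 1 is just an example; check nvme id-ns
                | -H first to see which index is the 4096-byte one):
                | 
                | # WARNING: erases all data on the namespace
                | # (--lbaf=1 is an example index, not universal)
                | sudo nvme format /dev/nvme0n1 --lbaf=1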
        
               | 1996 wrote:
               | Is it genuine 512?
               | 
                | As in, what ashift value do you use with ZFS?
        
               | wtallis wrote:
               | Regardless of what sector size you configure the SSD to
               | expose, the drive's flash translation layer still manages
               | logical to physical mappings at a 4kB granularity, the
               | underlying media page size is usually on the order of
               | 16kB, and the erase block size is several MB. So what
               | ashift value you want to use depends very much on what
               | kind of tradeoffs you're okay with in terms of different
               | aspects of performance and write endurance/write
               | amplification. But for most flash-based SSDs, there's no
               | reason to set ashift to anything less than 12
               | (corresponding to 4kB blocks).
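                | 
                | For example, a sketch of forcing 4kB alignment at pool
                | creation (the pool name and device are placeholders):
                | 
                | # ashift=12 means 2^12 = 4096-byte allocation blocks
                | # ("tank" and /dev/nvme0n1 are placeholders)
                | sudo zpool create -o ashift=12 tank /dev/nvme0n1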
        
             | guerby wrote:
              | Here is an article about the nvme-cli tool:
              | 
              | https://nvmexpress.org/open-source-nvme-management-
              | utility-n...
              | 
              | On a Samsung SSD 970 EVO 1TB it seems only 512-byte LBAs
              | are supported:
              | 
              | # nvme id-ns /dev/nvme0n1 -n 1 -H | grep "^LBA Format"
              | LBA Format  0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best (in use)
        
         | rafaelturk wrote:
          | Thanks for the well-written article - it makes me think about
          | the inefficiencies in our over-hyped cloud environments.
        
           | tanelpoder wrote:
           | Oh yes - and incorrectly configured on-premises systems too!
        
         | sitkack wrote:
         | Could you explain some of your thought processes and
         | methodologies when approaching problems like this?
         | 
         | What is your mental model like? How much experimentation do you
          | do versus reading kernel code? How do you know what questions
         | to start asking?
         | 
         | *edit, btw I understand that a response to these questions
         | could be an entire book, you get the question-space.
        
           | tanelpoder wrote:
           | Good question. I don't ever read kernel code as a starting
           | point, only if some profiling or tracing tool points me
            | towards an interesting function or codepath. "Interesting"
            | usually means something that takes the most CPU in perf
            | output, or some function call with an unusually high
            | latency in ftrace or bcc/bpftrace script output. Or just a
            | stack trace in a core-
           | or crashdump.
           | 
           | As far as mindset goes - I try to apply the developer mindset
           | to system performance. In other words, I don't use much of
           | what I call the "old school sysadmin mindset", from a time
           | where better tooling was not available. I don't use
           | systemwide utilization or various get/hit ratios for doing
           | "metric voodoo" of Unix wizards.
           | 
           | The developer mindset dictates that everything you run is an
           | application. JVM is an application. Kernel is an application.
           | Postgres, Oracle are applications. All applications execute
           | one or more threads that run on CPU or do not run on CPU.
           | There are only two categories of reasons why a thread does
           | not run on CPU (is sleeping): The OS put the thread to sleep
           | (involuntary blocking) or the thread voluntarily wanted to go
           | to go to sleep (for example, it realized it can't get some
           | application level lock).
           | 
           | And you drill down from there. Your OS/system is just a bunch
           | of threads running on CPU, sleeping and sometimes
           | communicating with each other. You can _directly_ measure all
           | of these things easily nowadays with profilers, no need for
           | metric voodoo.
           | 
           | I have written my own tools to complement things like perf,
           | ftrace and BPF stuff - as a consultant I regularly see 10+
           | year old Linux versions, etc - and I find sampling thread
           | states from /proc file system is a really good (and flexible)
           | starting point for system performance analysis and even some
           | drilldown - all this without having to install new software
           | or upgrading to latest kernels. Some of the tools I showed in
           | my article too:
           | 
           | https://tanelpoder.com/psnapper & https://0x.tools
           | 
            | At the end of my post I mentioned that I'll do a webinar
           | "hacking session" next Thursday, I'll show more how I work
           | there :-)
        
         | vinay_ys wrote:
          | Very cool rig and benchmark. Kudos. Request: add network I/O
          | load to your benchmark while the NVMe I/O load is running.
        
           | tanelpoder wrote:
           | Thanks, will do in a future article! I could share the disks
           | out via NFS or iSCSI or something and hammer them from a
           | remote machine...
        
         | PragmaticPulp wrote:
         | This is a great article. Thanks for writing it up and sharing.
        
         | guerby wrote:
         | 71 GB/s is 568 Gbit/s so you'll need about 3 dual 100 Gbit/s
         | cards to pump data out at the rate you can read it from the
         | NVMe drives.
         | 
          | And Ethernet (unless using jumbo frames on the LAN) is about
          | 1.5 kB per frame (not 4kB).
         | 
         | One such PC should be able to do 100k simultaneous 5 Mbps HD
         | streams.
         | 
         | Testing this would be fun :)
        
           | zamadatix wrote:
           | Mellanox has a 2x200 Gbps NIC these days. Haven't gotten to
           | play with it yet though.
        
             | tanelpoder wrote:
             | Which NICs would you recommend for me to buy for testing at
             | least 1x100 Gbps (ideally 200 Gbps?) networking between
             | this machine (PCIe 4.0) and an Intel Xeon one that I have
             | with PCIe 3.0. Don't want to spend much money, so the cards
             | don't need to be too enterprisey, just fast.
             | 
             | And - do such cards even allow direct "cross" connection
             | without a switch in between?
        
               | drewg123 wrote:
                | All 100G is enterprisey.
               | 
               | For a cheap solution, I'd get a pair of used Mellanox
               | ConnectX4 or Chelsio T6, and a QSFP28 direct attach
               | copper cable.
        
               | zamadatix wrote:
               | +1 on what the sibling comment said.
               | 
                | As for directly connecting them: absolutely, works
                | great. I'd recommend a cheap DAC off fs.com to connect
                | them in that case.
        
           | drewg123 wrote:
           | At Netflix, I'm playing with an EPYC 7502P with 16 NVME and
           | dual 2x100 Mellanox ConnectX6-DX NICs. With hardware kTLS
           | offload, we're able to serve about 350Gb/s of real customer
           | traffic. This goes down to about 240Gb/s when using software
           | kTLS, due to memory bandwidth limits.
           | 
           | This is all FreeBSD, and is the evolution of the work
           | described in my talk at the last EuroBSDCon in 2019:
           | https://papers.freebsd.org/2019/eurobsdcon/gallatin-
           | numa_opt...
        
             | ksec wrote:
             | >we're able to serve about 350Gb/s of real customer
             | traffic.
             | 
              | I still remember the post about breaking the 100Gbps
              | barrier - that was maybe in 2016 or '17? And it wasn't
              | that long ago that it was 200Gbps, and if I remember
              | correctly it was hitting a memory bandwidth barrier as
              | well.
             | 
             | And now 350Gbps?!
             | 
              | So what's next? Wait for DDR5? Or move to some memory
              | controller black magic like POWER10?
        
               | drewg123 wrote:
               | Yes, before hardware inline kTLS offload, we were limited
                | to 200Gb/s or so with Naples. With Rome, it's a bit
               | higher. But hardware inline kTLS with the Mellanox CX6-DX
               | eliminates memory bandwidth as a bottleneck.
               | 
                | The current bottleneck is IO related, and it's unclear
                | what the issue is. We're working with the hardware
                | vendors to try to figure it out. We should be getting
                | about 390Gb/s.
        
               | ksec wrote:
               | Oh wow! Cant wait to hear more about it.
        
           | tanelpoder wrote:
           | I should (finally) receive my RTX 3090 card today (PCIe 4.0
           | too!), I guess here goes my weekend (and the following
           | weekends over a couple of years)!
        
         | tarasglek wrote:
          | You should look at CPU usage. There is a good chance all your
          | interrupts are hitting CPU 0. You can run hwloc to see which
          | chiplet the PCIe cards are on and handle interrupts on those
          | cores.
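          | 
          | A quick way to check both (a sketch - the hwloc package
          | provides lstopo-no-graphics, and device names will differ):
          | 
          | # see how the NVMe completion interrupts spread across CPUs
          | grep nvme /proc/interrupts
          | 
          | # show which NUMA node / PCIe root each NVMe device and NIC
          | # hangs off
          | lstopo-no-graphics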
        
           | jeffbee wrote:
            | Why would that happen with the Linux NVMe stack that puts a
            | completion queue on each CPU?
        
             | wtallis wrote:
             | I think that in addition to allocating a queue per CPU, you
             | need to be able to allocate a MSI(-X) vector per CPU. That
             | shouldn't be a problem for the Samsung 980 PRO, since it
             | supports 128 queues and 130 interrupt vectors.
        
           | tanelpoder wrote:
           | Thanks for the "hwloc" tip. I hadn't thought about that.
           | 
           | I was thinking of doing something like that. Weirdly I got
           | sustained throughput differences when I killed & restarted
           | fio. So, if I got 11M IOPS, it stayed at that level until I
           | killed fio & restarted. If I got 10.8M next, it stayed like
           | it until I killed & restarted it.
           | 
           | This makes me think that I'm hitting some PCIe/memory
           | bottleneck, dependent on process placement (which process
           | happens to need to move data across infinity fabric due to
           | accessing data through a "remote" PCIe root complex or
           | something like that). But then I realized that Zen 2 has a
           | central IO hub again, so there shouldn't be a "far edge of
           | I/O" like on current gen Intel CPUs (?)
           | 
           | But there's definitely some workload placement and
           | I/O-memory-interrupt affinity that I've wanted to look into.
           | I could even enable the NUMA-like-mode from BIOS, but again
           | with Zen 2, the memory access goes through the central
           | infinity-fabric chip too, I understand, so not sure if
           | there's any value in trying to achieve memory locality for
           | individual chiplets on this platform (?)
        
             | wtallis wrote:
             | The PCIe is all on a single IO die, but internally it is
             | organized into quadrants that can produce some NUMA
             | effects. So it is probably worth trying out the motherboard
             | firmware settings to expose your CPU as multiple NUMA
             | nodes, and using the FIO options to allocate memory only on
             | the local node, and restricting execution to the right
             | cores.
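              | 
              | If fio was built with libnuma support, a sketch of that
              | (node 0 here is just an example):
              | 
              | # bind fio's CPUs and memory to a single NUMA node
              | fio --name=numatest --filename=/dev/nvme0n1 --direct=1 \
              |     --ioengine=io_uring --rw=randread --bs=4k --iodepth=32 \
              |     --numa_cpu_nodes=0 --numa_mem_policy=bind:0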
        
               | tanelpoder wrote:
               | Yep, I enabled the "numa-like-awareness" in BIOS and ran
               | a few quick tests to see whether the NUMA-aware
               | scheduler/NUMA balancing would do the right thing and
               | migrate processes closer to their memory over time, but
               | didn't notice any benefit. But yep I haven't manually
               | locked down the execution and memory placement yet. This
               | placement may well explain why I saw some ~5% throughput
               | fluctuations _only if killing & restarting fio_ and not
               | while the same test was running.
        
               | syoc wrote:
                | I have done some tests on AMD servers and the Linux
                | scheduler does a pretty good job. I do, however, get
                | noticeably (a couple percent) better performance by
                | forcing the process to run on the correct NUMA node.
                | 
                | Make sure you get as many NUMA domains as possible in
                | your BIOS settings.
                | 
                | I recommend using numactl with the cpu-exclusive and
                | mem-exclusive flags. I have noticed a slight performance
                | drop when the RAM cache fills beyond the sticks local to
                | the CPUs doing the work.
                | 
                | One last comment: you mentioned interrupts being
                | "striped" among CPUs. I would recommend pinning the
                | interrupts from one disk to one NUMA-local CPU and using
                | numactl to run fio for that disk on the same CPU. An
                | additional experiment is, if you have enough cores, to
                | pin interrupts to CPUs local to the disk, but use other
                | cores on the same NUMA node for fio. That has been my
                | most successful setup so far.
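                | 
                | A rough sketch of that kind of setup (the IRQ number
                | and node are examples - look up the real ones in
                | /proc/interrupts and lstopo):
                | 
                | # pin one drive's completion interrupt to CPU 4
                | # (123 is a placeholder IRQ number)
                | echo 4 | sudo tee /proc/irq/123/smp_affinity_list
                | 
                | # run fio for that disk on the drive's local NUMA node
                | numactl --cpunodebind=0 --membind=0 \
                |     fio --name=pintest --filename=/dev/nvme0n1 --direct=1 \
                |         --ioengine=io_uring --rw=randread --bs=4k --iodepth=32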
        
         | ksec wrote:
         | I just love this article. Especially when the norm is always
          | about scaling out instead of scaling up. We can have 128-core
          | CPUs, 2TB of memory, PCIe 4.0 SSDs (and soon PCIe 5.0). We
          | could even fit a _Petabyte_ of SSD storage in 1U.
         | 
          | I remember WhatsApp used to operate its _500M_ users with
          | only a dozen or so large FreeBSD boxes. (Only to be taken
          | apart by Facebook.)
         | 
         | So Thank you for raising awareness. Hopefully the pendulum is
         | swinging back to conceptually simple design.
         | 
         | >I also have a 380 GB Intel Optane 905P SSD for low latency
         | writes
         | 
         | I would love to see that. Although I am waiting for someone to
         | do a review on the Optane SSD P5800X [1]. Random 4K IOPS up to
          | 1.5M with lower than 6 _us_ latency.
         | 
         | [1] https://www.servethehome.com/new-intel-
         | optane-p5800x-100-dwp...
        
           | texasbigdata wrote:
           | Second on Optane.
        
           | phkahler wrote:
           | >> I remember WhatsApp used to operate its 500M user with
           | only a dozen of large FreeBSD boxes.
           | 
            | With 1TB of RAM you can have roughly 128 bytes for every
            | person on earth live in memory. With SSD either as virtual
            | memory or
           | keeping an index in RAM, you can do meaningful work in real
           | time, probably as fast as the network will allow.
        
           | rektide wrote:
           | Intel killing off prosumer optane 2 weeks ago[1] made me so
           | so so sad.
           | 
           | The new P5800X should be sick.
           | 
           | [1] https://news.ycombinator.com/item?id=25805779
        
         | KaiserPro wrote:
         | Excellent write up.
         | 
          | I used to work for a VFX company in 2008. At that point we
          | used Lustre to get high-throughput file storage.
          | 
          | From memory we had something like 20 racks of servers/disks
          | to get 3-6 gigabytes/s of (sustained) throughput on a 300TB
          | filesystem.
         | 
         | It is hilarious to think that a 2u box can now theoretically
         | saturate 2x100gig nics.
        
       | qaq wrote:
       | Would be cool to see pgbench score for this setup
        
       | namero999 wrote:
       | You should be farming Chia on that thing [0]
       | 
       | Amazing, congrats!
       | 
       | [0] https://github.com/Chia-Network/chia-blockchain/wiki/FAQ
        
       | jayonsoftware1 wrote:
       | https://www.asus.com/us/Motherboard-Accessories/HYPER-M-2-X1...
       | vs https://highpoint-tech.com/USA_new/nvme_raid_controllers.htm .
        | One card is about 10x more expensive, but it looks like the
        | performance is the same. Am I missing something?
        
         | tanelpoder wrote:
         | The ASUS one doesn't have its own RAID controller nor PCIe
         | switch onboard. It relies on the motherboard-provided PCIe
         | bifurcation and if using hardware RAID, it'd use AMD's built-in
         | RAID solution (but I'll use software RAID via Linux dm/md). The
         | HighPoint SSD7500 seems to have a proprietary RAID controller
         | built in to it and some management/monitoring features too
         | (it's the "somewhat enterprisey" version)
        
           | wtallis wrote:
           | The HighPoint card doesn't have a hardware RAID controller,
           | just a PCIe switch and an option ROM providing boot support
           | for their software RAID.
           | 
           | PCIe switch chips were affordable in the PCIe 2.0 era when
           | multi-GPU gaming setups were popular, but Broadcom decided to
           | price them out of the consumer market for PCIe 3 and later.
        
             | tanelpoder wrote:
             | Ok, thanks, good to know. I misunderstood from their
             | website.
        
             | rektide wrote:
             | pcie switches getting expensive is so the suck.
        
       | qaq wrote:
       | Now price this in terms of AWS and marvel at the markup
        
         | speedgoose wrote:
         | I'm afraid Jeff Bezos himself couldn't afford such IOs on AWS.
        
       | nwmcsween wrote:
        | So Linus was wrong in his rant to Dave about the page cache
        | being detrimental on fast devices.
        
       | ogrisel wrote:
       | As a nitpicking person, I really like to read a post that does
       | not confuse GB/s for GiB/s :)
       | 
       | https://en.wikipedia.org/wiki/Byte#Multiple-byte_units
        
         | ogrisel wrote:
         | Actually now I realize that the title and the intro paragraph
         | are contradicting each other...
        
           | tanelpoder wrote:
           | Yeah, I used the formally incorrect GB in the title when I
           | tried to make it look as simple as possible... GiB just
           | didn't look as nice in the "marketing copy" :-)
           | 
           | I may have missed using the right unit in some other sections
           | too. At least I hope that I've conveyed that there's a
           | difference!
        
       ___________________________________________________________________
       (page generated 2021-01-29 23:00 UTC)