[HN Gopher] RAM is the new disk - and how to measure its perform...
___________________________________________________________________

  RAM is the new disk - and how to measure its performance (2015)

  Author : tanelpoder
  Score  : 52 points
  Date   : 2021-01-21 19:33 UTC (3 hours ago)

  (HTM) web link (tanelpoder.com)
  (TXT) w3m dump (tanelpoder.com)

| vlovich123 wrote:
| I think how DMA operates needs another serious look. Right now we
| have to fetch everything into the CPU before we can make
| decisions. What if we had asynchronous HW embedded within the
| memory that could be given a small (safe) executable program to
| process the memory in-place rather than evaluating it on the CPU?
| In other words, a linked list would be much faster and simpler to
| traverse.
|
| A lot of the software architecture theory we learn is based on
| existing HW paradigms, without much thought being given to how we
| could change those paradigms. By nature, HW is massively parallel,
| but physical distance from compute = latency (vs the ultimately
| serial execution nature of traditional CPUs, which can process
| data at blistering speed but only one item at a time, with some
| SIMD exceptions). There are real-world benefits to this kind of
| design - memory is cheap and simple to manufacture and abundantly
| available. The downside, though, is that the CPU sits doing
| nothing but waiting for memory most of the time, especially when
| processing large data sets.
|
| Imagine how efficient a GC algorithm would be if it could compute
| a result in the background, just doing a concurrent mark and
| sweep, perhaps as part of a DRAM refresh cycle, so that you could
| even choose to stop refreshing a row of RAM once your application
| no longer needs it.
|
| The power and performance savings are pretty enticing.

| cogman10 wrote:
| Interestingly enough, this is why Simultaneous multithreading
| [1] exists!
|
| The revelation that "the CPU could be doing useful work while
| stalled waiting on a load" has led to CPU designers "faking"
| the number of cores available to the OS, to allow the CPU to do
| more useful work while different "threads" are paused waiting
| on memory to come back with the data they need.
|
| [1] https://en.wikipedia.org/wiki/Simultaneous_multithreading

| tanelpoder wrote:
| And all kinds of manual prefetching instructions built into
| the CPUs (for compilers to use), and prefetching algorithms &
| predictors built into the CPUs too!

| nomel wrote:
| I think the point is that this still eats into CPU <-> memory
| bandwidth. Offloading could let the CPU use that bandwidth for
| better purposes, especially since the stall is from memory
| access anyway.

| tanelpoder wrote:
| When doing pointer chasing, it's going to be more of a
| latency problem: there can be plenty of memory bandwidth
| available, but we don't know which memory line we want to
| go to next before the previous line (where the pointer
| resides) has been loaded. So the CPUs spend a lot of time
| in "stalled backend" mode, due to the big difference between
| CPU cycle latency and RAM access latency.
|
| Offloading some operations closer to the RAM would be a
| hardware solution (like the Oracle/Sun SPARC DAX I
| mentioned in a separate comment).
|
| Or you could design your software to rely more on memory
| throughput (scanning through columnar structures) vs
| memory latency (pointer chasing).
|
| Btw, even with pointer-chasing, you could optimize the
| application to work mostly from the CPU cache (assuming
| that the CPUs don't do concurrent writes into these cache
| lines all the time), but this would require not only
| different application code, but different underlying data
| structures too. That's pretty much what my article series
| is about - fancy CPU throughput features like SIMD would
| not be very helpful if the underlying data (and memory)
| structures don't support their way of working.
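
  To make the latency-vs-throughput point above concrete, here is a
  minimal C sketch (not from the thread; the Node layout and names
  are illustrative) contrasting a dependent linked-list walk with a
  linear columnar scan:

      #include <stddef.h>
      #include <stdint.h>

      typedef struct Node { struct Node *next; int64_t payload; } Node;

      /* Latency-bound: each load depends on the previous one, so the
       * core stalls for a full memory round-trip per element
       * ("stalled backend"). */
      int64_t sum_list(const Node *head) {
          int64_t sum = 0;
          for (const Node *n = head; n != NULL; n = n->next)
              sum += n->payload;  /* n->next is unknown until n is loaded */
          return sum;
      }

      /* Throughput-bound: independent sequential loads that the
       * hardware prefetcher can stream at near memory bandwidth. */
      int64_t sum_column(const int64_t *vals, size_t count) {
          int64_t sum = 0;
          for (size_t i = 0; i < count; i++)
              sum += vals[i];
          return sum;
      }

  On typical hardware the second loop can be an order of magnitude
  faster per element, even though both touch the same amount of data.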
| geocar wrote:
| Linked-list chasing (as in vlovich123's example) isn't
| bandwidth-limited so much as latency-limited: SMT helps here
| because you're able to enqueue multiple wait-states
| "simultaneously".

| jeffbee wrote:
| Anything you can do with SMT you can do without SMT using
| ILP instead. SMT does not grant a CPU extra cache-filling
| resources.

| geocar wrote:
| SMT is easier to program.
|
| Walking a single linked list is nothing; a garbage
| collector walks lots of them, and they tend to fan out for a
| bit.

| gpsar wrote:
| Software SMT is even better for concurrent traversals, as
| you are not limited by the number of hw contexts
| https://www.linkedin.com/pulse/dont-stall-multitask-georgios...

| jeffbee wrote:
| SMT gives you some probability that thread X and thread Y
| are under-using the resources of the CPU, but if you lose
| that bet you get antagonism. On Intel CPUs in particular
| there are only 2 slots for filling cache lines, and
| filling them randomly from main memory takes many cycles,
| so two threads can easily get starved.
|
| If thread X is chasing a linked list and thread Y is
| compressing an MPEG then it's brilliant.

| fctorial wrote:
| They aren't faking the number. A hyperthreaded core is two
| separate cores that share the ALU, FPU etc.

| tanelpoder wrote:
| I think it's more correct to say that a single core has two
| "sets of registers" (and I guess instruction
| decoders/dispatchers perhaps)... so it's a single core with
| 2 "execution entry points"?

| jzwinck wrote:
| This is not how Intel sees it. They describe hyperthreading
| as a single core having two (or more, but usually two) sets
| of architectural state, which basically means registers.
| It's not two cores, it is one core that can switch between
| two different instruction pointers. They share almost
| everything else apart from the APIC.

| lalaithion wrote:
| Even just "go to memory location X+Y, load the value there, set
| X = that value, repeat N times" would allow fast linked list
| traversal, as well as doing various high-level pointer chasing
| faster.
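
  Expressed in C, the primitive lalaithion describes boils down to
  the loop below (a sketch of a hypothetical memory-side command,
  not an existing API; x is the start address and y the offset of
  the "next" pointer within each node):

      #include <stdint.h>

      /* Follow the pointer stored at offset y, n times, and return the
       * final address. Run on the CPU, this costs n dependent memory
       * round-trips; an engine sitting next to DRAM could execute it
       * locally and hand back only the final value. */
      uintptr_t chase(uintptr_t x, uintptr_t y, unsigned n) {
          while (n--)
              x = *(uintptr_t *)(x + y);  /* one dependent load per step */
          return x;
      }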
| 55873445216111 wrote:
| There are some memories supporting basic in-memory operations.
| For example: https://mosys.com/products/blazar-family/be3rmw-bandwidth-en...
| This supports operations like read-modify-write within the
| memory device itself. (I have no affiliation with this company.)
|
| The barrier to adoption of this is not technical, it's
| economic. The memory industry has focused on making the highest
| capacity and lowest cost/bit products. This drives high
| manufacturing volume, which drives economies of scale. Memory
| products with integrated functions are inherently niche, and
| therefore do not have anywhere near the market size and economy
| of scale. Designers have decided (historically) that it is
| cheaper at the system level to keep the logic operations within
| the CPU and use a "dumb" commodity memory, even though this
| necessitates more bandwidth usage. (It's a complex engineering
| trade-off.)
|
| With logic performance continuing to scale faster than memory
| bandwidth, at some point an architecture that reduces the
| required memory bandwidth (such as computing in-memory) might
| start to make sense economically.

| d_tr wrote:
| A few months ago I wanted to take a look at the Gen-Z fabric
| specifications, but unfortunately they still have a lame
| members-only download request form in place.

| tanelpoder wrote:
| Oracle's (Sun's) latest CPUs support something called DAX (data
| analytics extension, not the same thing as the Intel DAX -
| direct access extensions). It's a coprocessor that allows
| offloading simpler operations closer to RAM, apparently:
|
| "DAX is an integrated co-processor which provides a specialized
| set of instructions that can run very selective functionality -
| Scan, Extract, Select, Filter, and Translate - at fast speeds.
| The multiple DAX share the same memory interface with the
| processor cores, so the DAX can take full advantage of the
| 140-160 GB/sec memory bandwidth of the SPARC M7 processor."
|
| https://blogs.oracle.com/bestperf/accelerating-spark-sql-usi...

| dragontamer wrote:
| > Right now we have to fetch everything into the CPU before we
| can make decisions.
|
| Think about virtual memory. The data at virtual address #6000 is
| NOT actually stored at physical location #6000.
|
| In fact, the program might be reading / writing to memory
| location 0x08004000 (or close to that, whatever the magic start
| address is on various OSes like Linux / Windows), and the CPU
| translates that virtual address to memory-stick #0,
| column #4000 or whatever.
|
| Because of virtual memory, all memory operations must be
| translated by the CPU before you actually go to RAM (and that
| translation may require a page-table walk in the worst case).

| ben509 wrote:
| Well, all memory operations must be translated by an MMU. So
| can the MMU be a distributed beast that lives in the CPU, the
| memory and the IO?

| dragontamer wrote:
| > So can the MMU be a distributed beast that lives in the
| CPU, the memory and the IO?
|
| That "blah = blah->next" memory operation could be:
|
| * Going to swap, thanks to swap / overcommit behavior on
| Linux/Windows.
|
| * Going to a file, thanks to mmap.
|
| * Going to Ethernet and to another computer, thanks to RDMA.
|
| So... no. I'm pretty sure the CPU is the proper place for
| that kind of routing.

| freeone3000 wrote:
| Why is that? We need the CPU to handle page fault
| interrupts, in order to populate the RAM. But assuming
| the page is already in RAM, there's no reason any of the
| memory accesses actually need to go through the CPU.
| (Hardware can already raise interrupts; if the MMU can
| raise a page fault indicator, then you might be able to
| bypass the CPU entirely until a new page needs to be loaded.)
|
| Moreover, if we had support for mmap at the MMU level,
| we could cut the CPU bottleneck for disk access entirely.
| The disk controller can already handle DMA, but there's
| simply no way for things that aren't the CPU to trigger
| it. DirectStorage is an effort for GPUs to trigger it, but
| what if we could also trigger it by other means?

| dragontamer wrote:
| Okay, let's say page-faults are off the table for some
| reason. Let's think about what can happen even if
| everything is still in RAM.
|
| * mmap is still on the table: different processes can
| share RAM at different addresses (Process #1 thinks the
| data is at memory location 0x90000000, Process #2 thinks
| the data is at 0x70000000, but in both cases the data is
| at physical location 0x42).
|
| * Physical location 0x42 is a far-read on a far-away NUMA
| node, which means CPU#0 now needs to send a message
| to a CPU#1 very far away to get a copy of that RAM. This
| message traverses Intel's UPI or AMD's Infinity
| Fabric (proprietary details), but it's a remote message
| that happens nonetheless.
|
| * Turns out CPU#1 has modified location 0x42. Now CPU#1
| must push the most recent copy out of L1 cache, into L2
| cache... then into L3 cache, and then send it back to
| CPU#0. CPU#0 has to wait until this process is done.
|
| Modern computers work very hard to hold the illusion of a
| singular memory space.
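
  The virtual-to-physical indirection dragontamer describes is
  directly observable on Linux through /proc/self/pagemap (a real
  kernel interface; note that PFNs read back as zero without
  CAP_SYS_ADMIN). A minimal sketch:

      #include <fcntl.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(void) {
          int x = 42;                       /* some variable to translate */
          uintptr_t vaddr = (uintptr_t)&x;
          long pagesize = sysconf(_SC_PAGESIZE);

          int fd = open("/proc/self/pagemap", O_RDONLY);
          if (fd < 0) return 1;

          /* One 64-bit entry per virtual page, indexed by page number. */
          uint64_t entry;
          off_t off = (off_t)(vaddr / pagesize) * sizeof(entry);
          if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
              return 1;
          close(fd);

          if (entry & (1ULL << 63)) {       /* bit 63: present in RAM */
              uint64_t pfn = entry & ((1ULL << 55) - 1);  /* bits 0-54 */
              printf("virtual %p -> physical 0x%llx\n", (void *)vaddr,
                     (unsigned long long)(pfn * pagesize + vaddr % pagesize));
          }
          return 0;
      }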
| jeffbee wrote:
| Pointers are virtual addresses, but memory is accessed
| physically. All of the means of translating virtual to
| physical are in the CPU. If you are proposing throwing
| out virtual addressing, I imagine you won't get a lot of
| support for that idea.

| charlesdaniels wrote:
| This is indeed an idea that has been coming up from time to
| time for over 25 years at least. I think the earliest
| publication in this space was Gokhale's Terasys (DOI
| 10.1109/2.375174).
|
| It is a good idea in principle, but it's never really taken off
| to my knowledge. Parallel programming is hard. Deeply embedded
| programming is hard. The languages we have are mostly bad at
| both.
|
| If you want to search for more, the keyword is "processor in
| memory" or "processing in memory" (PIM). "In-memory
| computing" is commonly used as well. A number of groups are
| working on this right now. The new buzz is "memristors"
| (memristive computing). Whether any of it actually ends
| up working or being commercially viable outside of a lab
| remains to be seen.

| wcerfgba wrote:
| This reminds me of the Connection Machine architecture [1]:
|
| > Each CM-1 microprocessor has its own 4 kilobits of random-
| access memory (RAM), and the hypercube-based array of them was
| designed to perform the same operation on multiple data points
| simultaneously, i.e., to execute tasks in single instruction,
| multiple data (SIMD) fashion. The CM-1, depending on the
| configuration, has as many as 65,536 individual processors,
| each extremely simple, processing one bit at a time.
|
| [1] https://en.wikipedia.org/wiki/Connection_Machine

| BenoitP wrote:
| > HW embedded within the memory that could be given a small
| (safe) executable program to process the memory in-place rather
| than evaluating it on the CPU
|
| Well, that seems to be the exact definition of what UPMEM is
| doing:
|
| https://www.upmem.com/upmem-announces-silicon-based-processi...
|
| Between the M1, GPUs, TPUs and RISC-V, interesting times are
| coming in hardware. I blame the physical limits, which are
| putting the Duke Nukem development method to an end (you promise
| the client 2x the performance in 18 months, then play Duke Nukem
| the whole time). The only way to better performance now is
| through hardware specialization. That, and avoiding Electron.

| WJW wrote:
| You'd hope that the only way to better performance was HW
| specialization, but there are SO MANY algorithmic
| improvements still to be made. Just the other day I found a
| case where someone had rolled their own priority queue with a
| sorted array instead of a heap. In a fairly popular open
| source library, too.
|
| There's still loads of performance to be gained just by being
| better programmers.

| spockz wrote:
| How do you propose to fix this? Should languages include
| high(er) performance data structures in their standard
| libraries? Or possibly even include some segmentation for
| small/medium/huge data sets?

| Arelius wrote:
| > Just the other day I found a case where someone had rolled
| their own priority queue with a sorted array instead of a
| heap.
|
| I feel like this story needs an ending. Did somebody re-
| implement it using a heap and find significant performance
| wins? Or was the sorted array used on purpose, to take
| advantage of some specific cache constraints, and it actually
| ended up being a huge win?
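
  For reference, the asymmetry WJW is pointing at: insertion into a
  sorted array is O(n), while a binary-heap push is O(log n). A
  minimal int max-heap sketch (illustrative only; capacity checks
  omitted):

      #include <stddef.h>
      #include <string.h>

      /* Sorted array: find the slot, then shift the whole tail - O(n). */
      void sorted_insert(int *a, size_t *n, int v) {
          size_t i = 0;
          while (i < *n && a[i] < v)
              i++;
          memmove(&a[i + 1], &a[i], (*n - i) * sizeof a[0]);
          a[i] = v;
          (*n)++;
      }

      /* Binary heap: append, then swap upward while larger than the
       * parent - at most O(log n) swaps. */
      void heap_push(int *h, size_t *n, int v) {
          size_t i = (*n)++;
          h[i] = v;
          while (i > 0 && h[(i - 1) / 2] < h[i]) {
              int tmp = h[i];
              h[i] = h[(i - 1) / 2];
              h[(i - 1) / 2] = tmp;
              i = (i - 1) / 2;
          }
      }

  For small n the sorted array can still win (the memmove is
  sequential and prefetcher-friendly), which is the cache-constraints
  possibility Arelius raises.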
| rektide wrote:
| Meanwhile, disk is potentially getting to be as fast as RAM,
| throughput-wise.
|
| 128 lanes of PCIe 4.0 is ~256GBps, iirc. Epyc's 8-channel
| DDR4-3200, otoh, is good for ~205GBps.
|
| Let's Encrypt stopped a little short, using 24x NVMe disks (it
| fits in a 2U though, so that's nice) [1]. That could be up to 96
| of 128 PCIe lanes in use. With the right SSDs, working on large
| data objects, that'd be somewhere a bit under 192GBps, versus
| the max ~205GBps of their RAM.
|
| In truth, RAM's random access capabilities are far better, and
| there's much less overhead (although NVMe is pretty good). And
| I'm not sure I've ever seen anyone try to confirm that those 128
| lanes of PCIe on Epyc aren't oversubscribed - that devices really
| can push that much data around. Note that this doesn't
| necessarily even have to mean using the CPU; PCIe P2P is where
| it's at for in-the-know folks doing NVMe, network, and GPU data-
| pushing. Epyc's IO die is acting like a data-packet switch in
| these conditions, rather than having the CPU process/crunch these
| peripherals' data.
|
| [1] https://news.ycombinator.com/item?id=25861422

| tanelpoder wrote:
| Author (of the RAM article) here:
|
| Indeed, you can go further, but you've got to plan for the
| bandwidth needed by other peripherals, data movement and
| inter-CPU bandwidth (NUMA), and intra-CPU-core bandwidth
| limitations too (AMD's Infinity Fabric is point-to-point between
| chiplets, but Intel has a ring-bus architecture for moving bits
| between CPU cores).
|
| I got my Lenovo ThinkStation P620 workstation (with an AMD Zen 2
| ThreadRipper Pro WX, 8 memory channels like EPYC) to scan 10 x
| PCIe 4.0 SSDs at 66 GB/s. I had to move SSD cards around so
| they'd use separate PCIe root complexes, to avoid a PCIe <-> CPU
| data transfer bottleneck. And even when doing I/O through 3
| PCIe root complexes (out of the 4 connected to that CPU), I seem
| to be hitting some inter-CPU-core bandwidth limitation: the
| throughput differs depending on which specific CPU cores happen
| to run the processes doing I/Os against different SSDs.
|
| Planning to publish some blog entries about these I/O tests, but
| a teaser tweet is here (11M IOPS with a single-socket
| ThreadRipper workstation - it's not even a NUMA server! :-)
|
| https://twitter.com/TanelPoder/status/1352329243070504964
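
  Since the thread is comparing headline GB/s figures: a rough
  single-threaded read-bandwidth probe is only a few lines of C
  (a sketch; buffer size and pass count are arbitrary, and a single
  core typically falls well short of a platform's aggregate
  200+ GB/s):

      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>

      #define BUF_BYTES (1ULL << 30)  /* 1 GiB, far larger than any cache */
      #define PASSES    8

      int main(void) {
          uint64_t *buf = malloc(BUF_BYTES);
          if (!buf) return 1;
          size_t words = BUF_BYTES / sizeof(uint64_t);
          for (size_t i = 0; i < words; i++)
              buf[i] = i;              /* touch every page up front */

          struct timespec t0, t1;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          uint64_t sink = 0;
          for (int p = 0; p < PASSES; p++)
              for (size_t i = 0; i < words; i++)
                  sink += buf[i];      /* sequential, prefetch-friendly reads */
          clock_gettime(CLOCK_MONOTONIC, &t1);

          double secs = (t1.tv_sec - t0.tv_sec)
                      + (t1.tv_nsec - t0.tv_nsec) / 1e9;
          printf("~%.1f GB/s (checksum %llu)\n",
                 (double)BUF_BYTES * PASSES / secs / 1e9,
                 (unsigned long long)sink);  /* print sink so the loop
                                                isn't optimized away */
          free(buf);
          return 0;
      }

  Compile with optimizations (e.g. gcc -O2) and pin the process to a
  core if you want numbers that are stable across NUMA nodes.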
| 1996 wrote:
| > I had to move SSD cards around so they'd use separate PCIe
| root complexes to avoid a PCIe <-> CPU data transfer
| bottleneck
|
| I am doing similar things. Have you considered looking at how
| to control the PCIe lane assignment in software?
|
| Intel HSIO seems to be software-configurable - except that
| usually it's all done just by the BIOS.
|
| But as the PCI specs allow for both device-side and host-side
| negotiation, it should be doable without "moving SSDs
| around".
|
| > The throughput differs depending on which specific CPU
| cores happen to run the processes doing I/Os against
| different SSDs.
|
| That strikes me as odd. I would check the detail of the PCIe
| lanes and their routing. You could have something funky going
| on. My first guess would be that it's slow on one core
| because it's also handling something else, by design or by
| accident.
|
| There are some bad hardware designs out there. But thanks to
| stuff like HSIO, it should now be possible to fix the worst
| ones in software (how else would the BIOS do it!),
| just like in the old days of isapnptools.

| tanelpoder wrote:
| As this is an AMD machine - and a workstation, not a
| server - perhaps that's why they've restricted it in the BIOS.
|
| I'm not too much of an expert in PCI Express, but if this
| workstation has 4 PCIe root complexes/host bridges, each
| capable of x32 PCIe 4.0 lanes, and there are no multi-root
| PCIe switches, wouldn't a lane physically have to
| communicate with just one PCIe root complex/CPU "port"?

| MayeulC wrote:
| That bandwidth limitation could be due to Infinity Fabric,
| which seems to be rated at 42GBps (x2 as full duplex, though)?
|
| https://en.wikichip.org/wiki/amd/infinity_fabric

| tanelpoder wrote:
| Yes, that's what I'm suspecting too, although with higher-
| clocked RAM I should have somewhat more bandwidth. My
| DIMMs are 3200 MT/s, so they should be running at 1600 MHz. But
| I saw a note (not sure where) that Infinity Fabric can run at up
| to 2933 MT/s on my machine, and that it runs in sync with memory
| only with DIMMs up to 2933 MT/s. Unfortunately my BIOS doesn't
| allow downgrading the RAM "clock" from 3200 MT/s to 2933, thus
| Infinity Fabric is running "out of sync" with my RAM.
|
| This should mean non-ideal memory access latency at least; I'm
| not sure how it affects the throughput of large sequential
| transfers.
|
| I'm planning to come up with some additional tests and
| hopefully write up a "part 2" too.

| 1996 wrote:
| > Unfortunately my BIOS doesn't allow downgrading the RAM
| "clock"
|
| How deep are you willing to go?
|
| The RAM clock is controlled by the memory training
| algorithms. They use data from the XMP, which can be
| edited.
|
| The simplest approach is to reflash your memory sticks to alter
| their XMP, so the training algorithm will reach the
| conclusions you want. There's some Windows software to do
| that.
|
| You could also implement your own MRC, something done by
| coreboot and the like.

| tanelpoder wrote:
| Ha, thanks for the idea! I was briefly thinking of buying
| 2933 "MHz" RAM for the test (as I would later put it into
| my other workstation, which can only go up to 2666 "MHz"),
| but then I realized I don't have time for this right now
| (I will do my throughput and performance stability tests first,
| and maybe look into getting the most out of the latency
| later).
___________________________________________________________________
(page generated 2021-01-21 23:01 UTC)