[HN Gopher] RAM is the new disk - and how to measure its perform...
       ___________________________________________________________________
        
       RAM is the new disk - and how to measure its performance (2015)
        
       Author : tanelpoder
       Score  : 52 points
       Date   : 2021-01-21 19:33 UTC (3 hours ago)
        
 (HTM) web link (tanelpoder.com)
 (TXT) w3m dump (tanelpoder.com)
        
       | vlovich123 wrote:
       | I think how DMA operates needs another serious look. Right now we
       | have to fetch everything into the CPU before we can make
       | decisions. What if we had asynchronous HW embedded within the
       | memory that could be given a small (safe) executable program to
        | process the memory in-place rather than evaluating it on the CPU?
       | In other words, a linked list would be much faster and simpler to
       | traverse.
       | 
       | A lot of the software architecture theory we learn is based on
       | existing HW paradigms without much thought being given to how we
       | can change HW paradigms. By nature HW is massively parallel but
       | where physical distance from compute = latency (vs the ultimately
       | serial execution nature of traditional CPUs that can process all
       | data at blistering speed but only one at a time with some SIMD
       | exceptions). There are real-world benefits to this kind of design
       | - memory is cheap and simple to manufacture and abundantly
       | available. The downside though is that the CPU is sitting doing
       | nothing but waiting for memory most of the time, especially when
       | processing large data sets.
       | 
       | Imagine how efficient a GC algorithm would be if it could compute
       | a result in the background just doing a concurrent mark and
       | sweep, perhaps as part of a DRAM refresh cycle so that you could
       | even choose to stop refreshing that RAM because your application
       | no longer needs that row.
       | 
       | The power and performance savings are pretty enticing.
        
         | cogman10 wrote:
         | Interestingly enough, this is why Simultaneous multithreading
         | [1] exists!
         | 
         | The revelation that "The CPU could be doing useful work while
          | stalled waiting on a load" has led to CPU designers "faking"
         | the number of cores available to the OS to allow the CPU to do
         | more useful work while different "threads" are paused waiting
         | on memory to come back with the data they need.
         | 
         | [1] https://en.wikipedia.org/wiki/Simultaneous_multithreading
        
           | tanelpoder wrote:
            | And there are all kinds of manual prefetch instructions
            | (for compilers to use) and hardware prefetching algorithms
            | & predictors built into the CPUs too!
        
           | nomel wrote:
           | I think the point is that this still eats into CPU <-> memory
           | bandwidth. Offloading could let the CPU use that memory for
           | better purposes, especially since the stall is from memory
           | access anyways.
        
             | tanelpoder wrote:
             | When doing pointer chasing, then it's gonna be more of a
             | latency problem, there can be plenty of memory bandwidth
             | available, but we don't know which memory line we want to
             | go to next, before the previous line (where the pointer
             | resides) has been loaded. So, the CPUs spend a lot of time
             | in "stalled backend" mode due to the big difference in CPU
             | cycle latency vs RAM access latency.
             | 
             | Offloading some operations closer to the RAM would be a
             | hardware solution (like the Oracle/Sun SPARC DAX I
             | mentioned in a separate comment).
             | 
             | Or you could design your software to rely more on the
             | memory throughput (scanning through columnar structures) vs
             | memory latency (pointer chasing).
             | 
             | Btw, even with pointer-chasing, you could optimize the
             | application to work mostly from the CPU cache (assuming
             | that the CPUs don't do concurrent writes into these cache
             | lines all the time), but this would require not only
             | different application code, but different underlying data
             | structures too. That's pretty much what my article series
              | is about - fancy CPU throughput features like SIMD would
             | not be very helpful, if the underlying data (and memory)
             | structures don't support their way of thinking.
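              | 
              | A rough C sketch of the two access patterns (made-up
              | node layout, not code from the article):
              | 
              |     #include <stddef.h>
              |     #include <stdint.h>
              | 
              |     /* latency-bound: every load depends on the previous
              |        one, so each hop can stall for a full RAM access */
              |     struct node { struct node *next; int64_t val; };
              | 
              |     int64_t sum_list(struct node *n) {
              |         int64_t s = 0;
              |         while (n) { s += n->val; n = n->next; }
              |         return s;
              |     }
              | 
              |     /* throughput-bound: addresses are known up front, so
              |        the hardware prefetcher (and SIMD) can stream it */
              |     int64_t sum_column(const int64_t *col, size_t n) {
              |         int64_t s = 0;
              |         for (size_t i = 0; i < n; i++) s += col[i];
              |         return s;
              |     }
              | 
              | Same amount of data either way, but once the working set
              | falls out of cache the cycles-per-element differ hugely.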
        
             | geocar wrote:
              | Linked-list chasing (as in vlovich123's example) isn't
              | bandwidth-limited as much as it is latency-limited: SMT
              | helps here
             | because you're able to enqueue multiple wait-states
             | "simultaneously".
        
               | jeffbee wrote:
               | Anything you can do with SMT you can do without SMT using
               | ILP instead. SMT does not grant a CPU extra cache-filling
               | resources.
        
               | geocar wrote:
               | SMT is easier to program.
               | 
                | Walking a single linked-list is nothing; a garbage
               | collector walks lots of them that tend to fan out for a
               | bit.
        
               | gpsar wrote:
               | Software SMT is even better for concurrent traversals as
               | you are not limited by the number of hw contexts
               | https://www.linkedin.com/pulse/dont-stall-multitask-
               | georgios...
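                | 
                | The core trick is roughly this (toy sketch, not the
                | code from the article): walk several independent lists
                | in one loop, so the out-of-order core has several
                | cache misses in flight instead of one serial chain:
                | 
                |     struct node { struct node *next; long val; };
                | 
                |     long sum_k_lists(struct node **cur, int k) {
                |         long s = 0;
                |         for (int live = k; live; ) {
                |             live = 0;
                |             for (int i = 0; i < k; i++) {
                |                 struct node *n = cur[i];
                |                 if (!n) continue;
                |                 s += n->val;       /* work on list i */
                |                 cur[i] = n->next;  /* miss overlaps  */
                |                 live++;            /* with the rest  */
                |             }
                |         }
                |         return s;
                |     }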
        
               | jeffbee wrote:
               | SMT gives you some probability that thread X and thread Y
               | are under-using the resources of the CPU but if you lose
               | that bet you get antagonism. On Intel CPUs in particular
               | there are only 2 slots for filling cache lines, and
               | filling them randomly from main memory takes many cycles,
               | so two threads can easily get starved.
               | 
               | If thread X is chasing a linked list and thread Y is
               | compressing an MPEG then it's brilliant.
        
           | fctorial wrote:
           | They aren't faking the number. A hyperthreaded core is two
            | separate cores that share ALUs, FPUs, etc.
        
             | tanelpoder wrote:
             | I think it's more correct to say that a single core has two
             | "sets of registers" (and I guess instruction
             | decoders/dispatchers perhaps)... so it's a single core with
             | 2 "execution entry points"?
        
             | jzwinck wrote:
             | This is not how Intel sees it. They describe hyperthreading
             | as a single core having two (or more, but usually two) sets
             | of architectural state which basically means registers.
             | It's not two cores, it is one core that can switch between
             | two different instruction pointers. They share almost
             | everything else apart from the APIC.
        
         | lalaithion wrote:
         | Even just "go to memory location X+Y, load the value there, set
         | X = that value, repeat N times" would allow fast linked list
          | traversal, as well as speeding up various other kinds of
          | higher-level pointer chasing.
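          | 
          | Spelled out as plain C, that primitive is just (names made
          | up here):
          | 
          |     #include <stdint.h>
          | 
          |     /* x = mem[x + y], repeated n times: one long dependent
          |        chain, i.e. exactly the loop you'd want to hand off
          |        to an engine sitting next to the DRAM */
          |     uintptr_t chase(const uint8_t *mem, uintptr_t x,
          |                     uintptr_t y, unsigned n) {
          |         while (n--)
          |             x = *(const uintptr_t *)(mem + x + y);
          |         return x;
          |     }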
        
         | 55873445216111 wrote:
         | There are some memories supporting basic in-memory operations.
         | For example: https://mosys.com/products/blazar-family/be3rmw-
         | bandwidth-en.... This supports operations like read-modify-
         | write within the memory device itself. (I have no affiliation
         | with this company.)
         | 
         | The barrier to adoption of this is not technical, it's
          | economic. The memory industry has focused on making the highest
         | capacity and lowest cost/bit products. This drives high
         | manufacturing volume which drives economies of scale. Memory
         | products with integrated functions are inherently niche, and
         | therefore do not have anywhere near the market size and economy
         | of scale. Designers have decided (historically) that it is
         | cheaper at the system level to keep the logic operations within
         | the CPU and use a "dumb" commodity memory, even though this
         | necessitates more bandwidth usage. (It's a complex engineering
         | trade-off.)
         | 
         | With logic performance continuing to scale faster than memory
         | bandwidth, at some point an architecture that reduces the
         | required memory bandwidth (such as computing in-memory) might
         | start to make sense economically.
        
         | d_tr wrote:
         | A few months ago I wanted to take a look at the Gen-Z fabric
         | specifications, but unfortunately they still have a lame
         | members-only download request form in place.
        
         | tanelpoder wrote:
         | Oracle (Sun) latest CPUs support something called DAX (data
         | analytics extension, not the same thing as the Intel DAX -
         | direct access extensions). It's a coprocessor that allows
         | offloading simpler operations closer to RAM, apparently:
         | 
         | "DAX is an integrated co-processor which provides a specialized
         | set of instructions that can run very selective functionality -
         | Scan, Extract, Select, Filter, and Translate - at fast speeds.
         | The multiple DAX share the same memory interface with the
         | processors cores so the DAX can take full advantage of the
         | 140-160 GB/sec memory bandwidth of the SPARC M7 processor."
         | 
         | https://blogs.oracle.com/bestperf/accelerating-spark-sql-usi...
        
         | dragontamer wrote:
         | > Right now we have to fetch everything into the CPU before we
         | can make decisions.
         | 
          | Think about virtual memory. The memory that the program
          | thinks is at location #6000 is NOT actually at physical
          | location #6000.
         | 
         | In fact, the program might be reading / writing to memory
         | location 0x08004000 (or close to that, whatever the magic start
          | address was on various OSes like Linux / Windows). And then
          | your CPU translates that virtual address to memory-stick #0
         | column#4000 or whatever.
         | 
         | Because of virtual memory: all memory operations must be
         | translated by the CPU before you actually go to RAM (and that
         | translation may require a page-table walk in the worst case)
        
           | ben509 wrote:
           | Well, all memory operations must be translated by an MMU. So
           | can the MMU be a distributed beast that lives in the CPU, the
           | memory and the IO?
        
             | dragontamer wrote:
             | > So can the MMU be a distributed beast that lives in the
             | CPU, the memory and the IO?
             | 
             | That "blah = blah->next" memory operation could be:
             | 
             | * Going to swap thanks to swap / overcommit behavior on
             | Linux/Windows.
             | 
             | * Or going to a file thanks to mmap
             | 
             | * Going to Ethernet and to another computer thanks to RDMA
             | 
             | So... no. I'm pretty sure the CPU is the proper place for
             | that kind of routing.
        
               | freeone3000 wrote:
               | why is that? we need the CPU to handle page fault
               | interrupts, in order to populate the RAM. But assuming
               | the page is already in RAM, there's no reason any of the
               | memory accesses actually need to go through the CPU.
               | (hardware can already raise interrupts; if the MMU can
               | raise a page fault indicator, then you might be able to
               | bypass CPU entirely until a new page needs to be loaded)
               | 
               | moreover, if we have support for mmap at the MMU level,
               | we can cut the CPU bottleneck for disk access entirely.
               | the disk controller can already handle DMA, but there's
               | simply no way for things that aren't the CPU to trigger
               | it. DirectStorage is an effort for GPUs to trigger it,
               | but what if we could also trigger it by other means?
        
               | dragontamer wrote:
                | Okay, let's say page-faults are off the table for some
                | reason. Let's think about what can happen even if
               | everything is in RAM still.
               | 
               | * MMAP is still on the table: different processes can
               | share RAM at different addresses. (Process#1 thinks the
               | data is at memory location 0x90000000, Process#2 thinks
               | the data is at 0x70000000, but in both cases, the data is
               | at physical location 0x42).
               | 
               | * Physical location 0x42 is a far-read on a far-away NUMA
               | node. Which means the CPU#0 now needs to send a message
               | to a CPU#1 very far away to get a copy of that RAM. This
                | message traverses Intel UPI or AMD Infinity Fabric
                | (the details are proprietary), but it's a remote
                | message that happens nonetheless.
               | 
               | * Turns out CPU#1 has modified location 0x42. Now CPU#1
               | must push the most recent copy out of L1 cache, into L2
               | cache... then into L3 cache, and then send it back to
               | CPU#0. CPU#0 has to wait until this process is done.
               | 
               | Modern computers work very hard to hold the illusion of a
               | singular memory space.
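                | 
                | The first point is easy to see for yourself: one
                | physical page, two virtual addresses (minimal sketch,
                | error handling omitted, "/hn-demo" is a made-up name;
                | may need -lrt on older glibc):
                | 
                |     #include <fcntl.h>
                |     #include <stdio.h>
                |     #include <string.h>
                |     #include <sys/mman.h>
                |     #include <unistd.h>
                | 
                |     int main(void) {
                |         int prot = PROT_READ | PROT_WRITE;
                |         int fd = shm_open("/hn-demo",
                |                           O_CREAT | O_RDWR, 0600);
                |         ftruncate(fd, 4096);
                |         char *a = mmap(NULL, 4096, prot,
                |                        MAP_SHARED, fd, 0);
                |         char *b = mmap(NULL, 4096, prot,
                |                        MAP_SHARED, fd, 0);
                |         strcpy(a, "same page");   /* write via a */
                |         printf("a=%p b=%p b sees: %s\n",
                |                (void *)a, (void *)b, b);
                |         shm_unlink("/hn-demo");
                |         return 0;
                |     }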
        
               | jeffbee wrote:
               | Pointers are virtual addresses but memory is accessed
               | physically. All of the means of translating virtual to
               | physical are in the CPU. If you are proposing throwing
               | out virtual addressing, I imagine you won't get a lot of
               | support for that idea.
        
         | charlesdaniels wrote:
         | This is indeed an idea that has been coming up from time to
         | time for over 25 years at least. I think the earliest
          | publication in this space was Gokhale's Terasys (DOI
         | 10.1109/2.375174).
         | 
         | It is a good idea in principle, but it's never really taken off
         | to my knowledge. Parallel programming is hard. Deeply embedded
         | programming is hard. The languages we have are mostly bad at
         | both.
         | 
         | If you want to search for more, the keyword is "processor in
         | memory" or "processing in memory" (PIM). Also "in memory
         | computing" is commonly used as well. A number of groups are
         | working on this right now. The new buss is "memristors"
         | (memristive computing). Whether or not any of it actually ends
         | up working or being commercially viable outside of a lab
         | remains to be seen.
        
         | wcerfgba wrote:
         | This reminds me of the Connection Machine architecture [1]
         | 
         | > Each CM-1 microprocessor has its own 4 kilobits of random-
         | access memory (RAM), and the hypercube-based array of them was
         | designed to perform the same operation on multiple data points
         | simultaneously, i.e., to execute tasks in single instruction,
         | multiple data (SIMD) fashion. The CM-1, depending on the
         | configuration, has as many as 65,536 individual processors,
         | each extremely simple, processing one bit at a time.
         | 
         | [1] https://en.wikipedia.org/wiki/Connection_Machine
        
         | BenoitP wrote:
         | > HW embedded within the memory that could be given a small
         | (safe) executable program to process the memory in-place rather
         | than evaluating it on the CPU
         | 
         | Well, that seems to be the exact definition of what UPMEM is
         | doing:
         | 
         | https://www.upmem.com/upmem-announces-silicon-based-processi...
         | 
          | Between the M1, GPUs, TPUs, and RISC-V, interesting times
          | are coming in hardware. I blame the physical limits, which
          | are putting the
         | Duke Nukem development method to an end (you promise the client
         | 2x the same performance in 18 months, then play Duke Nukem the
         | whole time). The only way to better performance now is through
         | hardware specialization. That and avoiding Electron.
        
           | WJW wrote:
           | You'd hope that the only way to better performance was HW
           | specialisation, but there are SO MANY algorithmic
           | improvements still to be made. Just the other day I found a
           | case someone had rolled their own priority queue with a
           | sorted array instead of a heap. In a fairly popular open
           | source library too.
           | 
           | There's still loads of performance to be gained just by being
           | better programmers.
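            | 
            | For reference, the difference is a handful of lines (toy
            | sketch, int keys, caller pre-allocates; a real queue also
            | needs a pop side):
            | 
            |     #include <stddef.h>
            | 
            |     /* sorted-array push: shifts O(n) elements on average */
            |     void sorted_push(int *a, size_t n, int v) {
            |         size_t i = n;
            |         while (i > 0 && a[i - 1] > v) {
            |             a[i] = a[i - 1];
            |             i--;
            |         }
            |         a[i] = v;
            |     }
            | 
            |     /* binary min-heap push: O(log n) swaps */
            |     void heap_push(int *h, size_t n, int v) {
            |         size_t i = n;
            |         h[i] = v;
            |         while (i > 0 && h[(i - 1) / 2] > h[i]) {
            |             int t = h[i];
            |             h[i] = h[(i - 1) / 2];
            |             h[(i - 1) / 2] = t;
            |             i = (i - 1) / 2;
            |         }
            |     }
            | 
            | (For small queues the flat array can still win on cache
            | behaviour, of course.)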
        
             | spockz wrote:
             | How do you propose to fix this? Should languages include
             | high(er?) performance data structures in their standard
             | libraries? Or possibly even include some segmentation for
             | small/medium/huge data sets?
        
             | Arelius wrote:
             | > Just the other day I found a case someone had rolled
             | their own priority queue with a sorted array instead of a
             | heap.
             | 
             | I feel like this story needs an ending? Did somebody re-
              | implement it using a heap and find significant performance
             | wins? Or was the sorted array used on purpose to take
             | advantage of some specific cache constraints and actually
             | end up being a huge win?
        
       | rektide wrote:
       | meanwhile, disk is potentially getting to be as fast as ram,
       | throughput wise.
       | 
       | 128 lanes of pcie 4.0 is 256GBps iirc. epyc's 8 channel ddr4-3200
       | otoh is good for 208GBps.
       | 
       | Let's Encrypt stopped a little short, using 24x nvme disks (it
       | fits in a 2U though so that's nice)[1]. that could be up to 96 of
       | 128 pcie links in use. with the right ssds, working on large
       | data-objects, that'd be somewhere a bit under 192GBps versus the
       | max 208GBps of their ram.
       | 
       | in truth, ram's random access capabilities are far better,
       | there's much less overhead (although nvme is pretty good). and
       | i'm not sure i've ever seen anyone try to confirm that those 128
       | lanes of pcie on epyc aren't oversubscribed, that devices really
       | can push that much data around. note that this doesn't
       | necessarily even have to mean using the cpu; pci p2p is where
       | it's at for in-the-know folks doing nvme, network, and gpu data-
       | pushing; epyc's io-die is acting like a data-packet switch in
       | these conditions, rather than having the cpu process/crunch these
        | peripherals' data.
       | 
       | [1] https://news.ycombinator.com/item?id=25861422
        
         | tanelpoder wrote:
         | Author (of the RAM article) here:
         | 
          | Indeed, you can go further, but you've got to plan for
          | bandwidth needed by other peripherals, data movement and
          | inter-CPU bandwidth (NUMA), and intra-CPU-core bandwidth
          | limitations too (AMD's Infinity Fabric is point-to-point
          | between chiplets, but Intel has a ring-bus architecture for
          | moving bits between
         | CPU cores).
         | 
         | I got my Lenovo ThinkStation P620 workstation (with AMD Zen-2
         | ThreadRipper Pro WX, 8-memory channels like EPYC) to scan 10 x
         | PCIe 4.0 SSDs at 66 GB/s (I had to move SSD cards around so
         | they'd use separate PCIe root complexes to avoid a PCIe <-> CPU
          | data transfer bottleneck). And even when doing I/O through 3
         | PCIe root complexes (out of 4 connected to that CPU), I seem to
         | be hitting some inter-CPU-core bandwidth limitation. The
         | throughput differs depending on which specific CPU cores happen
         | to run the processes doing I/Os against different SSDs.
         | 
         | Planning to publish some blog entries about these I/O tests but
         | a teaser tweet is here (11M IOPS with a single-socket
         | ThreadRipper workstation - it's not even a NUMA server! :-)
         | 
         | https://twitter.com/TanelPoder/status/1352329243070504964
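          | 
          | If you want a quick ballpark on your own box before those
          | posts are out, a bare-bones O_DIRECT scan loop per device
          | already gets you most of the way (just a sketch, not the
          | actual test harness; fio does the same with less typing):
          | 
          |     /* usage: ./scan /dev/nvme0n1 - run one per SSD and
          |        add up the GB/s figures */
          |     #define _GNU_SOURCE            /* for O_DIRECT */
          |     #include <fcntl.h>
          |     #include <stdio.h>
          |     #include <stdlib.h>
          |     #include <time.h>
          |     #include <unistd.h>
          | 
          |     int main(int argc, char **argv) {
          |         (void)argc;
          |         int fd = open(argv[1], O_RDONLY | O_DIRECT);
          |         if (fd < 0) { perror("open"); return 1; }
          |         size_t bs = 1 << 20;             /* 1 MiB reads */
          |         void *buf;
          |         posix_memalign(&buf, 4096, bs);  /* O_DIRECT align */
          |         struct timespec t0, t1;
          |         clock_gettime(CLOCK_MONOTONIC, &t0);
          |         long long total = 0;
          |         ssize_t r;
          |         while (total < (32LL << 30) &&   /* cap at ~32 GiB */
          |                (r = read(fd, buf, bs)) > 0)
          |             total += r;
          |         clock_gettime(CLOCK_MONOTONIC, &t1);
          |         double sec = (t1.tv_sec - t0.tv_sec) +
          |                      (t1.tv_nsec - t0.tv_nsec) / 1e9;
          |         printf("%.2f GB/s\n", total / sec / 1e9);
          |         return 0;
          |     }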
        
           | 1996 wrote:
           | > I had to move SSD cards around so they'd use separate PCIe
           | root complexes to avoid a PCIe <-> CPU data transfer
           | bottleneck
           | 
            | I am doing similar things. Have you considered looking at
            | how to control the PCI lane assignment from software?
           | 
           | Intel HSIO seems to be software configurable - except that
           | usually, it's all done just by the bios.
           | 
           | But as PCI specs allow for both device-side and host-side
           | negotiations, it should be doable without "moving SSDs
           | around"
           | 
           | > The throughput differs depending on which specific CPU
           | cores happen to run the processes doing I/Os against
           | different SSDs.
           | 
           | That strikes me as odd. I would check the detail of the PCI
           | lanes and their routing. You could have something funky going
           | on. My first guess would be that it's slow on one core
           | because it's also handling something else, by design or by
           | accident.
           | 
           | There're some bad hardware designs out there. But thanks to
           | stuff like HSIO, it should now be possible to fix the worst
            | ones by software (how else would the BIOS do it!), just
            | like in the old days of isapnptools!
        
             | tanelpoder wrote:
              | As this is an AMD machine - and a workstation, not a
              | server - perhaps that's why they've restricted it in the
              | BIOS.
             | 
             | I'm not too much of an expert in PCI express - but if this
             | workstation has 4 PCIe root complexes/host bridges, each
             | capable of x32 PCIe 4.0 lanes - and there are no multi-root
             | PCIe switches, wouldn't a lane physically have to
             | communicate with just one PCIe root complex/CPU "port"?
        
           | MayeulC wrote:
           | That bandwidth limitation could be due to Infinity Fabric,
           | which seems to be rated at 42GBps (x2 as full duplex,
           | though)?
           | 
           | https://en.wikichip.org/wiki/amd/infinity_fabric
        
             | tanelpoder wrote:
             | Yes, that's what I'm suspecting too, although with higher
             | clocked RAM, I should have somewhat more bandwidth. My
             | DIMMs are 3200 MT, so should be running at 1600 MHz. But I
             | saw a note (not sure where) that Infinity Fabric can run up
              | to 2933 MT on my machine and that it only runs in sync
              | with memory when the DIMMs are at 2933 MT or below.
              | Unfortunately my BIOS doesn't allow me to downgrade the
              | RAM "clock" from 3200 MT to 2933, thus Infinity Fabric
              | is running "out of sync" with my RAM.
             | 
             | This should mean non-ideal memory access latency at least,
             | not sure how it affects throughput of large sequential
             | transfers.
             | 
             | I'm planning to come up with some additional tests and
             | hopefully write up a "part 2" too.
        
               | 1996 wrote:
               | > Unfortunately my BIOS doesn't allow to downgrade the
               | RAM "clock"
               | 
               | How deep are you willing to go?
               | 
               | The RAM clock is controlled by the memory training
               | algorithms. They use data from the XMP, which can be
               | edited.
               | 
               | The simplest is to reflash your memory sticks to alter
               | their XMP, so the training algorithm will reach the
               | conclusions you want. There's some Windows software to do
               | that.
               | 
               | You could also implement your own MRC, something done by
               | coreboot and the likes.
        
               | tanelpoder wrote:
               | Ha, thanks for the idea! I was briefly thinking of buying
               | 2933 "MHz" RAM for the test (as I later would put it into
               | my other workstation that can go up to 2600 "MHz" only),
               | but then I realized I don't have time for this right now
               | (will do my throughput, performance stability tests first
               | and maybe look into getting the most out of the latency
               | later).
        
       ___________________________________________________________________
       (page generated 2021-01-21 23:01 UTC)