[HN Gopher] Computing Performance on the Horizon ___________________________________________________________________ Computing Performance on the Horizon Author : mrry Score : 133 points Date : 2021-07-05 14:24 UTC (8 hours ago) web link (brendangregg.com) | hackermeows wrote: | Nice talk, covers a lot of bases. He predicts unikernels are | dead, containers will keep growing, and lightweight VMs will take | over after that. | martinpw wrote: | Slide 26 is interesting - arguing that cloud providers have an | advantage in future CPU design since they can analyze so many | real-world customer workloads directly. | | In previous roles I have worked with CPU vendors who were | very keen on getting access to profiling data from our workloads | for design optimization, and who lamented the fact that it was hard | to get such data and that they were often limited to synthetic | benchmark workloads when tuning new designs. | | So this argument does sound like a valid one, and does imply AWS | etc. will have significant advantages in future designs. | justicezyx wrote: | It's already happening. Intel designs are now significantly | influenced by Google's and Amazon's data center needs. | handrous wrote: | An interesting application of a now-familiar pattern: get lots | of users, spy on them at massive scale, use those data to | dominate some other market in a way that, at most, a single- | digit count of companies in the world could conceivably compete | with (because none but they have anything like the data that | you do). See also: everything to do with "AI". | uluyol wrote: | Amazon, Microsoft, and Google have massive applications and | systems that they run, some of which they sell as a service. | They have plenty of workload data without having to poke | around user VMs. | handrous wrote: | You disagree with slide 26, then? | thechao wrote: | He sort of implies that "just better hardware" will peter out | in the 2030s.
I think he's calling it at least 50 years too | soon. Here's why: (1) I think logic designers are still faffing | about in terms of optimizing their designs; and (2), I think | there's a lot of smart people thinking "incrementally" through | what we'd consider paradigm shifts in HW implementation. That | is, our fabs will just _naturally_ segue into 3D, spintronics, | etc. I think he even mentions 3D circuits? One thing a lot of | people miss is that layout of the design is materially | different in 3D vs 2D: in 2D, layout is NP-hard (complete) | without efficient polynomial approximations; in 3D, layout is | low-order polynomial. The reduction in layout complexity will | allow us to design things that are unthinkable right now, due | to layout constraints & wire congestion. | hobofan wrote: | Aren't "big" 3D circuits unfeasible due to temperature | limitations, though? | a1369209993 wrote: | > Aren't "big" 3D circuits unfeasible due to temperature | limitations, though? | | Less so than you'd think, since (as long as you can keep | leakage current under control) heat is only generated when | the circuits are _active_ - that is, when bits are being | flipped. So you can have arbitrarily large | amounts of increasingly-rarely-used circuitry for various | purposes. | | The naive-but-easy-to-understand example would be having | separate, optimal circuitry for each machine instruction - | the total number of gates is O(N*M), but the number of | gates _activated_ on each clock cycle (and thus the amount | of waste heat generated) is only O(N), so you can keep | adding new, perfectly-hardware-accelerated instructions up | to the limits of physical space. In practice it's more | complicated and less of a non-issue, but it's not "big | circuits are useful in proportion to their surface area, | not their volume"; it's more "big circuits are less useful | than their volume alone would suggest".
(You do hit | physical limits like the Bekenstein bound[0] eventually, | but that's far enough out that we mostly don't care yet.) | | 0: https://en.wikipedia.org/wiki/Bekenstein_bound | nradov wrote: | How are we going to cool those 3D chips? | thechao wrote: | The commenter above is correct: just stop toggling HW. We | already do this to a great extent; we're limited in the | number of custom implementations because we can't wire | everything together. 3D chips will have a lot more "dark" | logic than current chips, but will be orders of magnitude | more efficient (& thus powerful) due to deep customization. | | Also, remember the argument of my timeline is ~50-80 years | out from now. | nradov wrote: | Dark logic isn't doing any useful work, so what's the | point? Sure, you can include specialized circuitry for a | bunch of rare cases, but that won't lift overall system | performance much and will kill manufacturing yields. | yjftsjthsd-h wrote: | > in 2D, layout is NP-hard (complete) without efficient | polynomial approximations; in 3D, layout is low-order | polynomial | | Any chance you could explain to a novice why 3D is easier? To | my naive intuition, it would have seemed like the extra room | to maneuver is offset by having more stuff to route. | thechao wrote: | Try laying out a square with all the vertices connected _in | a plane_: it's not possible. You must move one of the | wires "up" a layer. Which wire? Great layout minimizes | layer transitions while also bunching together related HW | blocks. The NP-completeness proof is related to work | Knuth's student (Plass?) did on laying out images in TeX. | jeffbee wrote: | A delightful set of slides and references that tickles many of my | pet topics. In particular, one I'd love to hear more about is why | so many deployments are still choosing 2-socket servers by | default when managing them is such a pain in the neck and the | performance, when you do it badly, is so poor.
Live the life of the | future, today: choose single sockets! | syoc wrote: | Rack space can be quite expensive. Sometimes you need a lot of | computing power in one or two rack units. | | Would be interested in what the management pains are. I agree | that 2-socket machines require more thought in a lot of | scenarios, especially IO-heavy workloads. | jeffbee wrote: | The OpenCompute "Delta Lake" machine mentioned in the article | occupies only one third of 1RU and peaks at 400W. You will | certainly be power/cooling limited, rather than volume | limited, with that kind of density. | zaroth wrote: | In my very limited experience it seems like space is much | less an issue than power density. | | You can fit far more kW/U than the datacenter can possibly | cool. | | In the commodity space that I rent, I ran out of power before | filling even half the rack. I'm sure higher power/cooling | density is possible to obtain, but I would think you're | primarily paying for that versus square footage? | zozbot234 wrote: | What do you need that power density for? It's a rack, not a | supercomputer. (I sure hope it's not "mining coins" or | anything like that.) | soulbadguy wrote: | In a world where most workloads are containerized, and where | each container can be pinned to a NUMA region, does it really | matter? | wmf wrote: | Does any container runtime/orchestrator perform this | optimization yet? Why wait? | syoc wrote: | Kernel scheduling is NUMA-aware and will localize | workloads. Threads will mostly have their RAM on the sticks | local to their node. The core the thread is delegated to is | also more likely to be the core local to the disk or NIC | being used for IO. | | This is at least my experience, though I am no expert. | solarkennedy wrote: | Titus (Netflix's container orchestrator that I work on) | does this via: https://github.com/Netflix-Skunkworks/titus-isolate | jeffbee wrote: | k8s, by default, is oblivious to NUMA topology.
You have to | enable unreleased features and configure them correctly, | which is the unwanted complexity to which I referred earlier. | Simply aligning your containers to NUMA domains does not | solve the problem that your arriving network frames or your | NVMe completion queues can still be on the wrong domain. | Isn't it simpler to just have 1 socket and not need to care? | The number of cores available on a single-socket system is | pretty high these days, and in general the 1S parts are | cheaper and faster. | eloff wrote: | Yeah, it makes a lot of sense to go with single-socket | servers unless you can't scale horizontally (e.g. a database | server). Why deal with the complexity when you can just | sidestep it? | dragontamer wrote: | Why would you switch from a 100 GB/s NUMA connection (800 | gigabits per second) over NUMA fabric into a 10 Gbps | Ethernet fabric? | | If you are scaling horizontally, NUMA is a superior | fabric to Ethernet or InfiniBand (100 Gbps). | | Horizontal scaling seems to favor NUMA. 1000 chips over | Ethernet is less efficient than 500 dual-socket nodes | over Ethernet. Anything you can do over Ethernet seems | easier and cheaper over NUMA instead. | eloff wrote: | I'm talking mostly about scaling things like app servers, | where they might not need any communication. | | But in general, if you can't scale horizontally at 10 | Gbps, you're in for a world of hurt. NUMA gets you to 8x | scale at best on very expensive, very exotic hardware. And | then you hit the wall. | dragontamer wrote: | I'm mostly talking about 2-socket servers, which are IIRC | more common than even single-socket servers. | | Dual socket is cheap, easy, and common. If only to | recycle fans, power supplies, and racks, it seems useful. | wmf wrote: | This is correct if your software is NUMA-optimized (or if | auto-NUMA works well for you) but if it isn't you can end | up with slowdowns.
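The node pinning debated in this subthread can be sketched in a few lines of Python on Linux using `os.sched_setaffinity` - roughly the CPU half of what a `numactl --cpunodebind` invocation does. The two-node CPU map below is a hypothetical topology for illustration only; a real tool would discover it from `/sys/devices/system/node` or libnuma:

```python
import os

# Hypothetical topology (an assumption for illustration):
# NUMA node 0 owns CPUs 0-7, node 1 owns CPUs 8-15.
NODE_CPUS = {0: set(range(0, 8)), 1: set(range(8, 16))}

def pin_to_node(node: int) -> set:
    """Restrict the current process to one NUMA node's CPUs.

    Memory placement then mostly follows via the kernel's default
    first-touch policy; numactl's --membind has no direct stdlib
    equivalent.
    """
    # Intersect with the CPUs actually available, so the sketch
    # also runs on machines smaller than the assumed topology.
    cpus = NODE_CPUS[node] & os.sched_getaffinity(0)
    if cpus:
        os.sched_setaffinity(0, cpus)
    return os.sched_getaffinity(0)

print(sorted(pin_to_node(0)))
```

This only addresses CPU placement; as noted elsewhere in the thread, NIC interrupt queues and NVMe completions can still land on the other node.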
| dragontamer wrote: | Surely that can be fixed with just a well-placed numactl | command to set node affinity and CPU affinity. | | The root article is discussing rewriting code to fit on | FPGAs. If NUMA is too complex then... I dunno. The FPGA | argument seems dead on arrival. | kortilla wrote: | Your scaling architecture sucks if it depends on that | kind of throughput. If you need that, you've only can- | kicked your way to more capacity without a real scaling | fix. | dragontamer wrote: | Depend on? Heavens no. | | Dual socket has numerous advantages in density and rack | space. The fact that performance is better is pretty much | icing on the cake. | | It's easier to manage 500 dual-socket servers than 1000 | single-socket servers. Less equipment, higher utilization | of parts, etc., etc. | | To suggest dual-socket NUMA is going away is... just very | unlikely to me. I don't see what the benefits would be at | all. Not just performance, but also routine maintenance | issues (power, Ethernet, local storage, etc.) | dragontamer wrote: | FPGAs from Xilinx are very complicated. They are no longer | homogeneous 4-LUTs or 6-LUTs with dedicated multipliers here and | there. | | Today's FPGAs are VLIW minicores capable of SIMD execution with | custom routing and some LUTs thrown around. They've stepped | towards GPU-style architecture while retaining the custom logic | portions. | | FPGAs remain so difficult to use, I find it unlikely that they'd | be mainstream in any capacity. GPUs seem like the easier way to | get access to HBM + heavy compute, but either way the HBM future | is imminent.
| | Since memory bottlenecks remain the biggest issue, and not compute | performance, I bet that the easiest-to-use accelerator with mass | production and cheap access to the highest-speed HBM is going to | be the winner. GPUs are the current frontrunner, but the Fujitsu | ARM CPU has easy access to HBM and could be a wildcard. | | POWER10 will be using high-performance GDDR6. Not quite HBM, but | it signals that IBM is also concerned with the memory bandwidth | problem in the near future. | | CPUs could very well switch to HBM in some scenarios. | | ------------ | | If I were to guess the future: I think that AMD and NVidia have | proven that today's systems need high-speed routers to | practically scale. | | AMD has their IO die on EPYC. NVidia has NVLink and NVSwitch. | That seems to be how to get more dies / sockets without | additional NUMA hops. | | More efficient networks of chips with explicit switching / | routing topologies are the only way to scale. The exact form of | this network is still a mystery, but that's my big bet for the | future. | | HBM is probably the future for high performance. DDR5 for cheaper | bulk RAM, but HBM on high-performance CPUs / GPUs / FPGAs is going | to be key. | | --------- | | The insight into RAM bottlenecks is interesting but seems to be a | point in favor of SMT. If your core is 50% waiting on RAM, then | SMT into another thread to perform work while waiting on RAM. | Robotbeat wrote: | FPGAs are hard to work with in part because the tools are | extremely proprietary. But that is starting to change. Open- | source FPGA tools are becoming more common and more powerful. | volta83 wrote: | > If your core is 50% waiting on RAM, then SMT into another | thread to perform work while waiting on RAM. | | If your core is 50% waiting on RAM, then SMT into another | thread, and that other thread will want some memory to work on, | so it will also wait on RAM.
On top of it, this second thread | now puts extra pressure on the memory subsystem, might cause | cache evictions for the other thread, etc. | | The moment that you include the memory subsystem in the SMT | picture, SMT goes from a "no-brainer; waiting on memory? do | other work" to a "uhhh... I don't know if this makes things | better or worse". | dragontamer wrote: | Not quite. | | DDR4 and DDR5 have 50ns (single socket) to 150ns (dual | socket) latency. | | For a 3GHz processor, that's 150 to 450 cycles. | | On any latency-bound problem, SMT helps. However, what you | say is true on bandwidth-bound problems. Given the sheer | amount of pointer hopping that happens in typical OOP code | these days (or Python / JavaScript), I expect SMT to be of | big help to typical applications. | | DDR5 will double bandwidth in the near future. But that's not | enough: HBM and GDDR6 have a possible future because you can | only solve the bandwidth problem with more hardware. No | tricks like SMT can help. | zozbot234 wrote: | Memory bandwidth is just barely trying to keep up with | cores, frequencies, and IPC amounts. Bandwidth available per | core is still going to drop. So newer development workflows | that optimize for this bottleneck are going to be very | relevant. | dragontamer wrote: | Memory bandwidth does improve at least. | | Latency hasn't improved for the last 30 years. Tricks | like SMT which can help mitigate the latency issue seem | like the way forward. | rbanffy wrote: | I have an enormous respect for Brendan Gregg, but this "one | socket ought to be enough for anyone" is something I saw too many | people get burned with. | | I mean, it should, but who knows what the next version of Slack | will need... | rektide wrote: | Some random contemporary musings that touch some of these | topics: I really hope we have a rad eBPF-based QUIC/HTTP3 | front-end/reverse-proxy router in the next 5 years.
| | QUIC is so exciting and I just want it to be both fast & a | supremely flexible way for a connection from a client to talk to | a host of backend services. We'll definitely see some classic | userland-based approaches emerge, but gee, really hungry for | | For context, I was at the park two days ago, thinking about | replacing a Node timesync[1]-over-websockets thing with an NTP- | over-WebTransport (QUIC) implementation. There weren't any H3 | front-ends (which I kind of need because I just have some random | colo & VPS boxes), and even if there were, I was worried about | adding latency (which a BPF-based solution would significantly | reduce, while letting me re-use ports 80/443). | | Especially as we see more extreme-throughput/HBM memory systems | arrive, it's just so neat that we have a multiplexed transport | protocol. Figuring out how to use that connection (semi-stateless | "connection", because QUIC is awesome) to talk to an array of | services is an ultra-interesting challenge, and BPF sure seems | like the go-to tech for routing & managing packets in the world | today. QUIC, with its multiplexing, adds the complexity that it | is now subpackets that we want to route. I hope we can find a way | to keep a lot of that processing in the kernel. | | [1] https://www.npmjs.com/package/timesync | ksec wrote: | > for storage including new uses for 3D Xpoint as a 3D NAND | accelerator; | | 3D XPoint's future is not entirely certain. Intel, with their new | CEO, has remained rather quiet on the subject. Micron is pulling | the plug on it and sold the fab to Texas Instruments. The problem | is there isn't a clear path forward with the technology; it made | some sense when NAND and DRAM prices were high in 2016 - 2019. | Once they dropped to a normal level, with newer DDR5 and faster | SLC NAND or Z-NAND offering lower latency, XPoint's cost benefits | become unclear. I guess we will know once Intel's Optane P5800X | [1] is out with reviews. It is quite a beast.
| | >Multi-Socket is Doomed | | Are there really no use cases where 128+ cores with NUMA offer | some advantage? | | >Slower Rotational | | Seagate [2] is actually working on dual-actuator HDDs; think of it | as something like internal RAID 0. The rationale being that as HDDs | get bigger, the time to fill up those drives increases as well. | | >ARM on Cloud | | Marvell partly confirms all hyperscalers have the intention to build | their own ARM CPUs. But Google just announced their Tau instances | [3], effectively cutting their cost/perf by 50%, where each | vCPU is an entire physical CPU core rather than an x86 thread. | | Not much mention of GPGPU. | | [1] | https://www.intel.com/content/www/us/en/products/docs/memory... | | [2] https://www.anandtech.com/show/16544/seagates-roadmap-120-tb... | | [3] https://cloud.google.com/blog/products/compute/google-cloud-... | infogulch wrote: | > Are there really no use cases where 128+ cores with NUMA offer | some advantage? | | Are there any use cases where a 128+ core single socket wouldn't | be preferred to a 128+ core multiple-socket design that is | burdened by NUMA? | | AMD has been showing us that integrating the interconnects into | the CPU package directly and letting it handle all the issues | is a better design. | dragontamer wrote: | When a hypothetical 128-core single socket comes out, will | there be no workload that prefers to use a 2x128-core dual | socket instead? | | AMD CPUs remain largely dual-socket compatible. Today's | 64-core EPYCs can be dual-socketed into 2x64-core beasts. | | It just seems silly to me that if you're building, say, 200 | computers in 10x racks (20 computers per 10x 40U racks) that | you'd prefer single socket over dual socket. If you're | scaling up and out so much, what exactly is the problem with | dual socket? It's not cost: dual socket remains cost- | effective on a per-core basis over single socket. Dual | socket cuts the number of computers you need to work with in | half. Etc., etc.
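The stall-cycle figures quoted upthread (50 ns local-node and 150 ns remote-node DRAM latency on a 3 GHz core) are easy to check; this sketch just multiplies latency by clock rate:

```python
def stall_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Core clock cycles spent stalled on one DRAM access:
    nanoseconds of latency times cycles per nanosecond (= GHz)."""
    return latency_ns * clock_ghz

# Figures from the thread: 50 ns (single socket, local node) to
# 150 ns (dual socket, remote node) at a 3 GHz clock.
print(stall_cycles(50, 3.0))   # 150.0
print(stall_cycles(150, 3.0))  # 450.0
```

Hundreds of lost cycles per pointer chase is what makes SMT's latency-hiding argument plausible; whether the second thread actually helps or just contends for bandwidth is exactly the trade-off raised in the SMT subthread above.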
___________________________________________________________________ (page generated 2021-07-05 23:00 UTC)