[HN Gopher] Computing Performance on the Horizon
       ___________________________________________________________________
        
       Computing Performance on the Horizon
        
       Author : mrry
       Score  : 133 points
       Date   : 2021-07-05 14:24 UTC (8 hours ago)
        
 (HTM) web link (brendangregg.com)
 (TXT) w3m dump (brendangregg.com)
        
       | hackermeows wrote:
        | Nice talk, covers a lot of bases. He predicts unikernels are
        | dead, containers will keep growing, and lightweight VMs will
        | take over after that.
        
       | martinpw wrote:
       | Slide 26 is interesting - arguing that cloud providers have an
       | advantage for future CPU design since they can analyze so many
       | real world customer workloads directly.
       | 
        | In previous roles I have worked with CPU vendors who were very
        | keen to get access to profiling data from our workloads for
        | design optimization, and who lamented that such data was hard
        | to get; they were often limited to synthetic benchmark
        | workloads when tuning new designs.
       | 
       | So this argument does sound like a valid one, and does imply AWS
       | etc will have significant advantages in future designs.
        
         | justicezyx wrote:
          | It's already happening. Intel's designs are now significantly
          | influenced by Google's and Amazon's data center needs.
        
         | handrous wrote:
          | An interesting application of a now-familiar pattern: get lots
          | of users, spy on them at massive scale, and use that data to
          | dominate some other market in a way that, at most, a single-
          | digit number of companies in the world could conceivably
          | compete with (because none but they have anything like the
          | data that you do). See also: everything to do with "AI".
        
           | uluyol wrote:
           | Amazon, Microsoft, and Google have massive applications and
           | systems that they run, some of which they sell as a service.
           | They have plenty of workload data without having to poke
           | around user VMs.
        
             | handrous wrote:
             | You disagree with slide 26, then?
        
         | thechao wrote:
          | He sort of implies that "just better hardware" will peter out
          | in the 2030s. I think he's calling it at least 50 years too
          | soon. Here's why: (1) I think logic designers are still
          | faffing about in terms of optimizing their designs; and (2) I
          | think there are a lot of smart people thinking
          | "incrementally" through what we'd consider paradigm shifts in
          | HW implementation. That is, our fabs will just _naturally_
          | segue into 3D, spintronics, etc. I think he even mentions 3D
          | circuits? One thing a lot of people miss is that layout of
          | the design is materially different in 3D vs 2D: in 2D, layout
          | is NP-hard (complete) without efficient polynomial
          | approximations; in 3D, layout is low-order polynomial. The
          | reduction in layout complexity will allow us to design things
          | that are unthinkable right now, due to layout constraints &
          | wire congestion.
        
           | hobofan wrote:
           | Aren't "big" 3D circuits unfeasible due to temperature
           | limitations, though?
        
             | a1369209993 wrote:
             | > Aren't "big" 3D circuits unfeasible due to temperature
             | limitations, though?
             | 
              | Less so than you'd think: as long as you can keep leakage
              | current under control, heat is only generated when the
              | circuits are _active_, i.e. when bits are being flipped.
              | So you can have arbitrarily large amounts of
              | increasingly-rarely-used circuitry for various purposes.
             | 
              | The naive-but-easy-to-understand example would be having
              | separate, optimal circuitry for each machine instruction -
              | the total number of gates is O(N*M), but the number of
              | gates _activated_ on each clock cycle (and thus the amount
              | of waste heat generated) is only O(N), so you can keep
              | adding new, perfectly-hardware-accelerated instructions up
              | to the limits of physical space. In practice it's more
              | complicated and less of a non-issue, but it's not "big
              | circuits are useful in proportion to their surface area,
              | not their volume"; it's more "big circuits are less
              | useful than their volume alone would suggest". (You do
              | hit physical limits like the Bekenstein bound[0]
              | eventually, but that's far enough out that we mostly
              | don't care yet.)
             | 
             | 0: https://en.wikipedia.org/wiki/Bekenstein_bound
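              | 
              | (A rough back-of-the-envelope sketch of the O(N*M)-total
              | vs O(N)-active point above, with completely made-up
              | numbers - the unit count and gates-per-unit figures are
              | hypothetical:)
              | 
              |     /* Toy model: chip area grows with every added
              |        specialized unit, but per-cycle switching (and
              |        thus heat) only with the one unit in use. */
              |     #include <stdio.h>
              | 
              |     int main(void) {
              |         long units = 1000;             /* hypothetical */
              |         long gates_per_unit = 100000;  /* hypothetical */
              |         /* O(N*M) total vs O(N) toggled per cycle */
              |         long total  = units * gates_per_unit;
              |         long active = gates_per_unit;
              |         printf("total %ld, active/cycle %ld\n",
              |                total, active);
              |         return 0;
              |     }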
        
           | nradov wrote:
           | How are we going to cool those 3D chips?
        
             | thechao wrote:
             | The commenter above is correct: just stop toggling HW. We
             | already do this to a great extent; we're limited in the
             | number of custom implementations because we can't wire
             | everything together. 3D chips will have a lot more "dark"
             | logic than current chips, but will be orders-of-magnitude
             | more efficient (& thus powerful) due to deep customization.
             | 
              | Also, remember that my argument is on a timeline of
              | ~50-80 years out from now.
        
               | nradov wrote:
               | Dark logic isn't doing any useful work so what's the
               | point? Sure you can include specialized circuitry for a
               | bunch of rare cases, but that won't lift overall system
               | performance much and will kill manufacturing yields.
        
           | yjftsjthsd-h wrote:
           | > in 2D layout is NP-hard (complete) without efficient
           | polynomial approximations; in 3D layout is low-order
           | polynomial
           | 
            | Any chance you could explain to a novice why 3D is easier?
            | To my naive intuition, it would have seemed like the extra
            | room to maneuver is offset by having more stuff to route.
        
             | thechao wrote:
              | Try laying out a square with all four vertices connected
              | to each other with straight wires _in a plane_: it's not
              | possible - the two diagonals have to cross. You must move
              | one of the wires "up" a layer. Which wire? Great layout
              | minimizes layer transitions while also bunching together
              | related HW blocks. The NP-completeness proof is related
              | to work Knuth's student (Plass?) did on laying out images
              | in TeX.
        
       | jeffbee wrote:
       | A delightful set of slides and references that tickles many of my
       | pet topics. In particular, one I'd love to hear more about is why
       | so many deployments are still choosing 2-socket servers by
       | default when managing them is such a pain in the neck and the
       | performance when you do it badly is so poor. Live the life of the
       | future, today: choose single sockets!
        
         | syoc wrote:
         | Rack space can be quite expensive. Sometimes you need a lot of
         | computing power in one or two rack units.
         | 
          | Would be interested in what the management pains are. I agree
          | that 2-socket machines require more thought in a lot of
          | scenarios, especially for IO-heavy workloads.
        
           | jeffbee wrote:
           | The OpenCompute "Delta Lake" machine mentioned in the article
           | occupies only one third of 1RU and peaks at 400W. You will
           | certainly be power/cooling limited, rather than volume
           | limited, with that kind of density.
        
           | zaroth wrote:
           | In my very limited experience it seems like space is much
           | less an issue than power density.
           | 
           | You can fit far more kW/U than the datacenter can possibly
           | cool.
           | 
           | In the commodity space that I rent, I ran out of power before
           | filling even half the rack. I'm sure higher power/cooling
           | density is possible to obtain, but I would think you're
           | primarily paying for that versus square footage?
        
             | zozbot234 wrote:
             | What do you need that power density for? It's a rack, not a
             | supercomputer. (I sure hope it's not "mining coins" or
             | anything like that.)
        
         | soulbadguy wrote:
          | In a world where most workloads are containerized, and where
          | each container can be pinned to a NUMA region, does it really
          | matter?
        
           | wmf wrote:
           | Does any container runtime/orchestrator perform this
           | optimization yet? Why wait?
        
             | syoc wrote:
              | Kernel scheduling is NUMA-aware and will localize
              | workloads. Threads will mostly have their RAM on the
              | sticks local to their node. The core a thread is
              | scheduled on is also more likely to be one local to the
              | disk or NIC being used for IO.
             | 
             | This is at least my experience, though I am no expert.
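              | 
              | (If you want to see where the kernel has actually put a
              | thread - which CPU it's on and which NUMA node that CPU
              | belongs to - something like this works; a sketch using
              | libnuma, built with -lnuma:)
              | 
              |     #define _GNU_SOURCE
              |     #include <numa.h>   /* libnuma */
              |     #include <sched.h>
              |     #include <stdio.h>
              | 
              |     int main(void) {
              |         if (numa_available() < 0) {
              |             printf("no NUMA support here\n");
              |             return 0;
              |         }
              |         /* CPU we are running on right now, and
              |            the NUMA node that CPU belongs to */
              |         int cpu  = sched_getcpu();
              |         int node = numa_node_of_cpu(cpu);
              |         printf("cpu %d is on node %d (of %d)\n",
              |                cpu, node,
              |                numa_num_configured_nodes());
              |         return 0;
              |     }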
        
             | solarkennedy wrote:
             | Titus (Netflix's container orchestrator that I work on)
              | does this via:
              | https://github.com/Netflix-Skunkworks/titus-isolate
        
           | jeffbee wrote:
           | k8s, by default, is oblivious to NUMA topology. You have to
           | enable unreleased features and configure them correctly,
           | which is the unwanted complexity to which I referred earlier.
           | Simply aligning your containers to NUMA domains does not
           | solve the problem that your arriving network frames or your
           | NVMe completion queues can still be on the wrong domain.
           | Isn't it simpler to just have 1 socket and not need to care?
           | The number of cores available on a single socket system is
           | pretty high these days, and in general the 1S parts are
           | cheaper and faster.
        
             | eloff wrote:
              | Yeah, it makes a lot of sense to go with single-socket
              | servers unless you can't scale horizontally (e.g. a
              | database server). Why deal with the complexity when you
              | can just sidestep it?
        
               | dragontamer wrote:
                | Why would you switch from a 100 GB/s (800 gigabits per
                | second) NUMA fabric to a 10 Gbps Ethernet fabric?
                | 
                | If you are scaling horizontally, NUMA is a superior
                | fabric to Ethernet or InfiniBand (100 Gbps).
                | 
                | Horizontal scaling seems to favor NUMA. 1000 chips over
                | Ethernet is less efficient than 500 dual-socket nodes
                | over Ethernet. Anything you can do over Ethernet seems
                | easier and cheaper over NUMA instead.
        
               | eloff wrote:
                | I'm talking mostly about scaling things like app
                | servers, where they might not need any communication.
                | 
                | But in general, if you can't scale horizontally at 10
                | Gbps, you're in for a world of hurt. NUMA gets you to
                | 8x scale at best, on very expensive, very exotic
                | hardware. And then you hit the wall.
        
               | dragontamer wrote:
                | I'm mostly talking about 2-socket servers, which are
                | IIRC more common than even single-socket servers.
                | 
                | Dual socket is cheap, easy, and common. If only to
                | recycle fans, power supplies, and racks, it seems
                | useful.
        
               | wmf wrote:
               | This is correct if your software is NUMA-optimized (or if
               | auto-NUMA works well for you) but if it isn't you can end
               | up with slowdowns.
        
               | dragontamer wrote:
                | Surely that can be fixed with just a well-placed
                | numactl command to set node affinity and CPU affinity.
                | 
                | The root article discusses rewriting code to fit on
                | FPGAs. If NUMA is too complex, then... I dunno. The
                | FPGA argument seems dead on arrival.
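                | 
                | (For reference, a minimal sketch of that pinning - on
                | the command line, or the libnuma equivalent from code,
                | built with -lnuma; node 0 is just an example:)
                | 
                |     /* shell: numactl --cpunodebind=0 \
                |               --membind=0 ./app       */
                |     #include <numa.h>
                |     #include <stdio.h>
                | 
                |     int main(void) {
                |         if (numa_available() < 0)
                |             return 1;
                |         /* run only on node 0's CPUs */
                |         numa_run_on_node(0);
                |         /* prefer node 0 for memory  */
                |         numa_set_preferred(0);
                |         /* ... run the workload ...  */
                |         printf("pinned to node 0\n");
                |         return 0;
                |     }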
        
               | kortilla wrote:
                | Your scaling architecture sucks if it depends on that
                | kind of throughput. If you need that, you've only can-
                | kicked your way to more capacity without a real scaling
                | fix.
        
               | dragontamer wrote:
               | Depend on? Heavens no.
               | 
               | Dual socket has numerous advantages in density and rack
               | space. The fact that performance is better is pretty much
               | icing on the cake.
               | 
                | It's easier to manage 500 dual-socket servers than 1000
                | single-socket servers. Less equipment, higher
                | utilization of parts, etc., etc.
                | 
                | To suggest dual-socket NUMA is going away seems... just
                | very unlikely to me. I don't see what the benefits
                | would be at all. Not just performance, but also routine
                | maintenance issues (power, Ethernet, local storage,
                | etc.).
        
       | dragontamer wrote:
        | FPGAs from Xilinx are very complicated. They are no longer
        | homogeneous 4-LUTs or 6-LUTs with dedicated multipliers here
        | and there.
       | 
        | Today's FPGAs are VLIW minicores capable of SIMD execution with
        | custom routing and some LUTs thrown around. They've stepped
        | towards a GPU-style architecture while retaining the custom
        | logic portions.
        | 
        | FPGAs remain so difficult to use that I find it unlikely they'd
        | be mainstream in any capacity. GPUs seem like the easier way to
        | get access to HBM + heavy compute, but either way the HBM
        | future is imminent.
       | 
       | ------------
       | 
        | GPUs have big questions about ease of use and practicality as
        | it is, even with widespread acceptance of their compute
        | potential. FPGAs are much less well known; it's hard for me to
        | imagine a mainstream future for them.
        | 
        | Since memory, not compute performance, remains the biggest
        | bottleneck, I bet that the easiest-to-use accelerator with mass
        | production and cheap access to the highest-speed HBM is going
        | to be the winner. GPUs are the current frontrunner, but the
        | Fujitsu ARM CPU has easy access to HBM and could be a wildcard.
       | 
        | POWER10 will be using high-performance GDDR6. Not quite HBM,
        | but it signals that IBM is also concerned about the memory
        | bandwidth problem in the near future.
       | 
       | CPUs could very well switch to HBM in some scenarios.
       | 
       | ------------
       | 
        | If I were to guess the future: I think that AMD and NVidia
        | have proven that today's systems need high-speed routers to
        | scale practically.
       | 
       | AMD has their IO die on EPYC. NVidia has NVLink and NVSwitch.
       | That seems to be how to get more dies / sockets without
       | additional NUMA hops.
       | 
        | More efficient networks of chips, with explicit switching /
        | routing topologies, are the only way to scale. The exact form
        | of this network is still a mystery, but that's my big bet for
        | the future.
       | 
        | HBM is probably the future for high performance. DDR5 will
        | serve for cheaper bulk RAM, but HBM on high-performance CPUs /
        | GPUs / FPGAs is going to be key.
       | 
       | ---------
       | 
        | The insight into RAM bottlenecks is interesting but seems to be
        | a point in favor of SMT. If your core is 50% waiting on RAM,
        | then SMT into another thread to perform work while waiting on
        | RAM.
        
         | Robotbeat wrote:
         | FPGAs are hard to work with in part because the tools are
         | extremely proprietary. But that is starting to change. Open
         | source FPGA tools are becoming more common and more powerful.
        
         | volta83 wrote:
         | > If your core is 50% waiting on RAM, then SMT into another
         | thread to perform work while waiting on RAM.
         | 
          | If your core is 50% waiting on RAM, then SMT into another
          | thread, and that other thread will want some memory to work
          | on, so it will also wait on RAM. On top of that, this second
          | thread now puts extra pressure on the memory subsystem, might
          | cause cache evictions for the other thread, etc.
          | 
          | The moment you include the memory subsystem in the SMT
          | picture, SMT goes from a "no-brainer: waiting on memory? do
          | other work" to a "uhhh... I don't know if this makes things
          | better or worse".
        
           | dragontamer wrote:
           | Not quite.
           | 
           | DDR4 and DDR5 have 50ns (single socket) to 150ns (dual
           | socket) latency.
           | 
           | For a 3GHz processor, that's 150 to 450 cycles.
           | 
            | On any latency-bound problem, SMT helps. However, what you
            | say is true of bandwidth-bound problems. Given the sheer
            | amount of pointer hopping that happens in typical OOP code
            | these days (or Python / JavaScript), I expect SMT to be a
            | big help to typical applications.
            | 
            | DDR5 will double bandwidth in the near future. But that's
            | not enough: HBM and GDDR6 have a possible future because
            | you can only solve the bandwidth problem with more
            | hardware. Tricks like SMT can't help there.
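            | 
            | (A toy way to see how latency-bound pointer hopping is -
            | each load depends on the previous one, so the core mostly
            | sits waiting on DRAM, which is exactly the idle time a
            | second SMT thread could use. Sketch only, not a rigorous
            | benchmark:)
            | 
            |     #include <stdio.h>
            |     #include <stdlib.h>
            |     #include <time.h>
            | 
            |     #define N (1u << 24)  /* far bigger than the LLC */
            | 
            |     int main(void) {
            |         size_t *next = malloc(N * sizeof *next);
            |         if (!next) return 1;
            |         for (size_t i = 0; i < N; i++) next[i] = i;
            |         /* Sattolo shuffle: one big cycle, so the
            |            prefetcher can't guess the next address */
            |         srand(42);
            |         for (size_t i = N - 1; i > 0; i--) {
            |             size_t j = (size_t)rand() % i;
            |             size_t t = next[i];
            |             next[i] = next[j];
            |             next[j] = t;
            |         }
            |         struct timespec a, b;
            |         clock_gettime(CLOCK_MONOTONIC, &a);
            |         size_t p = 0;
            |         /* serially dependent loads */
            |         for (size_t i = 0; i < N; i++) p = next[p];
            |         clock_gettime(CLOCK_MONOTONIC, &b);
            |         double ns = (b.tv_sec - a.tv_sec) * 1e9
            |                   + (b.tv_nsec - a.tv_nsec);
            |         printf("~%.0f ns per load (p=%zu)\n",
            |                ns / N, p);
            |         return 0;
            |     }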
        
             | zozbot234 wrote:
              | Memory bandwidth is only barely keeping up with core
              | counts, frequencies, and IPC. Bandwidth available per
              | core is still going to drop. So newer development
              | workflows that optimize for this bottleneck are going to
              | be very relevant.
        
               | dragontamer wrote:
                | Memory bandwidth does improve, at least.
                | 
                | Latency hasn't improved for the last 30 years. Tricks
                | like SMT, which can help mitigate the latency issue,
                | seem like the way forward.
        
       | rbanffy wrote:
        | I have enormous respect for Brendan Gregg, but this "one
        | socket ought to be enough for anyone" is something I've seen
        | too many people get burned by.
        | 
        | I mean, it should be, but who knows what the next version of
        | Slack will need...
        
       | rektide wrote:
        | Some random contemporary musings that touch on some of these
        | topics: I really hope we have a rad eBPF-based QUIC/HTTP3
        | front-end/reverse-proxy router in the next 5 years.
        | 
        | QUIC is so exciting and I just want it to be both fast & a
        | supremely flexible way for a connection from a client to talk
        | to a host of backend services. We'll definitely see some
        | classic userland-based approaches emerge, but gee, I'm really
        | hungry for an in-kernel, eBPF-based one.
       | 
        | For context, I was at the park two days ago, thinking about
        | replacing a Node timesync[1]-over-websockets thing with an NTP-
        | over-WebTransport (QUIC) implementation. There weren't any H3
        | front-ends (which I kind of need because I just have some
        | random colo & VPS boxes), and even if there were, I was worried
        | about adding latency (which a BPF-based solution would
        | significantly reduce, while letting me re-use ports 80/443).
       | 
        | Especially as we see more extreme-throughput/HBM memory
        | systems arrive, it's just so neat that we have a multiplexed
        | transport protocol. Figuring out how to use that connection
        | (semi-stateless "connection", because QUIC is awesome) to talk
        | to an array of services is an ultra-interesting challenge, and
        | BPF sure seems like the go-to tech for routing & managing
        | packets in the world today. QUIC, with its multiplexing, adds
        | the complexity that it is now subpackets that we want to
        | route. I hope we can find a way to keep a lot of that
        | processing in the kernel.
       | 
       | [1] https://www.npmjs.com/package/timesync
        
       | ksec wrote:
       | >for storage including new uses for 3D Xpoint as a 3D NAND
       | accelerator;
       | 
        | 3D XPoint's future is not entirely certain. Intel, with their
        | new CEO, has remained rather quiet on the subject. Micron is
        | pulling the plug on it and sold the fab to Texas Instruments.
        | The problem is there isn't a clear path forward for the
        | technology; it made some sense when NAND and DRAM prices were
        | high in 2016 - 2019. Once they dropped to a normal level, with
        | newer DDR5 and faster SLC NAND or Z-NAND offering lower
        | latency, XPoint's cost benefits become unclear. I guess we will
        | know once Intel's Optane P5800X [1] is out and reviewed. It is
        | quite a beast.
       | 
       | >Multi-Socket is Doomed
       | 
        | Are there really no use cases where 128+ cores with NUMA offer
        | some advantage?
       | 
       | >Slower Rotational
       | 
        | Seagate [2] is actually working on dual-actuator HDDs; think
        | of it as something like internal RAID 0. The rationale being
        | that as HDDs get bigger, the time to fill up those drives
        | increases as well.
       | 
       | >ARM on Cloud
       | 
        | Marvell partly confirms that all hyperscalers intend to build
        | their own ARM CPUs. But Google just announced their Tau
        | instances [3], effectively cutting their cost/perf by 50%,
        | where each vCPU is an entire physical CPU core rather than an
        | x86 thread.
        | 
        | Not much mention of GPGPU.
       | 
       | [1]
       | https://www.intel.com/content/www/us/en/products/docs/memory...
       | 
       | [2] https://www.anandtech.com/show/16544/seagates-
       | roadmap-120-tb...
       | 
       | [3] https://cloud.google.com/blog/products/compute/google-
       | cloud-...
        
         | infogulch wrote:
         | > Are there really no use-case where 128 Core+ with NUMA offer
         | some advantage?
         | 
         | Are there any use cases where 128+ core single socket wouldn't
         | be preferred to a 128+ core multiple socket design that is
         | burdened by NUMA?
         | 
         | AMD has been showing us that integrating the interconnects into
         | the CPU package directly and letting it handle all the issues
         | is a better design.
        
           | dragontamer wrote:
           | When a hypothetical 128-core single socket comes out, will
           | there be no workload that prefers to use a 2x128-core dual
           | socket instead?
           | 
           | AMD CPUs remain largely dual-socket compatible. Today's
           | 64-core EPYCs can be dual-socketed into 2x64-core beasts.
           | 
            | It just seems silly to me that if you're building, say, 200
            | computers across 10 racks (20 computers per 40U rack),
            | you'd prefer single-socket over dual-socket. If you're
            | scaling up and out so much, what exactly is the problem
            | with dual socket? It's not cost: dual-socket remains cost-
            | effective on a per-core basis over single-socket. Dual
            | socket cuts the number of computers you need to work with
            | in half. Etc., etc.
        
       ___________________________________________________________________
       (page generated 2021-07-05 23:00 UTC)