[HN Gopher] Intel Xeon Max 9480 Deep-Dive 64GB HBM2e Onboard Lik...
       ___________________________________________________________________
        
       Intel Xeon Max 9480 Deep-Dive 64GB HBM2e Onboard Like a GPU or AI
       Accelerator
        
       Author : PaulHoule
       Score  : 110 points
       Date   : 2023-09-19 15:00 UTC (2 days ago)
        
 (HTM) web link (www.servethehome.com)
 (TXT) w3m dump (www.servethehome.com)
        
       | Aissen wrote:
       | In caching mode, this is effectively a 64GB L4. Very impressive!
        | (AMD's biggest offering, Genoa-X, has 1.15GB of L3.)
        
         | undersuit wrote:
          | Point of order! It's not an L4, it's a RAM cache. Data evicted
          | from L1-L3 isn't stored there; only what you have written to
          | or read from RAM is.
         | 
         | Your working set of data won't spill out from the L3 to "the
         | L4" when it grows too large.
        
           | Aissen wrote:
           | I'm not sure I understand the difference. Are we both talking
           | about the "HBM Caching Mode" on this slide:
            | https://www.servethehome.com/intel-xeon-max-9480-deep-dive-i... ?
        
             | undersuit wrote:
              | Memory caching, AFAIK, exists on a different side of the
              | explicit load and store instructions than CPU caching
              | does. It introduces subtle issues. You can see on page 4
              | that a number of cache-friendly benchmarks show little
              | benefit from the HBM caching:
              | https://www.servethehome.com/wp-content/uploads/2023/09/Inte...
             | 
              | 64GB of HBM2e as RAM is more performant than 128GB of DDR5
              | with an HBM2e cache, and often the cached variant shows no
              | speedup over a standard Intel configuration.
             | 
              | Also, OpenFOAM loves cache:
              | https://www.phoronix.com/benchmark/result/amd_ryzen_7_5800x3...
        
               | Aissen wrote:
               | I'd be curious to know more about the subtle issues
               | (which I don't doubt there might be!).
               | 
                | IMHO those results don't contradict what I said. Of
                | course, if the workload entirely fits in the 64GB HBM,
                | there's no point in using it as a cache; just use it
                | directly. But if you need to address more RAM (any big
                | DB, FS, etc.), and you don't want to manage the tiers
                | manually, then the caching mode could shine.
        
       | bluedino wrote:
       | Our Dell rep really talked this up a few months ago, but didn't
       | have any benchmarks.
       | 
        | Impressed with the OpenFOAM results, as that's a typical
        | workload for our users. However, the AMD system is basically
        | equal.
        
       | mikeInAlaska wrote:
        | What a beast. It doesn't apply much to my life, but I did notice
        | one thing that would have been nice on several builds in the
        | past: the torque for the heatsink screws is specified as an
        | exact lb-ft figure.
        | 
        | I have done builds in the past where you had to judge: tight
        | enough? I found it unsettling. My last couple of builds had more
        | of a cam-lock feel to the heatsink mount, though, where it was
        | tightened to a point with an obvious force-threshold stop.
        
         | mrob wrote:
         | Noctua heatsinks include maximum torque specifications in the
         | instruction manuals, in Nm. I expected this to be standard, but
         | I checked the manuals of popular high-end models from other
         | manufacturers (Cooler Master, DeepCool, Be Quiet, Akasa), and I
         | was unable to find torque specifications.
        
         | omneity wrote:
         | How can you make use of this info? Like which tool allows you
         | to tighten down to a specific torque?
        
           | bogdanstanciu wrote:
            | Torque wrenches let you set a max torque; they spin
            | freely and no longer tighten after they reach it.
        
           | mikeInAlaska wrote:
           | I have a Klein Torque screwdriver. Holy cow have they become
           | expensive in the decade since I bought it.
        
           | mrob wrote:
           | Torque screwdriver. You can get digital motor-driven ones,
           | and ones that work like torque wrenches, with a clutch that
           | releases once you exceed the torque setting. Torque
           | screwdrivers have a limited range of supported torques, and
           | computer heatsink fasteners generally need fairly low torque,
           | so be sure to get one that goes low enough.
        
           | mkaic wrote:
           | I just completed a few AMD Threadripper-based builds at work
           | and was pleasantly surprised to see that each CPU actually
           | shipped with its own torque-screwdriver tuned to the specific
           | torque needed to install the chip.
        
       | yieldcrv wrote:
       | we just saying AI accelerator now?
        
         | sp332 wrote:
         | If your Graphics Processing Unit isn't actually processing any
         | graphics, it seems like a better name. No one is gaming on an
         | A100 (although now I want to see that!)
        
           | circuit10 wrote:
           | I think Linus Tech Tips tried
        
       | aquir wrote:
       | ESET Endpoint Security is blocking the site: JS/Agent.RAN threat
       | found.
        
       | [deleted]
        
       | danielovichdk wrote:
        | Funnily enough, no matter how big these things get, they never
        | seem to make me a lot more productive.
        | 
        | If only my CI build time would go down.
        | 
        | Software is so slow compared to hardware. It's embarrassing that
        | we haven't moved even a hundredth as far as hardware has in the
        | last 30 years.
        | 
        | Why get this?
        
         | redox99 wrote:
          | > Funnily enough, no matter how big these things get, they
          | never seem to make me a lot more productive.
         | 
         | Maybe because the stuff you do isn't bottlenecked by compute?
         | In my case every hardware upgrade resulted in a big
         | productivity improvement.
         | 
         | Better CPU: Cut my C++ build times in half (from 10 to 5
         | minutes if I change an important .h)
         | 
         | Better GPUs: Cut my AI training time by a few X, massively
         | improving iteration times. Also allow me to run bigger models
         | that more easily reach my target accuracy.
        
         | switchbak wrote:
         | I get your take here, but as someone who's worked very hard at
         | times to optimize builds (amongst other things), the business
         | just generally doesn't respect those efforts and certainly
          | doesn't reward them. Oftentimes they're actively punished,
          | with a reflexive assumption that they're not "serious" efforts
          | worth the business's time. (There's the odd exception, but
          | this is very widespread in my experience.)
         | 
          | Sure, there's a balance to be struck between cutting wood and
          | sharpening the saw. But who do we blame when the boss-man
          | won't allow anyone to sharpen the tools even though we're
          | obviously wasting outrageous amounts of time? The people who
          | won't allow those investments to be made.
         | 
         | When you multiply that across an entire industry, add some
         | trendy fashionable tech (that's also just fast-enough to be
         | tolerable), and this is how we end up in the shitty
         | circumstance you describe.
         | 
         | And yet I still wouldn't trade my fancy IDE and slow CI
         | pipelines for a copy of Turbo Pascal 7, as fast as it would be!
        
         | jiggawatts wrote:
         | People get set in their ways. Sometimes entire industries need
         | a shake-up.
         | 
         | This never occurs voluntarily, and people will wail and thrash
         | about even as you try to help them get out of their rut.
         | 
         | Builds being slow is one of my pet peeves also. Modern "best
         | practices" are absurdly wasteful of the available computer
         | power, but because everyone does it the same way, nobody seems
         | to accept that it can be done differently.
         | 
          | A typical modern CI/CD pipeline is like a smorgasbord of
          | worst-case scenarios for performance. Let me list just _some_
         | of them:
         | 
         | - Everything is typically done from scratch, with minimal or no
         | caching.
         | 
         | - Synchronous I/O from a single thread, often to a remote
         | cloud-hosted replicated disk... for an ephemeral build job.
         | 
         | - Tens of thousands of tiny files, often smaller than the
         | physical sector size.
         | 
         | - Layers upon layers of virtualisation.
         | 
          | - Many small HTTP downloads, from a single thread with no
          | pipelining. Often un-cached despite being identified by stable
          | identifiers -- and hence safely cacheable forever (see the
          | sketch after this list).
         | 
         | - Spinning up giant, complicated, multi-process workflows for
         | trivial tasks such as file copies. (CD agent -> shell -> cp
         | command) Bonus points for generating more kilobytes of logs
         | than the kilobytes of files processed.
         | 
         | - Repeating the same work over and over (C++ header
         | compilation).
         | 
         | - Generating reams of code only to shrink it again through
         | expensive processes or just throw it away (Rust macros).
         | 
         | I could go on, but it's too painful...
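          | 
          | (The sketch promised above: a minimal Go version of a
          | content-addressed download cache, keyed on the stable
          | identifier. The cache directory and URL are made up.)
          | 
          |     package main
          |     
          |     import (
          |         "crypto/sha256"
          |         "encoding/hex"
          |         "fmt"
          |         "io"
          |         "net/http"
          |         "os"
          |         "path/filepath"
          |     )
          |     
          |     // fetch downloads url once and serves every later
          |     // request from disk. The key is derived from the
          |     // stable identifier (the URL itself), so repeat
          |     // builds never touch the network.
          |     func fetch(cacheDir, url string) (string, error) {
          |         sum := sha256.Sum256([]byte(url))
          |         path := filepath.Join(cacheDir,
          |             hex.EncodeToString(sum[:]))
          |         if _, err := os.Stat(path); err == nil {
          |             return path, nil // cache hit
          |         }
          |         resp, err := http.Get(url)
          |         if err != nil {
          |             return "", err
          |         }
          |         defer resp.Body.Close()
          |         tmp := path + ".partial"
          |         f, err := os.Create(tmp)
          |         if err != nil {
          |             return "", err
          |         }
          |         _, err = io.Copy(f, resp.Body)
          |         f.Close()
          |         if err != nil {
          |             return "", err
          |         }
          |         // Atomic rename: concurrent jobs never see a
          |         // half-written artifact.
          |         return path, os.Rename(tmp, path)
          |     }
          |     
          |     func main() {
          |         os.MkdirAll("/tmp/dlcache", 0o755)
          |         p, err := fetch("/tmp/dlcache",
          |             "https://example.com/dep-1.2.3.tar.gz")
          |         if err != nil {
          |             panic(err)
          |         }
          |         fmt.Println("cached at", p)
          |     }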
        
           | jrockway wrote:
           | A few weeks ago, I decided to lock myself in my apartment and
           | write a build system. I have always felt like developers wait
           | ages for Docker, and I wanted to see why. (Cycle times on the
           | app I develop are a minute for my crazy combination of shell
           | scripts that I use, or up to 5 minutes for what most of my
           | teammates do. This is incomprehensibly unproductive.)
           | 
           | It turns out, it's all super crazy at every level. Things
           | like Docker use _incredibly_ slow algorithms like SHA256 and
           | gzip by default. For example, it takes 6 seconds to gzip a
           | 150MB binary, while zstd -fast=3 achieves the same ratio and
           | does it in 100 milliseconds! The OCI image spec allows
           | Zstandard compression, so this is something you can just do
           | to save build time and container startup time. (gzip,
           | unsurprisingly, is not a speed demon when decompressing
           | either.) SHA256, used everywhere in the OCI ecosystem, is
           | also glacial; a significant amount of CPU used by starting or
           | building containers is just running this algorithm. Blake3 is
           | 17 times faster! (Blake2b, a fast and more-trusted hash than
            | blake3, is about 6x faster.) But unfortunately, Docker/OCI
            | only supports SHA256, so you are stuck waiting every time you
            | build or pull a container. (On the building side, you
            | actually have to compute layer SHA256s twice: once for the
            | compressed data and once for the uncompressed data. I don't
            | know what happens if you don't do this; I just filled in
            | every field the way the standard mandated and things worked.)
           | 
            | This was on HN a couple of years ago and was a real
            | eye-opener for me:
           | https://jolynch.github.io/posts/use_fast_data_algorithms/
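            | 
            | (If anyone wants to reproduce this, a minimal Go sketch of
            | the timing and the double-hashing, assuming the third-party
            | github.com/klauspost/compress/zstd package; absolute
            | numbers will obviously vary by machine.)
            | 
            |     package main
            |     
            |     import (
            |         "bytes"
            |         "compress/gzip"
            |         "crypto/sha256"
            |         "fmt"
            |         "os"
            |         "time"
            |     
            |         "github.com/klauspost/compress/zstd"
            |     )
            |     
            |     func main() {
            |         blob, _ := os.ReadFile(os.Args[1])
            |     
            |         // OCI wants the digest of the uncompressed
            |         // layer (the DiffID)...
            |         diffID := sha256.Sum256(blob)
            |     
            |         // gzip, the Docker default.
            |         t := time.Now()
            |         var gz bytes.Buffer
            |         gw := gzip.NewWriter(&gz)
            |         gw.Write(blob)
            |         gw.Close()
            |         fmt.Println("gzip:", time.Since(t))
            |     
            |         // zstd at its fastest level.
            |         t = time.Now()
            |         var zs bytes.Buffer
            |         zw, _ := zstd.NewWriter(&zs,
            |             zstd.WithEncoderLevel(zstd.SpeedFastest))
            |         zw.Write(blob)
            |         zw.Close()
            |         fmt.Println("zstd:", time.Since(t))
            |     
            |         // ...and the digest of the compressed blob,
            |         // so each layer is hashed twice.
            |         layer := sha256.Sum256(zs.Bytes())
            |         fmt.Printf("%x\n%x\n", diffID, layer)
            |     }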
           | 
           | There are also things that Dockerfiles preclude, like
           | building each layer in parallel. I don't use layers or shell
           | commands for anything; I just put binaries into a layer. With
           | a builder that doesn't use Dockerfiles, you can build all the
           | layers in parallel and push some of the earlier layers while
           | the later ones are building. (One of the reasons I wrote my
           | own image assembler is because we produce builds for each
           | architecture. The build machine has to run an arm64 qemu
           | emulator so that a Dockerfile-based build can run `[` to
           | select the right third-party binary to extract. This is crazy
           | to me; the decision is static and unchanging, so no code
           | needs to be run. But I know that it's designed for stuff like
           | "FROM debian; RUN apt-get update; RUN apt-get upgrade" which
           | is ... not needed for anything I do.)
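            | 
            | (A sketch of what the parallel layer build can look like,
            | using golang.org/x/sync/errgroup. buildLayer and pushLayer
            | are hypothetical stand-ins for a real image assembler, not
            | anything from Docker's API.)
            | 
            |     package main
            |     
            |     import (
            |         "fmt"
            |     
            |         "golang.org/x/sync/errgroup"
            |     )
            |     
            |     type layer struct{ digest string }
            |     
            |     // Hypothetical: tar, compress and hash one
            |     // directory into a layer blob.
            |     func buildLayer(dir string) (layer, error) {
            |         return layer{}, nil
            |     }
            |     
            |     // Hypothetical: upload one blob. Early layers
            |     // can be pushing while later ones still build.
            |     func pushLayer(l layer) error { return nil }
            |     
            |     func buildImage(dirs []string) error {
            |         var g errgroup.Group
            |         for _, dir := range dirs {
            |             dir := dir // capture loop variable
            |             g.Go(func() error {
            |                 l, err := buildLayer(dir)
            |                 if err != nil {
            |                     return err
            |                 }
            |                 return pushLayer(l)
            |             })
            |         }
            |         return g.Wait() // then push the manifest
            |     }
            |     
            |     func main() {
            |         err := buildImage([]string{"bin", "etc"})
            |         fmt.Println("built, err =", err)
            |     }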
           | 
           | The other thing that surprises me about pushing images is how
           | slow a localhost->localhost container push is. I haven't
           | looked into why because the standard registry code makes me
           | cry, but I plan to just write a registry that stores blobs on
           | disk, share the disk between the build environment and the
           | k8s cluster (hostpath provisioner or whatever), and have the
           | build system just write the artifacts it's building into that
           | directory; thus there is no push step required. When the
           | build is complete, the artifacts are available for k8s to
           | "pull".
           | 
           | The whole thing is a work in progress, but with a week of
           | hacking I got the cycle time down from 1 minute to about 5
           | seconds, and many more improvements are available.
           | (Eventually I plan to build everything with Bazel and remote
           | execution, but I needed the container builder piece for
           | multi-architecture releases; Bazel will have to be invoked
           | independently for each architecture because of its design,
           | and then the various artifacts have to be assembled into the
           | final image list.)
        
             | oconnor663 wrote:
             | > Blake3 is 17 times faster! (Blake2b, a fast and more-
             | trusted hash than blake3, is about 6x faster.)
             | 
             | I'm a little surprised to see that 6x figure. Just going
             | off the red bar chart at blake2.net, I wouldn't expect to
             | see much more than a 2x difference, unless you're measuring
             | a sub-optimal SHA256 implementation. And recent x86 CPUs
             | have hardware acceleration for SHA256, which makes it
             | faster than BLAKE2b. But those CPUs also have wide vector
             | registers and lots of cores, so BLAKE3's relative advantage
             | tends to grow even as BLAKE2b falls behind.
             | 
             | But in any case, yes, builds and containers tend to be
             | great use cases for BLAKE3. You've got big files that are
             | getting hashed over and over, and they're likely to be in
             | cache. An expensive AWS machine can hit crazy numbers like
             | 100 GB/s on that sort of workload, where the bottleneck
             | ends up being memory bandwidth rather than CPU speed.
        
               | jrockway wrote:
               | Yeah, I'm not using any special implementation of sha256.
               | Just crypto/sha256 from the standard library. It's also
               | worth noting that I'm testing on a zen2 chip. I have read
               | a lot of papers saying that sha2 is much faster because
               | CPUs have special support; didn't see it in practice. (No
               | AVX-512 here, but I believe those algorithms predate
               | AVX-512 anyway.)
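                | 
                | (For anyone who wants to check on their own
                | chip, a tiny sketch; lukechampine.com/blake3 is
                | one third-party Go implementation, and its
                | Sum256 API is assumed here.)
                | 
                |     package main
                |     
                |     import (
                |         "crypto/sha256"
                |         "fmt"
                |         "time"
                |     
                |         "lukechampine.com/blake3"
                |     )
                |     
                |     func main() {
                |         // 256 MiB of zeros, roughly layer-sized.
                |         buf := make([]byte, 1<<28)
                |     
                |         t := time.Now()
                |         sha256.Sum256(buf)
                |         fmt.Println("sha256:", time.Since(t))
                |     
                |         t = time.Now()
                |         blake3.Sum256(buf)
                |         fmt.Println("blake3:", time.Since(t))
                |     }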
        
           | api wrote:
           | We've gotten very good at using containerization and
           | virtualization to isolate and abstract away ugliness, and as
           | a result we have been able to build unspeakable towers of
           | horror that would have been impossible without these
           | innovations.
        
           | mschuster91 wrote:
           | > Builds being slow is one of my pet peeves also. Modern
           | "best practices" are absurdly wasteful of the available
           | computer power, but because everyone does it the same way,
           | nobody seems to accept that it can be done differently.
           | 
           | A lot of what you describe originates from lessons learned in
           | "classic" build environments:
           | 
           | - broken (or in some cases, regular) build attempts leaving
           | files behind that confuse later build attempts (e.g. because
            | someone forgot to do a git clean before the checkout step)
           | 
           | - someone hot-fixing something on a build server which never
           | got documented and/or impacted other builds, leading to long
           | and weird debugging efforts when setting up more build
           | servers
           | 
           | - its worse counterpart, someone setting up the environment
           | on a build server and never documenting it, leading to
           | serious issues when that person inevitably left and then
           | something broke
           | 
            | - OS package upgrades breaking things (e.g. Chrome/FF
            | upgrades vs. the puppeteer using them), and a resulting
            | reluctance to upgrade build servers' software
           | 
           | - attackers hacking build systems because of vulnerabilities,
           | in the worst case embedding malware into deliverables or
           | stealing (powerful) credentials
           | 
           | - colliding versions of software stacks / libraries / other
           | dependencies leading to issues when new projects are to be
           | built on old build servers
           | 
           | In contrast to that, my current favourite system of running
           | GitLab CI on AWS EKS with ephemeral runner pods is orders of
           | magnitude better:
           | 
           | - every build gets its own fresh checkout of everything, so
           | no chance of leftover build files or an attacker persisting
           | malware without being noticed (remember, everything comes out
           | of git)
           | 
           | - no SSH or other access to the k8s nodes possible
           | 
            | - every build gets a _reproducible_ environment, so when
            | something fails in a build, it's trivial to replicate
            | locally, and all changes are documented
        
             | gpderetta wrote:
              | You forgot: you can get lunch while you wait for your
              | build to finish.
        
               | mschuster91 wrote:
               | Not if your pipeline is decent. Parallelize and cache as
               | much as you can.
               | 
               | The only stack that routinely throws wrenches into
               | pipeline optimization is Maven. I'd love to run, say,
               | Sonarqube, OWASP Dependency Checker, regular unit tests
               | and end-to-end tests in parallel in different containers,
               | but Maven - even if you pass the entire target / */target
               | folders through - insists on running all steps prior to
               | the goal you attempt to run. It's not just dumb and slow,
               | it makes the runner images large AF because they have to
               | carry _everything_ needed for all steps in one image,
               | including resources like RAM and CPU.
        
             | ilyt wrote:
              | Right, but build environments are awfully stupid about it.
              | Re-downloading deps _when they did not change_ is utterly
              | wasteful and achieves nothing, same as re-compiling stuff
              | that did not change.
             | 
             | > broken (or in some cases, regular) build attempts leaving
             | files behind that confuse later build attempts (e.g.
             | because someone forgot to do a git clean before checkout
             | step)
             | 
              | But thanks to CI you will never fix such a broken build
              | system!
        
             | justinclift wrote:
             | > OS package upgrades breaking things
             | 
             | Heh. docker.io package on Ubuntu did this recently, whereby
             | it stopped honouring the "USER someuser" clause in
             | Dockerfiles. Completely breaks docker builds.
             | 
             | No idea if it's fixed yet, we just updated our systems to
             | not pull in docker.io 20.10.25-0ubuntu1~20.04. or newer.
        
               | ilyt wrote:
               | Docker developers being clueless, what else is new...
        
             | slowmovintarget wrote:
             | Indeed: Optimize to reduce developer time spent on bad
             | builds first.
             | 
             | One of the rules of this approach is to filter all
             | extraneous variable input likely to disrupt the build
             | results, especially and including artifacts from previous
             | failed builds.
        
             | jiggawatts wrote:
             | You're the perfect example of the cranky experienced
             | developer stuck in their ways and fighting against better
             | solutions.
             | 
             | Most of the issues you've described are consequences of
             | _yet more issues_ such as not caching with the correct
             | cache key.
             | 
             | I argue that all of the problems are eminently solvable.
             | You're arguing for leaving massive issues in the system
             | because of... other massive issues. Only one of these two
             | approaches to problem solving gets to a solution without
             | massive problems.
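              | 
              | (To make "the correct cache key" concrete, a minimal
              | Go sketch: hash exactly the inputs that determine
              | the output. The lockfile names are just examples.)
              | 
              |     package main
              |     
              |     import (
              |         "crypto/sha256"
              |         "fmt"
              |         "io"
              |         "os"
              |     )
              |     
              |     // cacheKey hashes the files that fully
              |     // determine the dependency tree; if none of
              |     // them changed, cached artifacts are safe to
              |     // restore.
              |     func cacheKey(files ...string) (string, error) {
              |         h := sha256.New()
              |         for _, name := range files {
              |             f, err := os.Open(name)
              |             if err != nil {
              |                 return "", err
              |             }
              |             _, err = io.Copy(h, f)
              |             f.Close()
              |             if err != nil {
              |                 return "", err
              |             }
              |         }
              |         return fmt.Sprintf("deps-%x",
              |             h.Sum(nil)), nil
              |     }
              |     
              |     func main() {
              |         key, err := cacheKey("go.mod", "go.sum")
              |         if err != nil {
              |             panic(err)
              |         }
              |         // Restore/save the build cache under key.
              |         fmt.Println(key)
              |     }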
        
         | jayd16 wrote:
          | Yeah, I remember compressing 4K 120Hz video of my AI-upscaled,
          | ray-traced RPG playthrough in real time and streaming it to
          | all my friends' phones in the 90s. Times never change.
         | 
         | But really, it's easy to forget the massive leaps we make every
         | year.
        
           | josephg wrote:
           | Maybe I'm just getting old, but diablo 4 doesn't look that
           | much better to my eyes than diablo 3 did. I'm sure the
           | textures are higher resolution and so on, but 5 minutes in I
           | don't notice or care. Its how the game plays that matters,
           | and that has almost never been hardware limited.
           | 
           | That said, I'm really looking forward to the day we can embed
           | LLMs and generative AI into video games for like world
           | generation & dialog. I can't wait to play games where on
           | startup, the game generates truly unique cities for me to
           | explore filled with fictional characters. I want to wander
           | around unique worlds and talk - using natural language - with
           | the people who populate them.
           | 
           | I'm giddy with excitement over the amazing things video games
           | can bring in the next few years. I feel like a kid again
            | looking forward to Christmas.
        
           | hypercube33 wrote:
            | There was a video posted to Hacker News about core Windows
            | apps taking longer to launch on new hardware: stuff like
            | Calc, Word, and Paint take longer to start on Win11 than on
            | Windows 2000, even though machines have much faster
            | everything. So I think the point stands: why is everything
            | slower now?
        
         | moffkalast wrote:
         | https://en.wikipedia.org/wiki/Wirth%27s_law
        
         | joseph_grobbles wrote:
         | It's a variation of Parkinson's Law -- we just keep expanding
         | what we are doing to fit the hardware available to us, then
         | claiming that nothing has changed.
         | 
         | CI is a fairly new thing. The idea of constantly doing all that
         | compute work again and again was unfathomable not that long ago
          | for most teams. We pile on loads of ancillary
         | processes, checks, lints, and so on, because we have the
         | headroom. And then we reminisce about the days when we did a
         | bi-monthly build on a "build box", forgetting how minimalist it
         | actually was.
        
           | jebarker wrote:
           | This is true, but there's still choice in how we expand and
           | the default seems to be to do it as wastefully as possible.
        
             | pbjtime wrote:
              | It's what the economy rewards. Simple as that.
        
               | jebarker wrote:
                | I don't totally agree. It's what a short-term view of the
                | economy rewards, for sure. But even if that were the only
                | view of the economy, I've seen plenty of low-performance
                | software written purely out of cargo culting and/or an
                | inability or lack of will to do anything better.
        
         | Aurornis wrote:
          | > Software is so slow compared to hardware. It's embarrassing
          | that we haven't moved even a hundredth as far as hardware has
          | in the last 30 years
         | 
         | I don't understand this mentality. What, exactly, did you
         | expect to get faster? If you run the same software on older
         | hardware it's going to be much slower. We're just doing more
         | because we can now.
         | 
         | From my perspective, things are pretty darn fast on modern
         | hardware compared to what I was dealing with 5-10 years ago.
         | 
         | I had embedded systems builds that would take hours and hours
         | on my local machine years ago. Now they're done in tens of
         | minutes on my machine that uses a relatively cheap consumer
         | CPU. I can clean build a complete kernel in a couple minutes.
         | 
         | In my text editor I can do a RegEx search across large projects
         | and get results nearly instantly! Having NVMe SSDs and high
         | core count consumer CPUs makes amazing things possible.
         | 
         | Software is improving, too. Have you seen how fast the new Bun
         | package manager is? I can pick from dozens of open source
         | database options for different jobs that easily enable high
         | performance, large scale operations that would have been
         | unthinkable or required expensive enterprise software a decade
         | ago (even with today's hardware).
         | 
         | > Why get this?
         | 
         | If you really think nothing has improved, you might be
         | experiencing a sort of hedonic adaptation: Every advancement
         | gets internalized as the new baseline and you quickly forget
         | how slow things were previously.
         | 
         | I remember the same thing happened when SSDs came out: They
         | made an amazing improvement in desktop responsiveness over
         | mechanical HDDs, but many people almost immediately forgot how
         | slow HDDs were. It's only when you go back and use a slow HDD-
         | based desktop that you realize just how slow things were in the
         | past.
        
           | josephg wrote:
           | > We're just doing more because we can now.
           | 
           | Are we though?
           | 
           | Our computers are orders of magnitude faster than they were.
           | What new features justify consuming 100x or more CPU, RAM,
           | network and disk space?
           | 
            | Is my email software doing orders of magnitude more work to
            | render an email compared to the 90s? Does Discord have _that_
            | many more features compared to mIRC that it makes sense for
            | it to take several seconds to open on my 8-core M1 laptop?
            | For reference, mIRC was a 2MB binary and I swear it opened
            | faster on my Pentium II than Discord takes to open on my 2023
            | laptop. By the standards of 1995 we all walk around with
            | supercomputers in our pockets. But you wouldn't know it,
            | because the best hardware in the world still can't keep pace
            | with badly written software. As the old line from the 1990s
            | goes, "what Andy giveth, Bill taketh away."[1] (Andy Grove
            | was Intel's CEO at the time.)
           | 
           | My instinct is that as more and more engineers work "up the
           | stack" we're collectively forgetting how to write efficient
            | code. Or just not bothering. Why optimize your React web app
           | when everyone will have new phones in a few years with more
           | RAM? If the users complain, blame them for having old
           | hardware.
           | 
           | I find this process deeply disrespectful to our users. Our
           | users pay thousands of dollars for good computer hardware
           | because they want their computer to run fast and well. But
           | all of that capacity is instead chewed through by developers
           | the world over trying to save a buck during development time.
           | Every hardware upgrade our users make just becomes the new
           | baseline for how lazy we can be.
           | 
           | Slow CI/CD pipelines are a completely artificial problem.
           | There's absolutely no technical reason that they need to run
           | so slowly.
           | 
           | [1] https://en.wikipedia.org/wiki/Andy_and_Bill%27s_law
        
           | rwmj wrote:
           | And yet, just a few minutes ago (the time it took to reboot
           | my laptop), I clicked open the "Insert" menu in a Google doc
           | and the machine hung. Even a 128k Macintosh with a 68000 CPU
           | could handle that.
        
             | sp332 wrote:
             | ClarisWorks would regularly hang my Mac Classic II. And
             | with cooperative multitasking, I had to reboot the machine.
        
           | Almondsetat wrote:
            | Why is opening Word or Excel or PowerPoint not instantaneous?
            | Plenty of people, me included, have to sift through vast
            | amounts of documents and constantly open/close them.
        
             | JohnBooty wrote:
             | Application "bloat" is one obvious thing, but operating
             | systems are also doing a lot more "work" to open an app
             | these days -- address space randomization, checking file
             | integrity, checking for developer signing certs,
             | mitigations against hardware side channel attacks like
             | rowhammer or whatever, etc.
             | 
             | Those things aren't free. But, I don't know the relative
             | performance hit there compared to good old software bloat.
        
               | soulbadguy wrote:
               | > address space randomization, checking file integrity,
               | checking for developer signing certs, mitigations against
               | hardware side channel attacks like rowhammer or whatever,
               | etc.
               | 
                | OSes and applications were getting slower well before
                | any of those were a thing.
        
             | mejutoco wrote:
             | Maybe they are waiting to make them right before making
             | them fast? /s
             | 
             | (Make it work, make it right, make it fast)
        
       | [deleted]
        
       | joseph_grobbles wrote:
       | This is a really terribly written article. Sorry for the
       | negativity, but it was tough to try to dig through.
       | 
        | For those confused by the title, Intel released a Xeon that
        | includes 64GB of high-speed RAM in the CPU package itself,
        | configurable as primary memory, as memory pooled alongside the
        | DDR5, or as a caching layer for the memory subsystem.
        
         | mutagen wrote:
         | I suspect it was the video transcript or a lightly edited
         | version of the transcript.
        
         | CobaltFire wrote:
         | I came here to say the same thing; usually StH is very
         | readable. This has to either be a video transcript or some AI
         | pruning (maybe a combination) because it reads absolutely
         | horribly.
        
         | neolefty wrote:
         | Just as gaming hardware has taken over compute, gaming
         | journalism must take over reporting!
        
         | denton-scratch wrote:
         | I also found it hard to read; I wanted to know how one might
         | use this thing, but instead I learned all about channels,
         | "wings", backplanes, and saw a lot of tables and photos that
         | seem to be largely duplicates. An entire page (out of 5 pages)
         | was dedicated to examining the development system.
        
       | ipython wrote:
        | I couldn't figure out what the _actual_ measured memory bandwidth
        | to the onboard HBM2e memory stacks is -- how does it compare to,
        | say, Apple's M2 Ultra?
        
         | opencl wrote:
         | According to this paper[0] ~1600GB/s raw memory bandwidth,
         | ~590GB/s in practice. I haven't seen any actual benchmarks of
         | M2 Ultra CPU bandwidth, but the M1 Ultra has been benchmarked
         | at ~240GB/s[1].
         | 
         | [0]
         | https://www.ixpug.org/images/docs/ISC23/McCalpin_SPR_BW_limi...
         | 
         | [1]
         | https://macperformanceguide.com/blog/2022/20220618_1934-Appl...
        
           | doctorpangloss wrote:
           | > The cores used in the Xeon Max processors do not support
           | enough concurrency to reach full HBM bandwidth with the
           | current core counts. 2x more concurrency or 2x more cores.
           | 
           | is a good punchline from the report. The Apple cores have
           | that problem but a lot worse. They are slow. It goes back to
           | this flawed idea in the community that you can "just" "add"
           | "more memory," when parts like the H100 have their memory
           | size matched to the architecture (physical and software) of
           | the dozens of CPUs on them.
           | 
           | I'm not sure why the conception persists that a 60W laptop
           | part would be comparable to a 300W server part of the same
           | process generations, let alone this particular part.
        
           | [deleted]
        
           | foota wrote:
           | That's insane.
        
         | JacobiX wrote:
          | The M1 Ultra has 800GB/s of memory bandwidth; in contrast, HBM2E
         | has 204.8 Gbps x 2 = 409.6 Gb/s
        
           | consp wrote:
            | Is that available to any core, or is it just a sum-all-up
            | figure with latency penalties when going around?
        
             | hypercube33 wrote:
              | Depends on the NUMA node config, so I believe this is the
              | combined figure for the whole chip if all cores are working
              | on threads against their local 16GB of HBM, theoretically.
        
           | wmf wrote:
            | GPUs use 4-6 stacks of HBM2, which gives 1,840-2,760 GB/s.
            | That's 2x-3x the bandwidth of M1/M2 Ultra.
        
           | Brian-Puccio wrote:
           | So 6,400 Gb/s compared to 409.6 Gb/s once you convert units?
        
       | ftxbro wrote:
       | according to google it costs thirteen thousand dollars
        
         | Aissen wrote:
          | It's less expensive than the public price of the biggest
          | Genoa-X with 1.15GB of L3: the 9684X, at $14k+.
        
         | yvdriess wrote:
          | That's high-end Xeon for you. At least here you can argue it
          | can save you from buying a GPU.
        
           | varelse wrote:
           | NUMAAF I guess, but it doesn't seem like they did anything
           | interesting with it yet.
        
         | RobotToaster wrote:
         | Ouch, that's about the same as two A100s, any idea how it would
         | compare?
        
           | Aissen wrote:
           | If you buy 40GB A100s, maybe, but LLMs have pushed the price
           | of the 80GB A100s in the same range (they're more expensive
           | now than at launch(!)).
        
           | llm_nerd wrote:
            | While the title bizarrely references GPUs, this is a normal
            | general-purpose CPU that has 64GB of very high-speed memory
            | in the CPU package itself.
        
         | luma wrote:
         | That's datacenter compute for you, and they'll sell a bunch of
         | them. One simple reason (of many) - a lot of enterprise
         | licenses are tied to CPU count and those licenses can be
         | extremely expensive. You can save money on the licenses by
         | buying fewer, higher speed CPUs, and in many cases it can
         | offset the increase in CPU cost.
        
           | ftxbro wrote:
           | > a lot of enterprise licenses are tied to CPU count and
           | those licenses can be extremely expensive.
           | 
           | if this pricing system is making an inefficient market for
           | cpus then maybe it could be disrupted somehow
        
       | bhouston wrote:
       | Intel took on Apple's Max and Ultra nomenclature I see.
        
         | usrusr wrote:
          | I wonder if it might be a deliberate tip of the hat smuggled
          | into the naming by engineering: if I were at Intel I'd
          | _desperately_ hope that much of the Apple Silicon performance
          | comes from its on-package memory, and this HBM2 monster looks
          | exactly like something you might come up with if that hope got
          | you thinking.
        
           | tubs wrote:
            | Apple uses the same memory packaging every mobile SoC uses
            | (and has used since flip-chip and PoP packaging arrived).
            | There is nothing super special about the RAM or how it's
            | wired. This myth has to cease.
            | 
            | The only thing you can say about the larger chips is that
            | they have much wider channels, but they are still just
            | regular DDR.
        
       ___________________________________________________________________
       (page generated 2023-09-21 23:01 UTC)