[HN Gopher] Nvidia Hopper GPU Architecture and H100 Accelerator
       ___________________________________________________________________
        
       Nvidia Hopper GPU Architecture and H100 Accelerator
        
       Author : jsheard
       Score  : 214 points
       Date   : 2022-03-22 15:52 UTC (7 hours ago)
        
 (HTM) web link (www.anandtech.com)
 (TXT) w3m dump (www.anandtech.com)
        
       | lmeyerov wrote:
       | Seeing the increased bandwidth is super exciting for a lot of
       | business analytics cases we get into for
       | IT/security/fraud/finance teams: imagine correlating across lots
       | of event data from transactions, logs, ... . Every year, it just
       | goes up!
       | 
        | The big welcome surprise for us was the secure virtualization.
        | Outside of some limited 24/7 ML teams, we mostly see bursty
        | multi-tenant scenarios for achieving cost-effective utilization.
        | MIG-style static physical partitioning was interesting -- I can
        | imagine cloud providers offering that -- but more dynamic & logical
        | isolation, with more of a focus on namespace isolation, is more
        | relevant to what we see. Once we get into federated learning, and
        | further disintermediation around that, it gets even cooler. Imagine
       | bursting on 0.1-100 GPUs every 30s-20min. Amazing times!
        
       | algo_trader wrote:
       | From the nvidia page,
       | 
       | > 80 billion transistors
       | 
       | > Hopper H100 .. generational leap
       | 
       | > 9x at-scale training performance over A100
       | 
       | > 30x LLM inference throughput
       | 
       | > Transformer Engine .. speed .. 6x without losing accuracy
       | 
        | So another monster chip - same size as the Apple M1 Max thingy ..
        | 
        | I guess it comes down to pricing. The A100 is already
        | ridiculously expensive at $10K. They could price this one at $50K
        | and it would still sell out?
        
         | p1esk wrote:
          | You can buy A100s in a server today; a number of integrators
          | will happily sell them to you.
        
           | chockchocschoir wrote:
            | As someone who's tried for some weeks, it really seems like
            | it's out of stock literally everywhere. The demand seems to
            | be a lot higher than the supply at the moment, so much so
            | that I'm considering buying one myself instead of renting
            | servers with them.
        
             | nqnielsen wrote:
              | And even when vendors did say they had it, or could get
              | it, it ended up taking us 4-6 months before systems were
              | online.
        
             | kovek wrote:
             | Does it make sense that all the GPUs are bought out? They
             | each provide a return for mining in the short-term. In the
             | long term, they can be used to run A(G)I models, which will
             | be very very useful
        
               | fennecfoxen wrote:
               | This is the GPU the parent is talking about
               | https://www.nvidia.com/en-us/data-center/a100/
        
               | kovek wrote:
                | This still makes sense! These GPUs are useful for AI,
                | which itself will be very very useful. It's almost like
                | it's the best investment. That's why smart players buy
                | them all. Maybe I'm going off-topic.
        
             | p1esk wrote:
             | Did you check Lambda or Exxact?
        
               | chockchocschoir wrote:
                | Yes, neither Lambda Labs nor Exxact Corporation had them
                | available last time I checked (last week). Both cited
                | high demand as the reason for them being unavailable.
        
               | asciimike wrote:
               | Howdy, I run [Crusoe Cloud](https://crusoecloud.com/) and
               | we just launched an alpha of an A100 and A40 Cloud
               | offering--we've got capacity at a reasonable price!
               | 
               | If you're interested in giving us a shot, feel free to
               | shoot me an email at mike at crusoecloud dot com.
        
               | sabalaba wrote:
               | We (Lambda) have all of the different NVIDIA GPUs in
               | stock ---- can you send a message to sales@lambdalabs.com
               | and check in again with your requirements? We're seeing a
               | lot more stock these days as the supply chain crisis of
               | 2021 comes to an end.
        
         | Uehreka wrote:
         | Most people who use one of these will be doing so through an
         | EC2 VM (or equivalent). Given that cloud platforms can spread
         | load, keep these GPUs churning close to 24/7 and more easily
         | predict/amortize costs, they'll probably buy the amount that
         | they know they need, and Nvidia probably has some approximately
         | correct idea of what that number is.
        
       | spoonjim wrote:
        | Off topic, but I can't stand when corporations use, for their
        | marketing, the names of actual people who never gave them
        | permission to do so. For something like Shakespeare or Cicero I'm
        | OK with it, but Grace Hopper was alive in my lifetime, and even
        | Tesla feels a little weird. What gives you the right to use that
        | person's reputation to shill your product?
        
         | paxys wrote:
         | > What gives you the right to use that person's reputation to
         | shill your product?
         | 
         | Practically speaking you have the right to do anything unless
         | someone complains about it. A lot of popular figures, even
         | those long dead, have estates and organizations that manage
          | their likeness and other related copyright and IP. IDK what
         | the situation is in this case, but Nvidia may very well have
         | paid for the name.
        
           | erosenbe0 wrote:
           | The situation is that various Australian companies (think
           | Kangaroo) and DISH network already have Hopper product lines
           | and Nvidia didn't care about getting into a legal kerfuffle
           | and used the name anyway. As to whether Hopper's estate was
           | consulted I don't know.
        
           | cosmiccatnap wrote:
        
           | spoonjim wrote:
           | I don't think my kids have any more right to use my name than
           | a corporation, unless I specifically grant them that right
           | (like Walt Disney did by naming it the Walt Disney company).
           | Another sickening one is the Ed Lee Club in SF, who endorses
           | political candidates under the name of a much-loved dead SF
           | mayor.
        
             | paxys wrote:
             | Your kids have the right to everything you own (
             | _including_ your name) by default unless you take steps to
             | change that, say using a will or estate.
        
               | spoonjim wrote:
               | Yes, I know, I'm saying that it should not be that way.
               | Rights to your likeness should end at your death unless
               | you specifically write down otherwise.
        
               | oblio wrote:
               | Do you have kids?
        
               | spoonjim wrote:
               | Yes.
        
               | oblio wrote:
               | And you think they shouldn't have that right because of
               | social concerns like accumulation of wealth?
        
         | erosenbe0 wrote:
         | Theranos' "Edison" machine enters the chat...
        
         | foolfoolz wrote:
          | what gives you the right to own a name ever? especially once
          | you're dead?
        
         | eigenvalue wrote:
         | I generally agree with you, but in this case I suspect Grace
         | Hopper would be honored by it and also impressed with the
         | engineering here. It's not like they slapped her name on a soda
         | can or something.
        
       | gautamcgoel wrote:
       | This chip is capable of 2000 INT8 Tensor TOPS, or 1000 F16 Tensor
       | TFLOPS. In other words, it is capable of performing over a
       | quadrillion operations per second. Absolutely insane... I still
       | have fond memories of installing my first NVidia gaming GPU, with
       | just 512MB of RAM, probably capable of much less than a single
       | teraflop of compute.
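        | 
        | Back-of-the-envelope in Python, taking the quoted peak numbers at
        | face value (the old-card figure is just a guess for illustration):
        | 
        |     h100_fp16_tflops = 1000          # claimed tensor-core peak
        |     ops_per_second = h100_fp16_tflops * 1e12
        |     print(f"{ops_per_second:.0e}")   # 1e+15 -> a quadrillion ops/s
        | 
        |     old_card_tflops = 0.3            # rough guess for a 512MB-era GPU
        |     print(ops_per_second / (old_card_tflops * 1e12))  # ~3300x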
        
       | Rafuino wrote:
       | Good lord 700W TDP!
        
       | Symmetry wrote:
       | NVidia and AMD datacenter GPUs continue to diverge between
       | focusing on deep learning and traditional scientific computing
       | respectively.
        
       | orangebeet wrote:
       | This would be cool if they had decent drivers for Linux.
        
         | why_only_15 wrote:
          | They do have good drivers for Linux for the things this chip
         | is intended to be used for (research, ML).
        
           | johndough wrote:
           | If you haven't had any issues with NVIDIA Linux drivers, you
           | can count yourself extremely lucky. In the past, I had a
           | 50/50 chance of boot failure after installing CUDA drivers
           | over 12 different systems. Mainline Ubuntu drivers are
           | somewhat stable, but installing a specific CUDA version from
           | the official NVIDIA repos rarely works on the first try.
           | Switching from Tensorflow to PyTorch has helped a lot though,
           | as Tensorflow was much more picky about the installed CUDA
           | version.
           | 
            | Obligatory Linus Torvalds on NVIDIA:
           | https://www.youtube.com/watch?v=_36yNWw_07g
        
         | paxys wrote:
         | I can assure you systems that take advantage of this chip for
         | scientific/ML workloads aren't running Windows.
        
           | flatiron wrote:
           | they may have edited their comment but they were commenting
           | on the lack of quality of their Linux drivers (which I agree
           | with but only on a consumer level, never used nvidia in a
           | server)
        
       | obeliskora wrote:
       | Anyone find details about the DPX instructions for dynamic
       | programming?
        
         | pjmlp wrote:
         | There are some deep dive sessions at GTC that will probably go
         | into them.
        
       | trollied wrote:
       | Looking forward to seeing Doom run on this.
        
         | [deleted]
        
       | minimaxir wrote:
       | So the product naming for Nvidia's server-GPUs by compute power
       | now goes:
       | 
       | P100 -> V100 -> A100 -> H100
       | 
       | This is not confusing at all.
        
         | torginus wrote:
         | I think this is less of an issue since these GPUs are not meant
         | for the everyman, so basically the handful of server
         | integrators can figure this out by themselves.
         | 
         | And for your typical dev - they'll interact with the GPU
         | through a cloud provider, where they can easily know that a G5
         | instance is newer than a G4 one.
        
         | gtirloni wrote:
         | Isn't it based on the architecture name?
         | 
         | https://en.wikipedia.org/wiki/Category:Nvidia_microarchitect...
        
           | jsheard wrote:
           | Yeah it is, but unless you've memorised the history of Nvidia
           | architectures it doesn't tell you which is the newer one
           | 
           | Fermi -> Kepler -> Maxwell -> Pascal -> Volta (HPC only) ->
           | Turing -> Ampere -> Hopper (HPC only?) -> Lovelace?
        
             | modeless wrote:
             | Someone should make a game like "Pokemon or Big Data" [1]
             | except you have to choose which of two GPU names is faster.
             | Even the consumer naming is bonkers so there's plenty of
             | material there!
             | 
             | [1] http://pixelastic.github.io/pokemonorbigdata/
        
             | ksec wrote:
              | Isn't this the norm? Only AMD started the trend of naming
              | the uArch with numbers, as in Zen 4 or RDNA 3, fairly
              | recently. With Intel it is Haswell > Broadwell > .....
              | Whatever Lake.
        
               | kergonath wrote:
               | Intel is using generation numbers in their marketing
               | materials. In the technical-oriented slide decks you'd
                | see things like "42nd generation, formerly named Bullshit
               | Creek" but they are not supposed to use that for sales.
               | And then actual part names like i9-42045K.
               | 
               | We keep using code names in discussions because the
               | actual names are ass backwards and not very descriptive.
        
               | jsheard wrote:
               | Usually the architecture name isn't the only
               | distinguishing feature of the product name, you don't
               | need to remember Intel codenames because a Core 12700 is
               | obviously newer than a Core 11700
               | 
               | Nvidia's accelerators are just called <Architecture
               | Letter>100 every time so if you don't remember the order
               | of the letters it's not obvious
               | 
               | They could have just named them P100, V200, A300 and H400
               | instead
        
               | 867-5309 wrote:
               | >you don't need to remember Intel codenames because a
               | Core 12700 is obviously newer than a Core 11700
               | 
               | J3710, 7th Gen J3060, 8th Gen
               | 
               | J4205, 8th Gen J4125, 9th Gen
               | 
               | i3-5005U, 5th Gen N5095, 10th Gen
               | 
               | i7-3770, 3rd Gen 3865U, 7th Gen N3060, 8th Gen
        
               | paulmd wrote:
               | And an AMD 5700U is older than a 5400U as well. A 3400G
               | is older than a 3100X. 3300X isn't really distinctive
                | from 3100X, both are quad-core configurations (but
                | different CCD/cache configurations, which of course the
                | name doesn't really disclose to the consumer). It
               | happens, naming is a complex topic and there's a lot of
               | dimensions to a product.
               | 
               | In general, complaining about naming is peak bikeshedding
               | for the tech-aware crowd. There are multiple naming
               | schemes, all of them are reasonable, and everyone hates
               | some of them for completely legitimate reasons (but
               | different for every person). And the resulting
               | bikeshedding is exactly as you'd expect with that.
               | 
               | The underlying problem is that products have multiple
               | dimensions of interest - you've got architecture, big vs
               | small core, core count, TDP, clockrate/binning, cache
               | configuration/CCD configuration, graphics configuration,
               | etc. If you sort them by generation, then an older but
               | higher-spec can beat a newer but lower-spec. If you sort
               | by date then refreshes break the scheme. If you split
               | things out into series (m7 vs i7) to express TDP then
               | some people don't like that there's a bunch of different
               | series. If you put them into the same naming scheme then
               | some people don't like that a 5700U is slower than a
               | 5700X. If you try to express all the variables in a
               | single name, you end up with a name like "i7 1185G7"
               | where it's incomprehensible if you don't understand what
               | each of the parts of the name mean.
               | 
               | (as a power user, I personally think the Ice Lake/Tiger
               | Lake naming is the best of the bunch, it expresses
               | everything you need to know: architecture, core count,
               | power, binning, graphics. But then big.LITTLE had to go
               | and mess everything up! And other people still hated it
               | because it was more complex.)
               | 
               | There are certain ones like AMD's 5000 series or the
               | Intel 10th-gen (Comet Lake 10xxxU) that are just really
               | ghastly because they're deliberately trying to mix-and-
               | match to confuse the consumer (to sell older stuff as
               | being new), but in general when people complain about
               | "not understanding all those Lakes and Coves" it's
               | usually just because they aren't interested in the
               | brand/product and don't want to bother learning the
               | names, and they will eagerly rattle off a list of
               | painters or cities that AMD uses as their codenames.
               | 
               | Like, again, to reiterate here, I literally never have
               | seen anyone raise AMD using painter names as being
               | "opaque to the consumer" in the same way that people
               | repeatedly get upset about lakes. And it's the exact same
               | thing. It's people who know the AMD brand and don't know
               | the Intel brand and think that's some kind of a problem
               | with the branding, as opposed to a reflection of their
               | own personal knowledge.
               | 
               | I fully expect that AMD will release 7000 series desktop
               | processors this year or early next year, and exactly 0
               | people are going to think that a 7600 being newer than a
               | 7702 is confusing in the way that we get all these
               | aggrieved posts about Intel and NVIDIA. Yes, 7600 and
               | 7702 are different product lines, and that's the exact
               | same as your "but i7 3770 and N3060 are different!"
               | example. It's simply not that confusing, it takes less
               | time to learn than to make a single indignant post on
               | social media about it.
               | 
               | Similarly, the NVIDIA practice of using inventors/compsci
               | people is not particularly confusing either. Basically
               | the same as AMD with the painters/cities.
               | 
               | It's just not that interesting, and it's not worth all
               | the bikeshedding that gets devoted to it.
               | 
               | </soapbox>
               | 
               | Anyway, your example is all messed up though. J3710 and
               | J3060 are both the same gen (Braswell), launched at the
               | same time (Q1 2016), that example is entirely wrong.
               | J4125 vs J4205 is an older but higher specced processor
                | vs a newer but lower spec, it's an 8th gen Pentium vs a
               | 9th gen Celeron, like a 3100X vs a 2700X (zomg 3100X is
               | bigger number but actually slower!). And the J4125 and
               | J4205 are refreshes of the same architecture with
               | legitimately very similar performance classes. i3 and
               | Atom or i7 and Atom are completely different product
               | lines and the naming is not similar at all there, apart
               | from both having 3s as their first number (not even first
               | character, that is different too, just happen to share
               | the first number somewhere in the name).
               | 
               | Again, like with the Tiger Lake 11xxGxx naming, the
               | characters and positions in the name have meaning. You
               | can come up with better examples than that even within
               | the Intel lineup. Just literally picking 3770 and J3060
               | as being "similar" because they both have 3s in them.
               | 
               | The one I would legitimately agree on is that the Atom
               | lineup is kind of a mess. Braswell, Apollo Lake, Gemini
               | Lake, and Gemini Lake Refresh are all crammed into the
               | "3000/4000" series space, and there is no "generational
               | number" in that scheme either. Braswell is all 3000
               | series and Gemini Lake/Gemini Lake Refresh is all 4000
               | series but you've got Apollo Lake sitting in the middle
               | with both 3000 and 4000 series chips. And a J3455 (Apollo
               | Lake 1.5 GHz) is legitimately a better (or at least
               | equal) processor to a J3710 (Braswell 1.6 GHz). Like
               | 5700U vs 5800U, there are some legitimate architectural
                | differences hidden behind an opaque number there
               | (and on the Intel it's graphics - Gemini Lake/Gemini Lake
               | Refresh have a much better video block).
               | 
               | (And that's the problem with "performance rating"
               | approaches, even if a 3710 and a 3455 are similar in
               | performance there's still other differences between them.
               | Also, PR naming instantly turns into gamesmanship - what
               | benchmark, what conditions, what TDP, what level of
               | threading? Is an Intel 37000 the same as an AMD 37000?)
        
               | 867-5309 wrote:
               | yes, it's a bit of a shitshow, as mutually evidenced.
               | unless consumers brush up on such intricate details (most
               | do not), they will inevitably fall into traps such as "i7
               | is better than i3" e.g. i7-2600 being outperformed by
               | i3-10100 and "quad core is better than dual core".
               | marketing is becoming more focused on generations now
               | which is a prudent move: "10th Gen is better than 2nd
               | Gen" but it will be at least a decade before the shitshow
               | is swept
        
               | mywittyname wrote:
               | Intel was using Core i[3,5,7] names for multiple
               | generations. A Core i7 could be faster or slower than a
               | Core i5 depending on which generation each existed in.
               | 
               | It is nice when products have a naming scheme where
               | natural ordering of the name maps to performance.
        
               | numpad0 wrote:
               | We need a canonical, chronologically monotonic, marketing
                | independent ID scheme. Marketing people always try to
               | disrupt naming schemes and that's the real problem.
        
               | bee_rider wrote:
               | I don't really mind the incomprehensible letters --
               | looking up the generation is pretty easy, and these are
               | data-center focused products... getting the name right is
               | somebody's job and the easiest possible thing.
               | 
               | However, is the number superfluous at this point?
        
           | neogodless wrote:
           | You just blew my mind. That did not occur to me, but it is
           | obvious in retrospect.
        
             | Uehreka wrote:
             | Sure, but the way Nvidia names generations is far from
             | obvious. It seems to be "names of famous scientists,
             | progressing in alphabetical order, we skip some letters if
             | we can't find a well known scientist with a matching last
             | name and are excited about a scientist 2 letters from now,
             | we wrap around to the beginning of the alphabet when we get
             | to the end, and we just skipped from A to H, so expect
             | another wraparound in the next 5-10 years."
        
               | [deleted]
        
         | ipsin wrote:
         | I mean, I'd already internalized P100 < V100 < A100 as a Colab
         | user.
         | 
         | Schedule me on an H100 and I promise I won't mind the
         | "confusing" naming.
        
           | jjoonathan wrote:
           | Also, the naming drives devs towards the architecture papers,
           | which are important if you want to get within sight of
           | theoretical perf. When NVidia changes the letter, it's like
           | saying "hey, pay attention, at least skim the new
           | whitepaper." Over the last decade, I feel like this
           | convention has respected my time, so in turn it has earned my
           | own respect. I'll read the Hopper whitepaper tonight, or
           | whenever it pops up.
        
       | nynx wrote:
       | Sounds like we need some new training methods. If training could
       | take place locally and asynchronously instead of globally through
       | backpropagation, the amount of energy could probably be
       | significantly reduced.
        
         | hwers wrote:
         | Trying to reduce energy consumption for ML like this is so
         | silly.
        
           | mlyle wrote:
           | Training costs are growing exponentially bigger.
           | 
           | The degree to which energy and capital costs can be optimized
           | will determine how large they can go.
        
           | thfuran wrote:
           | Why?
        
           | oblio wrote:
           | Reducing energy consumption for computation is not silly.
           | 
            | We're at a point where we're turning into a computation-
            | driven society and computation is becoming a globally
            | relevant power consumption factor.
           | 
           | > global data centers likely consumed around 205 terawatt-
           | hours (TWh) in 2018, or 1 percent of global electricity use
           | 
           | And that's just data centers, if you add all client devices
           | you probably double that.
           | 
           | Plus that number will only continue to grow.
        
         | moinnadeem wrote:
         | Disclosure: I work at MosaicML
         | 
         | Yeah, I strongly agree. While Nvidia is working on better
         | hardware (and they're doing a great job at it!), we believe
         | that better training methods should be a big source of
         | efficiency. We've released a new PyTorch library for efficient
         | training at http://github.com/mosaicml/composer.
         | 
          | Our combinations of methods can train models ~4x faster to the
          | same accuracy on CV tasks, and ~2x faster to the same
          | perplexity/GLUE score on NLP tasks!
        
           | jwuphysics wrote:
           | I've been seeing a lot more about MosaicML on my Twitter
           | feed. Just wanted to ask -- how are your priorities different
           | than, say, Fastai?
        
         | zozbot234 wrote:
         | The principled way of doing this is via ensemble learning,
         | combining the predictions of multiple separately-trained
         | models. But perhaps there are ways of improving that by
         | including "global" training as well, where the "separate"
         | models are allowed to interact while limiting overall training
         | costs.
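          | 
          | A minimal sketch of the plain-ensembling half (prediction
          | averaging over independently trained models; the models list
          | and input x are placeholders):
          | 
          |     import torch
          | 
          |     def ensemble_predict(models, x):
          |         # Average class probabilities from separately trained models.
          |         probs = [torch.softmax(m(x), dim=-1) for m in models]
          |         return torch.stack(probs).mean(dim=0)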
        
       | captainbland wrote:
        | Those specs imply some pretty crazy architectural efficiency
        | gains: massive theoretical compute performance per transistor
        | compared to Ampere. It's all marketing numbers until the
       | benchmarks are out, though.
       | 
       | Edit: big TDP, though.
        
       | ksec wrote:
        | And maybe taking this opportunity to ask: what happened to
        | Nvidia's leak? The hacker hasn't made any more news, and Nvidia
        | hasn't provided an update either.
        
         | [deleted]
        
         | Melatonic wrote:
         | What was the leak?
        
           | ksec wrote:
           | https://news.ycombinator.com/item?id=30590752
        
         | TomVDB wrote:
         | In the keynote, Jensen made a sly remark about how they
         | themselves could benefit a lot from one of their cyberthreat AI
         | solutions.
        
         | quotemstr wrote:
         | The simplest explanation is that Nvidia just paid up.
        
       | pixel_fcker wrote:
       | The new block cluster shared memory and synchronisation stuff
       | looks really really nice.
        
       | [deleted]
        
       | throw0101a wrote:
        
       | ortusdux wrote:
       | 80 billion transistors boggles my mind. How many molecules are
        | there per transistor?
        
         | martini333 wrote:
         | wat
        
         | Symmetry wrote:
          | It's a crystal, so just one molecule for all the transistors.
          | In terms of atoms, a transistor is something on the order of a
          | 30 nm cube, and with each silicon atom being ~0.2 nm in
          | diameter, that's something like 3 million atoms, give or take
          | an order of magnitude or two.
        
           | ortusdux wrote:
           | That makes sense. My mistake, I did mean atoms, not
           | molecules. Wolfram alpha estimates 1.35 million Si atoms, so
           | well within 1 order of magnitude.
           | 
           | https://www.wolframalpha.com/input?i=30%5E3+cubic+nanometers.
           | ..
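            | 
            | Same estimate in a couple of lines of Python (using silicon's
            | bulk atomic density, ~5e22 atoms/cm^3, rather than packing
            | 0.2 nm spheres, which is why it lands below 3 million):
            | 
            |     transistor_volume_cm3 = (30e-7) ** 3   # 30 nm cube, in cm
            |     si_atoms_per_cm3 = 5.0e22
            |     print(transistor_volume_cm3 * si_atoms_per_cm3)  # ~1.35e6 atoms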
        
       | virtuallynathan wrote:
       | How does a DGX Pod w/ the new 3.2Tbps per machine NVLINK switch
       | compare to Tesla Dojo?
        
         | virtuallynathan wrote:
         | Tesla Dojo Training tile (25x D1): 565 TF FP32 / 9 PF BF16/CFP8
         | / 11GB SRAM / 10kW
         | 
         | NVIDIA DGX H100 (8x H100): 480 TF FP32 / 8 PF+ TF16 / 16 PF
         | INT8 / 640GB HBM3 / 10kW
         | 
         | Dojo off-chip BW: 16 TB/s / 36TB/s off-tile
         | 
         | H100 off-chip BW: 3.9TB/s / 400GB/s off-DGX
        
         | TomVDB wrote:
         | When you take software support into account, probably very
         | favorable.
         | 
         | I don't know anything about the state of Dojo, but Tesla was
         | very hand wavy about their software stack during their
         | presentation. And running AI algorithms efficiently on a piece
         | of hardware is one of those things that many HW vendors have a
         | hard time getting right.
        
       | waynecochran wrote:
        | This seems fast...
        | 
        |     TF32  ....... 1,000 TFLOPS (tensor core)
        |     FP64/FP32 ...    60 TFLOPS
       | 
       | I am more interested in the 144-core Grace CPU Superchip. nVidia
       | is getting into the CPU business...
        
         | [deleted]
        
         | macrolocal wrote:
         | 50% sparsity and rated at 700W. The new DGX is 10kW!
        
           | wmwmwm wrote:
           | I was recently researching how you'd host systems like this
           | in a datacentre and was blown away to find out that you can
           | cool 40kW in a single air cooled rack - this might be old
           | news for many, but it was 2x or 3x what I expected! Glad I'm
           | not paying the electricity bill :)
        
             | jmole wrote:
             | Here's what a propane heater of similar output looks like:
             | https://www.amazon.com/Dura-Heat-Propane-Forced-
             | Heater/dp/B0...
        
               | HelloNurse wrote:
                | Most of the propane heater is a fan in a tube; the flame
                | is probably quite a bit smaller than a CPU package.
        
               | baq wrote:
               | I've got an 8kW wood stove and that thing gets rather hot
               | to touch - as in, you will get a blister... 40kW is a
               | small city car worth of power.
        
         | cjbgkagh wrote:
         | I think the 1PFLOPS figure for TF32 is with sparsity, which
         | should be called out in the name. Maybe 'TFS32'? I mainly use
         | dense FP16 so the 1PFLOPS for that looks pretty good.
        
           | lostmsu wrote:
           | Asked elsewhere, but why FP16 as opposed to BF16?
        
             | cjbgkagh wrote:
              | I'm using older Turing GPUs; BF16 would require Ampere.
              | The weights in my models tend to be normalized, so the
              | fraction would be more important than the exponent and I
              | would probably still use FP16. I would need to test it
              | though.
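              | 
              | A quick way to eyeball that tradeoff with nothing but numpy
              | (BF16 is emulated here by zeroing all but the top 7
              | mantissa bits of an FP32, since numpy has no native
              | bfloat16):
              | 
              |     import numpy as np
              | 
              |     def to_bf16(x):
              |         # Keep sign + 8 exponent bits + top 7 mantissa bits.
              |         bits = np.asarray(x, dtype=np.float32).view(np.uint32)
              |         return (bits & np.uint32(0xFFFF0000)).view(np.float32)
              | 
              |     w = np.float32(0.1234567)
              |     print(abs(np.float16(w) - w))  # FP16: 10 mantissa bits
              |     print(abs(to_bf16(w) - w))     # BF16: 7 bits, coarser steps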
        
         | Melatonic wrote:
          | Same - plus it's SUPER
        
       | rafaelero wrote:
       | "Combined with the additional memory on H100 and the faster
       | NVLink 4 I/O, and NVIDIA claims that a large cluster of GPUs can
       | train a transformer up to 9x faster, which would bring down
       | training times on today's largest models down to a more
       | reasonable period of time, and make even larger models more
       | practical to tackle."
       | 
       | Looking good.
        
         | [deleted]
        
         | ml_hardware wrote:
         | The 9x speedup is a bit inflated... it's measured at a
         | reference point of ~8k GPUs, on a workload that the A100
         | cluster is particularly bad at.
         | 
         | When measured at smaller #s of GPUs which are more realistic,
         | the speedup is somewhere between 3.5x - 6x. See the GTC Keynote
         | video at 38:50: https://youtu.be/39ubNuxnrK8?t=2330
         | 
         | Based on hardware specs alone, I think that training
         | transformers with FP8 on H100 systems vs. FP16 on A100 systems
         | should only be 3-4x faster. Definitely looking forward to
         | external benchmarks over the coming months...
        
         | Melatonic wrote:
         | We have needed wide use of NVlink or something like it for a
          | long time now... here's to hoping mobo manufacturers actually
         | widely implement it!
        
           | learndeeply wrote:
           | The open standard version of NVLink is CXL. They're available
           | in latest gen CPUs.
        
             | Melatonic wrote:
             | Interesting - I did not know that. Don't we also need
             | motherboard manufacturers though to more widely implement
             | the hardware required? It has been awhile since I have read
             | about NVlink to be fair
        
       | komuher wrote:
        | 1000 TFLOPS so I can run my GPT-3 in under 100 ms locally :D
        | 
        | If 1000 TFLOPS is possible at inference time then I'm speechless
        
         | edf13 wrote:
          | At what cost, I wonder?
        
           | komuher wrote:
           | I would assume about 30-40k usd but we'll see
        
           | Melatonic wrote:
            | Huge recurrent licensing costs are the killer with these
        
         | ml_hardware wrote:
         | At inference time it will be possible to do 4000 TFLOPS using
         | sparse FP8 :)
         | 
         | But keep in mind the model won't fit on a single H100 (80GB)
         | because it's 175B params, and ~90GB even with sparse FP8 model
         | weights, and then more needed for live activation memory. So
          | you'll still want at least 2 H100s to run inference, and more
         | realistically you would rent a 8xH100 cloud instance.
         | 
         | But yeah the latency will be insanely fast given how massive
         | these models are!
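          | 
          | The weight-only footprint math, for reference (treating 2:4
          | sparsity as a straight 2x saving, which is optimistic, and
          | ignoring activations/overhead):
          | 
          |     params = 175e9   # GPT-3
          |     for fmt, bytes_per_param in [("fp16", 2), ("fp8", 1),
          |                                  ("sparse fp8", 0.5)]:
          |         print(fmt, params * bytes_per_param / 1e9, "GB")
          |     # fp16 ~350 GB, fp8 ~175 GB, sparse fp8 ~88 GB vs 80 GB HBM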
        
           | TOMDM wrote:
           | So, we're about a 25-50% memory increase off of being able to
           | run GPT3 on a single machine?
           | 
           | Sounds doable in a generation or two.
        
             | ml_hardware wrote:
             | Couple points:
             | 
             | 1) NVIDIA will likely release a variant of H100 with 2x
             | memory, so we may not even have to wait a generation. They
             | did this for V100-16GB/32GB and A100-40GB/80GB.
             | 
             | 2) In a generation or two, the SOTA model architecture will
             | change, so it will be hard to predict the memory reqs...
             | even today, for a fixed train+inference budget, it is much
             | better to train Mixture-Of-Experts (MoE) models, and even
             | NVIDIA advertises MoE models on their H100 page.
             | 
             | MoEs are more efficient in compute, but occupy a lot more
             | memory at runtime. To run an MoE with GPT3-like quality,
             | you probably need to occupy a full 8xH100 box, or even
             | several boxes. So your min-inference-hardware has gone up,
             | but your efficiency will be much better (much higher
             | queries/sec than GPT3 on the same system).
             | 
             | So it's complicated!
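              | 
              | The compute-vs-memory tradeoff in one toy calculation (all
              | numbers invented purely for illustration):
              | 
              |     dense_params = 175e9        # GPT-3-like dense model
              |     expert_params = 10e9        # one MoE expert
              |     num_experts, active_per_token = 64, 2
              | 
              |     total = num_experts * expert_params       # held in memory
              |     active = active_per_token * expert_params # used per token
              |     print(total / dense_params)    # ~3.7x the memory
              |     print(active / dense_params)   # ~0.11x the per-token FLOPs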
        
               | TOMDM wrote:
               | Oh I totally expect the size of models to grow along with
               | whatever hardware can provide.
               | 
               | I really do wonder how much more you could squeeze out of
               | a full pod of gen2-H100's, obviously the model size would
               | be ludicrous, but how far are we into the realm of
                | diminishing returns?
               | 
               | Your point about MoE architectures certainly sounds like
               | the more _useful_ deployment, but the research seems to
               | be pushing towards ludicrously large models.
               | 
               | You seem to know a fair amount about the field, is there
               | anything you'd suggest if I wanted to read more into the
               | subject?
        
               | ml_hardware wrote:
               | I agree! The models will definitely keep getting bigger,
               | and MoEs are a part of that trend, sorry if that wasn't
               | clear.
               | 
               | A pod of gen2-H100s might have 256 GPUs with 40 TB of
               | total memory, and could easily run a 10T param model. So
               | I think we are far from diminishing returns on the
               | hardware side :) The model quality also continues to get
               | better at scale.
               | 
               | Re. reading material, I would take a look at DeepSpeed's
               | blog posts (not affiliated btw). That team is super super
               | good at hardware+software optimization for ML. See their
               | post on MoE models here: https://www.microsoft.com/en-
               | us/research/blog/deepspeed-adva...
        
               | algo_trader wrote:
               | Is it difficult/desirable to squeeze/compress an open-
               | sourced 200B parameter model to fit into 40GB?
               | 
               | Are these techniques for specific architectures or can
               | they be made generic ?
        
               | algo_trader wrote:
               | Ah, found some stuff already
               | 
                | https://www.tensorflow.org/model_optimization/guide/pruning
                | 
                | https://www.tensorflow.org/model_optimization/guide/pruning/...
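                | 
                | The core trick in those guides is magnitude pruning; a
                | minimal PyTorch equivalent (a different library than the
                | TF guides above, with a toy layer standing in for a real
                | model) would look something like:
                | 
                |     import torch
                |     import torch.nn.utils.prune as prune
                | 
                |     layer = torch.nn.Linear(512, 512)
                |     # Zero the 50% smallest-magnitude weights, then bake
                |     # the mask into the tensor permanently.
                |     prune.l1_unstructured(layer, "weight", amount=0.5)
                |     prune.remove(layer, "weight")
                |     print((layer.weight == 0).float().mean())  # ~0.5
                | 
                | Note the zeros alone don't shrink memory; you'd still
                | need a sparse storage format (or structured pruning) to
                | actually fit a 200B model into 40GB.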
        
               | ml_hardware wrote:
               | I think it depends what downstream task you're trying to
               | do... DeepMind tried distilling big language models into
               | smaller ones (think 7B -> 1B) but it didn't work too
               | well... it definitely lost a lot of quality (for general
               | language modeling) relative to the original model.
               | 
                | See the paper here, Figure A28:
                | https://kstatic.googleusercontent.com/files/b068c6c0e64d6f93...
               | 
               | But if your downstream task is simple, like sequence
               | classification, then it may be possible to compress the
               | model without losing much quality.
        
         | learndeeply wrote:
         | GPT-3 can't fit in 80GB of RAM.
        
       | savant_penguin wrote:
       | 1 petaflop on a chip?? What is the catch?
        
         | dragontamer wrote:
          | Tensor petaflops are useful in only a very few circumstances,
          | one of which is the highly lucrative deep learning community
          | though.
        
           | cjbgkagh wrote:
           | The main tensor op is a matmul intrinsic which is useful for
           | way more than just deep learning.
           | 
            | Edit: many of these speeds are low precision, which is less
            | useful outside of deep learning, but the higher precision
            | matmul ops in the tensor cores are still very fast and very
            | useful for a wide variety of tasks.
        
             | dragontamer wrote:
             | > but the higher precision matmul ops in the tensor cores
             | are still very fast and very useful for wide variety of
             | tasks.
             | 
              | The FP64 matrix-multiplication is only 60 TFlops, nowhere
              | near the advertised 1000 TFlops. TF32 matrix-multiplication
             | is a poorly named 16-bit operation.
        
               | cjbgkagh wrote:
               | You are indeed correct, I was (kinda) fooled by the
               | marketing and I think that TF32 is deceptively named. I
               | think the tensor cores are being used in this
               | architecture for FP64 and 60 TFlops is still pretty
               | decent.
               | 
               | I'm on Turing architecture so I've never used TF32. I've
               | only used FP32 and FP16 but FP32 isn't supported by these
               | tensor cores.
        
               | bcatanzaro wrote:
               | Well the addition is done in FP32, and it's a 32-bit
               | storage format in memory, so calling it a 16-bit format
               | isn't right either. It's really a hybrid format where
               | everything is 32-bit except multiplication.
               | 
               | Given that it's 32-bit in memory (so all your data
               | structures are 32-bit) and also that in my experience
               | using it is very transparent (I haven't run into any
               | numerical issues compared to full FP32), I think calling
               | it a 32-bit format is a reasonable compromise.
        
               | dragontamer wrote:
               | > Well the addition is done in FP32
               | 
               | Addition is done in 10-bit mantissa. So maybe TF19 might
                | be the better name, since it's a 19-bit format (slightly
               | more than 16-bit BFloats).
               | 
                | Really, it's a BFloat with a 10-bit mantissa instead of a
               | 7-bit mantissa. 10-bit mantissa matches FP16, while the
               | 8-bit exponent matches FP32.
               | 
               | So TF19 probably would have been the best name, but
                | NVidia likes marketing so they call it TF32 instead.
        
               | bcatanzaro wrote:
               | It's a 32-bit format in memory and the additions are done
               | with 32-bits.
        
               | dragontamer wrote:
               | I admit that I don't have the hardware to test your
               | claims. But pretty much all the whitepapers I can find on
               | TF32 explicitly state the 10-bit mantissa, suggesting
               | that this is at best, a 19-bit format. 1-bit sign + 8-bit
               | exponent + 10-bit mantissa.
               | 
               | Yes, the system will read/write the 32-bit value to RAM.
               | But if there's only 10-bits of mantissa in the circuits,
               | you're only going to get 10-bits of precision (best
               | case). The 10-bit mantissa makes sense because these
               | systems have FP16 circuits (1 + 5-bit exponent + 10-bit
               | mantissa) and BFloat16 circuits (1 sign + 8-bit exponent
               | + 7-bit mantissa). So the 8-bit exponent circuit + 10-bit
               | mantissa circuit exists physically on those NVidia cores.
               | 
               | -------
               | 
               | But the 'Tensor Cores' do not support 32-bit (aka: 23-bit
               | mantissa) or higher.
        
               | my123 wrote:
               | Yup, in a semi-related field, NVIDIA has 3xTF32 for cases
               | needing higher precision:
               | https://github.com/NVIDIA/cutlass/discussions/361
        
               | touisteur wrote:
               | There's a paper on getting fp32 accuracy using tf32
               | tensor cores and losing 3x efficiency. Can't wait to try
               | it with cutlass... once I get how to use cutlass, woof.
        
         | peter303 wrote:
          | DP Linpack flops is what counts in supercomputer rankings.
          | Stuck at 0.44 exaflops in 2021.
        
         | aninteger wrote:
         | Given that it's Nvidia, no Linux support. That's the catch.
        
           | jamesfmilne wrote:
           | All the AI software running on these data-centre chips is
           | almost exclusively running on Linux.
           | 
           | I wish people would stop talking rubbish about NVIDIA's Linux
           | support.
        
             | chockchocschoir wrote:
              | That's because nvidia's linux support for consumers is
              | indeed trash, while their creator/business software (e.g.
              | CUDA) is not trash, but you mostly hear consumers trashing
              | nvidia.
        
               | pjmlp wrote:
                | Only FOSS zealots actually, the rest of us are quite ok
                | with their binary drivers.
        
               | oblio wrote:
               | They don't make (relevant) money from consumer hardware
               | on Linux.
        
           | ScaleneTriangle wrote:
           | I thought that only applied to their consumer products.
        
             | jsheard wrote:
             | Their consumer products have Linux support too, the catch
             | is just that the drivers are proprietary binary blobs
        
             | TheRealSteel wrote:
             | Don't they provide Linux drivers for their gaming graphics
             | cards too, just not open source?
        
               | gpm wrote:
               | Yes
        
           | AHTERIX5000 wrote:
           | No Linux support? Guess I'll have to keep using Solaris with
           | my A4000!
        
           | simulate-me wrote:
           | Nvidia provides Linux drivers for their server chips.
        
             | TheRealSteel wrote:
             | Don't they provide them for their consumer cards too, just
             | that it's a closed source binary blob?
        
               | throw0101a wrote:
               | And not just Linux: FreeBSD.
               | 
                | * https://www.nvidia.com/en-us/drivers/unix/freebsd-x64-archiv...
               | 
               | * https://www.freshports.org/x11/nvidia-driver
               | 
               | Heck, _Solaris_ :
               | 
                | * https://www.nvidia.com/en-us/drivers/unix/solaris-display-ar...
               | 
               | * https://www.nvidia.com/en-us/drivers/unix/
        
               | jxy wrote:
               | CUDA and related software/libraries only work on Linux or
               | Windows.
        
               | lostmsu wrote:
                | Some are even Linux-only, like NCCL (AFAIK required to
               | fully use NVLink)
        
           | kcb wrote:
           | That's a strange statement. The vast vast majority of these
           | cards will be in systems running Linux.
        
             | savant_penguin wrote:
             | I for one suffer deeply when I try to install the nvidia
             | drivers on Linux. The website binaries _always_ break my
             | system
             | 
             | Only the ppas from graphics-drivers work properly
             | 
             | My experience on windows is much more automatic and it
             | never breaks anything. But I'd rather pay the price
             | (installing on Linux) to avoid windows at all costs
        
               | riotnrrd wrote:
               | If you installed the drivers using the PPAs, you can't
               | then update using the NVIDIA-provided binaries without
               | doing a very thorough purge, including deleting all
               | dependent installs (CUDNN, CUBLAS, etc.)
               | 
               | I highly recommend sticking with one technique or the
               | other; never intermix them.
        
               | kcb wrote:
               | Yea it's not ideal but really no option is. Built in to
               | Linux would be a problem too given the rate of GPU driver
               | development. Most Linux installs in the corporate world
               | are stuck on the major version of the kernel and system
               | packages they shipped with.
        
         | hughrr wrote:
         | 700 watts so being NVidia it'll blow up in 6 months and you'll
         | need to wait in a queue for 6 months to RMA it because all the
         | miners had bought up the entire supply chain.
        
           | touisteur wrote:
           | Those datacenter/hpc GPUs don't seem to get bought so much by
           | the mining community? I don't have problems sourcing some
            | through the usual channels (HPE, Dell, ...). But you need
            | somewhat deep pockets.
        
         | p1esk wrote:
          | The catch is it's only for TF32 computations (Nvidia's
          | proprietary 19-bit floating point format).
        
           | cjbgkagh wrote:
           | I missed that, to me that makes the '32' in the name
           | misleading.
        
             | p1esk wrote:
             | TF32 = FP32 range + FP16 precision
        
               | cjbgkagh wrote:
               | Why not call it TF19 then.
        
               | 37ef_ced3 wrote:
               | Because it's 32-bits wide in memory.
               | 
               | The effective mantissa is like FP16 but it's padded out
               | to be the same size as FP32.
               | 
               | In other words, there's 1 sign bit, 8 exponent bits, 10
               | mantissa bits that are USED, and 13 mantissa bits that
               | are IGNORED.
               | 
               | 1 + 8 + 10 + 13 = 32
               | 
               | The 13 ignored mantissa bits are part of the memory
               | image: they pad the number out to 32-bit alignment.
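                | 
                | You can mimic that in plain numpy by masking off the 13
                | unused mantissa bits (the hardware rounds rather than
                | truncates, so this is only an approximation):
                | 
                |     import numpy as np
                | 
                |     def tf32_like(x):
                |         # Keep sign + 8 exponent + top 10 mantissa bits.
                |         b = np.asarray(x, np.float32).view(np.uint32)
                |         return (b & np.uint32(0xFFFFE000)).view(np.float32)
                | 
                |     x = np.float32(1.2345678)
                |     print(x, tf32_like(x))  # ~3 decimal digits survive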
        
               | cjbgkagh wrote:
               | But the user never sees that memory right? Doesn't it go
               | in FP32 and come out FP32? I still think it's deceptive
               | marketing.
        
               | bcatanzaro wrote:
               | The user does see 32-bits and all bits are used because
               | all the additions (and other operations besides the
               | multiply in matrix ops) are in FP32. So the bottom bits
               | are populated with useful information.
        
               | p1esk wrote:
               | Because your existing FP32 models should run fine when
               | converted to TF32, so TF32 is equivalent to FP32 as far
               | as DL practitioners are concerned.
        
               | cjbgkagh wrote:
                | There is a lot of redundancy in DL that forgives all
                | manner of sins; I still think it's sneaky.
        
       | fancyfredbot wrote:
       | The Tensor cores will be great for machine learning and the
       | FP32/FP64 fantastic for HPC, but I'd be surprised if there were a
       | lot of applications using both of these features at once. I
       | wonder if there's room for a competitor to come in and sell
        | another huge accelerator with only one of these two features,
        | either at a lower price or with more performance? Perhaps the
       | power density would be too high if everything was in use at once?
        
       ___________________________________________________________________
       (page generated 2022-03-22 23:00 UTC)