[HN Gopher] Nvidia Hopper GPU Architecture and H100 Accelerator
___________________________________________________________________

Nvidia Hopper GPU Architecture and H100 Accelerator

Author : jsheard
Score  : 214 points
Date   : 2022-03-22 15:52 UTC (7 hours ago)

(HTM) web link (www.anandtech.com)
(TXT) w3m dump (www.anandtech.com)

| lmeyerov wrote:
| Seeing the increased bandwidth is super exciting for a lot of
| business analytics cases we get into for IT/security/fraud/finance
| teams: imagine correlating across lots of event data from
| transactions, logs, ... . Every year, it just goes up!
|
| The big welcome surprise for us was the secure virtualization.
| Outside of some limited 24/7 ML teams, we mostly see bursty
| multi-tenant scenarios for achieving cost-effective utilization.
| MIG-style static physical partitioning was interesting -- I can
| imagine cloud providers offering that -- but more dynamic and
| logical isolation, with more of a focus on namespace isolation,
| is more relevant to what we see. Once we get into federated
| learning, and further disintermediation around that, even more
| cool. Imagine bursting on 0.1-100 GPUs every 30s-20min. Amazing
| times!
| algo_trader wrote:
| From the nvidia page,
|
| > 80 billion transistors
| > Hopper H100 .. generational leap
| > 9x at-scale training performance over A100
| > 30x LLM inference throughput
| > Transformer Engine .. speed .. 6x without losing accuracy
|
| So another monster chip - same size as the Apple M1 Max thingy...
|
| I guess it comes down to pricing. The A100 is already
| ridiculously expensive at $10K. They could price this one at $50K
| and it would sell out?
| p1esk wrote:
| You can buy A100s in a server today, a number of integrators
| will happily sell it to you.
| chockchocschoir wrote:
| As someone who's tried for some weeks, it really seems like it's
| out of stock literally everywhere. The demand seems to be a lot
| higher than the supply at the moment, so much so that I'm
| considering buying one myself instead of renting servers with it.
| nqnielsen wrote:
| And even if vendors do say they have it, or can get it, it still
| ended up taking us 4-6 months before systems were online.
| kovek wrote:
| Does it make sense that all the GPUs are bought out? They each
| provide a return for mining in the short term. In the long term,
| they can be used to run A(G)I models, which will be very, very
| useful.
| fennecfoxen wrote:
| This is the GPU the parent is talking about:
| https://www.nvidia.com/en-us/data-center/a100/
| kovek wrote:
| This still makes sense! TPUs are useful for AI, which itself
| will be very, very useful. It's almost like it's the best
| investment. That's why smart players buy them all. Maybe I'm
| going off-topic.
| p1esk wrote:
| Did you check Lambda or Exxact?
| chockchocschoir wrote:
| Neither Lambda Labs nor Exxact Corporation had them available
| last time I checked (last week). Both cited high demand as the
| reason for it being unavailable.
| asciimike wrote:
| Howdy, I run Crusoe Cloud (https://crusoecloud.com/) and we just
| launched an alpha of an A100 and A40 cloud offering -- we've got
| capacity at a reasonable price!
|
| If you're interested in giving us a shot, feel free to shoot me
| an email at mike at crusoecloud dot com.
| sabalaba wrote:
| We (Lambda) have all of the different NVIDIA GPUs in stock -- can
| you send a message to sales@lambdalabs.com and check in again
| with your requirements?
| We're seeing a lot more stock these days as the supply chain
| crisis of 2021 comes to an end.
| Uehreka wrote:
| Most people who use one of these will be doing so through an EC2
| VM (or equivalent). Given that cloud platforms can spread load,
| keep these GPUs churning close to 24/7, and more easily
| predict/amortize costs, they'll probably buy the amount that they
| know they need, and Nvidia probably has some approximately
| correct idea of what that number is.
| spoonjim wrote:
| Off topic, but I can't stand when corporations use the names of
| actual people who never gave them permission to do so for their
| marketing. For someone like Shakespeare or Cicero I'm OK with it,
| but Grace Hopper was alive in my lifetime, and even Tesla feels a
| little weird. What gives you the right to use that person's
| reputation to shill your product?
| paxys wrote:
| > What gives you the right to use that person's reputation to
| shill your product?
|
| Practically speaking you have the right to do anything unless
| someone complains about it. A lot of popular figures, even those
| long dead, have estates and organizations that manage their
| likeness and other related copyright and IP. IDK what the
| situation is in this case, but Nvidia may very well have paid
| for the name.
| erosenbe0 wrote:
| The situation is that various Australian companies (think
| kangaroo) and DISH Network already have Hopper product lines and
| Nvidia didn't care about getting into a legal kerfuffle, so it
| used the name anyway. As to whether Hopper's estate was
| consulted, I don't know.
| cosmiccatnap wrote:
| spoonjim wrote:
| I don't think my kids have any more right to use my name than a
| corporation does, unless I specifically grant them that right
| (like Walt Disney did by naming it the Walt Disney Company).
| Another sickening one is the Ed Lee Club in SF, which endorses
| political candidates under the name of a much-loved dead SF
| mayor.
| paxys wrote:
| Your kids have the right to everything you own (_including_
| your name) by default unless you take steps to change that, say
| using a will or estate.
| spoonjim wrote:
| Yes, I know; I'm saying that it should not be that way. Rights
| to your likeness should end at your death unless you
| specifically write down otherwise.
| oblio wrote:
| Do you have kids?
| spoonjim wrote:
| Yes.
| oblio wrote:
| And you think they shouldn't have that right because of social
| concerns like accumulation of wealth?
| erosenbe0 wrote:
| Theranos' "Edison" machine enters the chat...
| foolfoolz wrote:
| what gives you the right to own a name ever? especially once
| you're dead?
| eigenvalue wrote:
| I generally agree with you, but in this case I suspect Grace
| Hopper would be honored by it and also impressed with the
| engineering here. It's not like they slapped her name on a soda
| can or something.
| gautamcgoel wrote:
| This chip is capable of 2000 INT8 Tensor TOPS, or 1000 FP16
| Tensor TFLOPS. In other words, it is capable of performing over
| a quadrillion operations per second. Absolutely insane... I still
| have fond memories of installing my first Nvidia gaming GPU, with
| just 512MB of RAM, probably capable of much less than a single
| teraflop of compute.
| Rafuino wrote:
| Good lord, 700W TDP!
| Symmetry wrote:
| Nvidia and AMD datacenter GPUs continue to diverge, focusing on
| deep learning and traditional scientific computing respectively.
| orangebeet wrote:
| This would be cool if they had decent drivers for Linux.
| why_only_15 wrote:
| They do have good drivers for Linux for the things this chip is
| intended to be used for (research, ML).
| johndough wrote:
| If you haven't had any issues with NVIDIA Linux drivers, you can
| count yourself extremely lucky. In the past, I had a 50/50
| chance of boot failure after installing CUDA drivers across 12
| different systems. Mainline Ubuntu drivers are somewhat stable,
| but installing a specific CUDA version from the official NVIDIA
| repos rarely works on the first try. Switching from Tensorflow
| to PyTorch has helped a lot though, as Tensorflow was much more
| picky about the installed CUDA version.
|
| Obligatory Linus Torvalds on NVIDIA:
| https://www.youtube.com/watch?v=_36yNWw_07g
| paxys wrote:
| I can assure you systems that take advantage of this chip for
| scientific/ML workloads aren't running Windows.
| flatiron wrote:
| they may have edited their comment but they were commenting on
| the lack of quality of nvidia's Linux drivers (which I agree
| with, but only on a consumer level; never used nvidia in a
| server)
| obeliskora wrote:
| Anyone find details about the DPX instructions for dynamic
| programming?
| pjmlp wrote:
| There are some deep-dive sessions at GTC that will probably go
| into them.
| trollied wrote:
| Looking forward to seeing Doom run on this.
| [deleted]
| minimaxir wrote:
| So the product naming for Nvidia's server GPUs by compute power
| now goes:
|
| P100 -> V100 -> A100 -> H100
|
| This is not confusing at all.
| torginus wrote:
| I think this is less of an issue since these GPUs are not meant
| for the everyman, so basically the handful of server integrators
| can figure this out by themselves.
|
| And for your typical dev - they'll interact with the GPU through
| a cloud provider, where they can easily know that a G5 instance
| is newer than a G4 one.
| gtirloni wrote:
| Isn't it based on the architecture name?
|
| https://en.wikipedia.org/wiki/Category:Nvidia_microarchitect...
| jsheard wrote:
| Yeah it is, but unless you've memorised the history of Nvidia
| architectures it doesn't tell you which is the newer one:
|
| Fermi -> Kepler -> Maxwell -> Pascal -> Volta (HPC only) ->
| Turing -> Ampere -> Hopper (HPC only?) -> Lovelace?
| modeless wrote:
| Someone should make a game like "Pokemon or Big Data" [1] except
| you have to choose which of two GPU names is faster. Even the
| consumer naming is bonkers, so there's plenty of material there!
|
| [1] http://pixelastic.github.io/pokemonorbigdata/
| ksec wrote:
| Isn't this the norm? AMD only fairly recently started the trend
| of naming the uArch with numbers, as in Zen 4 or RDNA 3. With
| Intel it is Haswell > Broadwell > ..... whatever Lake.
| kergonath wrote:
| Intel is using generation numbers in their marketing materials.
| In the technical-oriented slide decks you'd see things like
| "42nd generation, formerly named Bullshit Creek", but they are
| not supposed to use that for sales. And then actual part names
| like i9-42045K.
|
| We keep using code names in discussions because the actual names
| are ass backwards and not very descriptive.
| jsheard wrote:
| Usually the architecture name isn't the only distinguishing
| feature of the product name. You don't need to remember Intel
| codenames because a Core 12700 is obviously newer than a Core
| 11700.
|
| Nvidia's accelerators are just called <Architecture Letter>100
| every time, so if you don't remember the order of the letters
| it's not obvious.
|
| They could have just named them P100, V200, A300 and H400
| instead.
| 867-5309 wrote:
| > you don't need to remember Intel codenames because a Core
| 12700 is obviously newer than a Core 11700
|
| J3710, 7th Gen; J3060, 8th Gen
| J4205, 8th Gen; J4125, 9th Gen
| i3-5005U, 5th Gen; N5095, 10th Gen
| i7-3770, 3rd Gen; 3865U, 7th Gen; N3060, 8th Gen
| paulmd wrote:
| And an AMD 5700U is older than a 5400U as well. A 3400G is older
| than a 3100X. A 3300X isn't really distinctive from a 3100X;
| both are quad-core configurations (but different CCD/cache
| configurations, which of course the name doesn't really disclose
| to the consumer). It happens; naming is a complex topic and
| there are a lot of dimensions to a product.
|
| In general, complaining about naming is peak bikeshedding for
| the tech-aware crowd. There are multiple naming schemes, all of
| them are reasonable, and everyone hates some of them for
| completely legitimate reasons (but different for every person).
| And the resulting bikeshedding is exactly as you'd expect with
| that.
|
| The underlying problem is that products have multiple dimensions
| of interest - you've got architecture, big vs small core, core
| count, TDP, clockrate/binning, cache configuration/CCD
| configuration, graphics configuration, etc. If you sort them by
| generation, then an older but higher-spec part can beat a newer
| but lower-spec one. If you sort by date, then refreshes break
| the scheme. If you split things out into series (m7 vs i7) to
| express TDP, then some people don't like that there's a bunch of
| different series. If you put them into the same naming scheme,
| then some people don't like that a 5700U is slower than a 5700X.
| If you try to express all the variables in a single name, you
| end up with a name like "i7 1185G7" where it's incomprehensible
| if you don't understand what each of the parts of the name
| means.
|
| (as a power user, I personally think the Ice Lake/Tiger Lake
| naming is the best of the bunch; it expresses everything you
| need to know: architecture, core count, power, binning,
| graphics. But then big.LITTLE had to go and mess everything up!
| And other people still hated it because it was more complex.)
|
| There are certain ones like AMD's 5000 series or the Intel
| 10th-gen (Comet Lake 10xxxU) that are just really ghastly
| because they're deliberately trying to mix-and-match to confuse
| the consumer (to sell older stuff as being new), but in general
| when people complain about "not understanding all those Lakes
| and Coves" it's usually just because they aren't interested in
| the brand/product and don't want to bother learning the names,
| and they will eagerly rattle off a list of painters or cities
| that AMD uses as its codenames.
|
| Like, again, to reiterate here, I have literally never seen
| anyone raise AMD using painter names as being "opaque to the
| consumer" in the same way that people repeatedly get upset about
| Lakes. And it's the exact same thing.
| It's people who know the AMD brand and don't know the Intel
| brand and think that's some kind of a problem with the branding,
| as opposed to a reflection of their own personal knowledge.
|
| I fully expect that AMD will release 7000-series desktop
| processors this year or early next year, and exactly 0 people
| are going to think that a 7600 being newer than a 7702 is
| confusing in the way that we get all these aggrieved posts about
| Intel and NVIDIA. Yes, 7600 and 7702 are different product
| lines, and that's the exact same as your "but i7-3770 and N3060
| are different!" example. It's simply not that confusing; it
| takes less time to learn than to make a single indignant post on
| social media about it.
|
| Similarly, the NVIDIA practice of using inventors/compsci people
| is not particularly confusing either. Basically the same as AMD
| with the painters/cities.
|
| It's just not that interesting, and it's not worth all the
| bikeshedding that gets devoted to it.
|
| </soapbox>
|
| Anyway, your example is all messed up though. J3710 and J3060
| are both the same gen (Braswell), launched at the same time (Q1
| 2016), so that example is entirely wrong. J4125 vs J4205 is an
| older but higher-specced processor vs a newer but lower-spec
| one; it's an 8th-gen Pentium vs a 9th-gen Celeron, like a 3100X
| vs a 2700X (zomg, 3100X is the bigger number but actually
| slower!). And the J4125 and J4205 are refreshes of the same
| architecture with legitimately very similar performance classes.
| i3 and Atom, or i7 and Atom, are completely different product
| lines and the naming is not similar at all there, apart from
| both having 3s as their first number (not even first character,
| that is different too; they just happen to share the first
| number somewhere in the name).
|
| Again, like with the Tiger Lake 11xxGxx naming, the characters
| and positions in the name have meaning. You can come up with
| better examples than that even within the Intel lineup. Just
| literally picking 3770 and J3060 as being "similar" because they
| both have 3s in them.
|
| The one I would legitimately agree on is that the Atom lineup is
| kind of a mess. Braswell, Apollo Lake, Gemini Lake, and Gemini
| Lake Refresh are all crammed into the "3000/4000" series space,
| and there is no "generational number" in that scheme either.
| Braswell is all 3000 series and Gemini Lake/Gemini Lake Refresh
| is all 4000 series, but you've got Apollo Lake sitting in the
| middle with both 3000- and 4000-series chips. And a J3455
| (Apollo Lake, 1.5 GHz) is legitimately a better (or at least
| equal) processor to a J3710 (Braswell, 1.6 GHz). Like 5700U vs
| 5800U, there are some legitimate architectural differences
| hidden behind an opaque number there (and on the Intel side it's
| graphics - Gemini Lake/Gemini Lake Refresh have a much better
| video block).
|
| (And that's the problem with "performance rating" approaches:
| even if a 3710 and a 3455 are similar in performance, there are
| still other differences between them. Also, PR naming instantly
| turns into gamesmanship - what benchmark, what conditions, what
| TDP, what level of threading? Is an Intel 37000 the same as an
| AMD 37000?)
| 867-5309 wrote:
| yes, it's a bit of a shitshow, as mutually evidenced. unless
| consumers brush up on such intricate details (most do not), they
| will inevitably fall into traps such as "i7 is better than i3"
| (e.g. an i7-2600 being outperformed by an i3-10100) and "quad
| core is better than dual core".
| marketing is becoming more focused on generations now, which is
| a prudent move: "10th Gen is better than 2nd Gen". but it will
| be at least a decade before the shitshow is swept away
| mywittyname wrote:
| Intel was using Core i[3,5,7] names for multiple generations. A
| Core i7 could be faster or slower than a Core i5 depending on
| which generation each existed in.
|
| It is nice when products have a naming scheme where the natural
| ordering of the names maps to performance.
| numpad0 wrote:
| We need a canonical, chronologically monotonic, marketing-independent
| ID scheme. Marketing people always try to disrupt naming schemes,
| and that's the real problem.
| bee_rider wrote:
| I don't really mind the incomprehensible letters -- looking up
| the generation is pretty easy, and these are data-center focused
| products... getting the name right is somebody's job and the
| easiest possible thing.
|
| However, is the number superfluous at this point?
| neogodless wrote:
| You just blew my mind. That did not occur to me, but it is
| obvious in retrospect.
| Uehreka wrote:
| Sure, but the way Nvidia names generations is far from obvious.
| It seems to be "names of famous scientists, progressing in
| alphabetical order; we skip some letters if we can't find a
| well-known scientist with a matching last name and are excited
| about a scientist 2 letters from now; we wrap around to the
| beginning of the alphabet when we get to the end; and we just
| skipped from A to H, so expect another wraparound in the next
| 5-10 years."
| [deleted]
| ipsin wrote:
| I mean, I'd already internalized P100 < V100 < A100 as a Colab
| user.
|
| Schedule me on an H100 and I promise I won't mind the
| "confusing" naming.
| jjoonathan wrote:
| Also, the naming drives devs towards the architecture papers,
| which are important if you want to get within sight of
| theoretical perf. When NVidia changes the letter, it's like
| saying "hey, pay attention, at least skim the new whitepaper."
| Over the last decade, I feel like this convention has respected
| my time, so in turn it has earned my own respect. I'll read the
| Hopper whitepaper tonight, or whenever it pops up.
| nynx wrote:
| Sounds like we need some new training methods. If training could
| take place locally and asynchronously instead of globally
| through backpropagation, the amount of energy could probably be
| reduced significantly.
| hwers wrote:
| Trying to reduce energy consumption for ML like this is so
| silly.
| mlyle wrote:
| Training costs are growing exponentially.
|
| The degree to which energy and capital costs can be optimized
| will determine how large they can go.
| thfuran wrote:
| Why?
| oblio wrote:
| Reducing energy consumption for computation is not silly.
|
| We're at a point where we're turning into a computation-driven
| society, and computation is becoming a globally relevant power
| consumption concern.
|
| > global data centers likely consumed around 205 terawatt-hours
| (TWh) in 2018, or 1 percent of global electricity use
|
| And that's just data centers; if you add all client devices you
| probably double that.
|
| Plus that number will only continue to grow.
| moinnadeem wrote:
| Disclosure: I work at MosaicML
|
| Yeah, I strongly agree. While Nvidia is working on better
| hardware (and they're doing a great job at it!), we believe that
| better training methods should be a big source of efficiency.
| We've released a new PyTorch library for efficient training at
| http://github.com/mosaicml/composer.
|
| Our combinations of methods can train CV models ~4x faster to
| the same accuracy, and NLP models ~2x faster to the same
| perplexity/GLUE score!
| jwuphysics wrote:
| I've been seeing a lot more about MosaicML on my Twitter feed.
| Just wanted to ask -- how are your priorities different from,
| say, fast.ai's?
| zozbot234 wrote:
| The principled way of doing this is via ensemble learning,
| combining the predictions of multiple separately trained models.
| But perhaps there are ways of improving that by including
| "global" training as well, where the "separate" models are
| allowed to interact while limiting overall training costs.
| captainbland wrote:
| Those specs imply some pretty crazy architectural efficiency
| gains: massive theoretical compute performance per transistor
| compared to Ampere. It's all marketing numbers until the
| benchmarks are out, though.
|
| Edit: big TDP, though.
| ksec wrote:
| And maybe taking this opportunity to ask: what happened to
| Nvidia's leak? The hacker hasn't made any more news, and Nvidia
| hasn't provided an update either.
| [deleted]
| Melatonic wrote:
| What was the leak?
| ksec wrote:
| https://news.ycombinator.com/item?id=30590752
| TomVDB wrote:
| In the keynote, Jensen made a sly remark about how they
| themselves could benefit a lot from one of their cyberthreat AI
| solutions.
| quotemstr wrote:
| The simplest explanation is that Nvidia just paid up.
| pixel_fcker wrote:
| The new block cluster shared memory and synchronization stuff
| looks really, really nice.
| [deleted]
| throw0101a wrote:
| ortusdux wrote:
| 80 billion transistors boggles my mind. How many molecules are
| there per transistor?
| martini333 wrote:
| wat
| Symmetry wrote:
| It's a crystal, so just one molecule for all the transistors. In
| terms of atoms, each transistor is something on the order of a
| 30 nm cube, and with each silicon atom being ~0.2 nm in diameter
| that works out to something like 3 million atoms, give or take
| an order of magnitude or two.
| ortusdux wrote:
| That makes sense. My mistake, I did mean atoms, not molecules.
| Wolfram Alpha estimates 1.35 million Si atoms, so well within 1
| order of magnitude.
|
| https://www.wolframalpha.com/input?i=30%5E3+cubic+nanometers...
| virtuallynathan wrote:
| How does a DGX Pod w/ the new 3.2Tbps-per-machine NVLink switch
| compare to Tesla Dojo?
| virtuallynathan wrote:
| Tesla Dojo training tile (25x D1): 565 TF FP32 / 9 PF BF16/CFP8
| / 11GB SRAM / 10kW
|
| NVIDIA DGX H100 (8x H100): 480 TF FP32 / 8 PF+ TF16 / 16 PF INT8
| / 640GB HBM3 / 10kW
|
| Dojo off-chip BW: 16 TB/s / 36 TB/s off-tile
|
| H100 off-chip BW: 3.9 TB/s / 400 GB/s off-DGX
| TomVDB wrote:
| When you take software support into account, probably very
| favorably.
|
| I don't know anything about the state of Dojo, but Tesla was
| very hand-wavy about their software stack during their
| presentation. And running AI algorithms efficiently on a piece
| of hardware is one of those things that many HW vendors have a
| hard time getting right.
| waynecochran wrote:
| This seems fast...
|
| TF32 (tensor core) ... 1,000 TFLOPS
| FP64/FP32 ............ 60 TFLOPS
|
| I am more interested in the 144-core Grace CPU Superchip. nVidia
| is getting into the CPU business...
| [deleted]
| macrolocal wrote:
| 50% sparsity and rated at 700W. The new DGX is 10kW!
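For a rough sense of where that 10kW figure comes from, and what it
implies for the rack-level hosting discussed in the reply below, here
is a back-of-envelope sketch in Python. The 700W TDP and ~10kW DGX
figures are from the article and the thread; the host-overhead split
and the treatment of the 40kW rack limit are assumptions:

    # Back-of-envelope power budget for a DGX H100-class box.
    GPU_TDP_W = 700            # per-H100 TDP quoted in the article
    GPUS_PER_BOX = 8
    BOX_TOTAL_W = 10_000       # DGX figure quoted above

    gpu_w = GPU_TDP_W * GPUS_PER_BOX   # 5,600 W for the GPUs alone
    other_w = BOX_TOTAL_W - gpu_w      # ~4,400 W assumed left for
                                       # CPUs, NVSwitches, NICs,
                                       # fans and PSU losses

    RACK_LIMIT_W = 40_000      # air-cooled rack limit cited below
    boxes_per_rack = RACK_LIMIT_W // BOX_TOTAL_W   # -> 4 boxes

    print(gpu_w, other_w, boxes_per_rack)   # 5600 4400 4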
| wmwmwm wrote:
| I was recently researching how you'd host systems like this in a
| datacentre and was blown away to find out that you can cool 40kW
| in a single air-cooled rack - this might be old news for many,
| but it was 2x or 3x what I expected! Glad I'm not paying the
| electricity bill :)
| jmole wrote:
| Here's what a propane heater of similar output looks like:
| https://www.amazon.com/Dura-Heat-Propane-Forced-Heater/dp/B0...
| HelloNurse wrote:
| Most of the propane heater is a fan in a tube; the flame is
| probably quite a bit smaller than a CPU package.
| baq wrote:
| I've got an 8kW wood stove and that thing gets rather hot to
| touch - as in, you will get a blister... 40kW is a small city
| car's worth of power.
| cjbgkagh wrote:
| I think the 1 PFLOPS figure for TF32 is with sparsity, which
| should be called out in the name. Maybe 'TFS32'? I mainly use
| dense FP16, so the 1 PFLOPS for that looks pretty good.
| lostmsu wrote:
| Asked elsewhere, but why FP16 as opposed to BF16?
| cjbgkagh wrote:
| I'm using older Turing GPUs; BF16 would require Ampere. The
| weights in my models tend to be normalized, so the fraction
| would be more important than the exponent and I would probably
| still use FP16. I would need to test it though.
| Melatonic wrote:
| Same - plus it's SUPER
| rafaelero wrote:
| "Combined with the additional memory on H100 and the faster
| NVLink 4 I/O, and NVIDIA claims that a large cluster of GPUs can
| train a transformer up to 9x faster, which would bring down
| training times on today's largest models down to a more
| reasonable period of time, and make even larger models more
| practical to tackle."
|
| Looking good.
| [deleted]
| ml_hardware wrote:
| The 9x speedup is a bit inflated... it's measured at a reference
| point of ~8k GPUs, on a workload that the A100 cluster is
| particularly bad at.
|
| When measured at smaller numbers of GPUs, which are more
| realistic, the speedup is somewhere between 3.5x and 6x. See the
| GTC keynote video at 38:50: https://youtu.be/39ubNuxnrK8?t=2330
|
| Based on hardware specs alone, I think that training
| transformers with FP8 on H100 systems vs. FP16 on A100 systems
| should only be 3-4x faster. Definitely looking forward to
| external benchmarks over the coming months...
| Melatonic wrote:
| We have needed wide use of NVLink or something like it for a
| long time now... here's to hoping mobo manufacturers actually
| implement it widely!
| learndeeply wrote:
| The open-standard version of NVLink is CXL. It's available in
| the latest-gen CPUs.
| Melatonic wrote:
| Interesting - I did not know that. Don't we also need
| motherboard manufacturers, though, to more widely implement the
| hardware required? It has been a while since I have read about
| NVLink, to be fair.
| komuher wrote:
| 1000 TFLOPS, so I can run my GPT-3 in under 100 ms locally :D
|
| If 1000 TFLOPS is possible to do at inference time then I'm
| speechless
| edf13 wrote:
| At what cost, I wonder?
| komuher wrote:
| I would assume about 30-40k USD, but we'll see
| Melatonic wrote:
| Huge recurrent licensing costs are the killer with these
| ml_hardware wrote:
| At inference time it will be possible to do 4000 TFLOPS using
| sparse FP8 :)
|
| But keep in mind the model won't fit on a single H100 (80GB)
| because it's 175B params, and ~90GB even with sparse FP8 model
| weights, and then more is needed for live activation memory. So
| you'll still want at least 2+ H100s to run inference, and more
| realistically you would rent an 8xH100 cloud instance.
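A back-of-envelope version of those numbers (a hedged sketch in
Python; the 2:4 structured-sparsity storage model and the 1.06
metadata-overhead factor are assumptions of mine, not figures from
the comment):

    import math

    # Rough GPT-3-scale weight-memory estimate.
    params = 175e9                       # GPT-3 parameter count
    fp16_gb = params * 2 / 1e9           # ~350 GB at FP16
    fp8_gb = params * 1 / 1e9            # ~175 GB at dense FP8
    sparse_fp8_gb = fp8_gb * 0.5 * 1.06  # ~93 GB assuming 2:4
    # structured sparsity: half the values stored, plus a small
    # assumed overhead for sparsity-index metadata (the 1.06)

    H100_MEM_GB = 80
    min_gpus = math.ceil(sparse_fp8_gb / H100_MEM_GB)  # -> 2,
    # before counting activations, hence "at least 2+ H100s" above
    print(round(sparse_fp8_gb), min_gpus)              # 93 2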
|
| But yeah, the latency will be insanely low given how massive
| these models are!
| TOMDM wrote:
| So, we're about a 25-50% memory increase away from being able to
| run GPT-3 on a single machine?
|
| Sounds doable in a generation or two.
| ml_hardware wrote:
| Couple points:
|
| 1) NVIDIA will likely release a variant of H100 with 2x memory,
| so we may not even have to wait a generation. They did this for
| V100-16GB/32GB and A100-40GB/80GB.
|
| 2) In a generation or two, the SOTA model architecture will
| change, so it will be hard to predict the memory reqs... even
| today, for a fixed train+inference budget, it is much better to
| train Mixture-of-Experts (MoE) models, and even NVIDIA
| advertises MoE models on their H100 page.
|
| MoEs are more efficient in compute, but occupy a lot more memory
| at runtime. To run an MoE with GPT-3-like quality, you probably
| need to occupy a full 8xH100 box, or even several boxes. So your
| min-inference-hardware has gone up, but your efficiency will be
| much better (much higher queries/sec than GPT-3 on the same
| system).
|
| So it's complicated!
| TOMDM wrote:
| Oh, I totally expect the size of models to grow along with
| whatever hardware can provide.
|
| I really do wonder how much more you could squeeze out of a full
| pod of gen2-H100s; obviously the model size would be ludicrous,
| but how far are we into the realm of diminishing returns?
|
| Your point about MoE architectures certainly sounds like the
| more _useful_ deployment, but the research seems to be pushing
| towards ludicrously large models.
|
| You seem to know a fair amount about the field; is there
| anything you'd suggest if I wanted to read more into the
| subject?
| ml_hardware wrote:
| I agree! The models will definitely keep getting bigger, and
| MoEs are a part of that trend, sorry if that wasn't clear.
|
| A pod of gen2-H100s might have 256 GPUs with 40 TB of total
| memory, and could easily run a 10T-param model. So I think we
| are far from diminishing returns on the hardware side :) The
| model quality also continues to get better at scale.
|
| Re. reading material, I would take a look at DeepSpeed's blog
| posts (not affiliated btw). That team is super, super good at
| hardware+software optimization for ML. See their post on MoE
| models here:
| https://www.microsoft.com/en-us/research/blog/deepspeed-adva...
| algo_trader wrote:
| Is it difficult/desirable to squeeze/compress an open-sourced
| 200B parameter model to fit into 40GB?
|
| Are these techniques for specific architectures or can they be
| made generic?
| algo_trader wrote:
| Ah, found some stuff already
|
| https://www.tensorflow.org/model_optimization/guide/pruning
|
| https://www.tensorflow.org/model_optimization/guide/pruning/...
| ml_hardware wrote:
| I think it depends what downstream task you're trying to do...
| DeepMind tried distilling big language models into smaller ones
| (think 7B -> 1B) but it didn't work too well... it definitely
| lost a lot of quality (for general language modeling) relative
| to the original model.
|
| See the paper here, Figure A28:
| https://kstatic.googleusercontent.com/files/b068c6c0e64d6f93...
|
| But if your downstream task is simple, like sequence
| classification, then it may be possible to compress the model
| without losing much quality.
| learndeeply wrote:
| GPT-3 can't fit in 80GB of RAM.
| savant_penguin wrote:
| 1 petaflop on a chip?? What is the catch?
| dragontamer wrote:
| Tensor petaflops are useful in only very few circumstances,
| one of which is the highly lucrative deep learning community.
| cjbgkagh wrote:
| The main tensor op is a matmul intrinsic, which is useful for
| way more than just deep learning.
|
| Edit: many of these speeds are low precision, which is less
| useful outside of deep learning, but the higher-precision matmul
| ops in the tensor cores are still very fast and very useful for
| a wide variety of tasks.
| dragontamer wrote:
| > but the higher precision matmul ops in the tensor cores are
| still very fast and very useful for wide variety of tasks.
|
| The FP64 matrix multiplication is only 60 TFlops, nowhere near
| the advertised 1000 TFlops. TF32 matrix multiplication is a
| poorly named 16-bit operation.
| cjbgkagh wrote:
| You are indeed correct; I was (kinda) fooled by the marketing,
| and I think that TF32 is deceptively named. I think the tensor
| cores are being used in this architecture for FP64, and 60
| TFlops is still pretty decent.
|
| I'm on the Turing architecture, so I've never used TF32. I've
| only used FP32 and FP16, but FP32 isn't supported by these
| tensor cores.
| bcatanzaro wrote:
| Well, the addition is done in FP32, and it's a 32-bit storage
| format in memory, so calling it a 16-bit format isn't right
| either. It's really a hybrid format where everything is 32-bit
| except multiplication.
|
| Given that it's 32-bit in memory (so all your data structures
| are 32-bit) and also that in my experience using it is very
| transparent (I haven't run into any numerical issues compared to
| full FP32), I think calling it a 32-bit format is a reasonable
| compromise.
| dragontamer wrote:
| > Well the addition is done in FP32
|
| Addition is done with a 10-bit mantissa. So maybe TF19 might be
| the better name, since it's a 19-bit format (slightly more than
| 16-bit BFloats).
|
| Really, it's a BFloat with a 10-bit mantissa instead of a 7-bit
| mantissa. The 10-bit mantissa matches FP16, while the 8-bit
| exponent matches FP32.
|
| So TF19 probably would have been the best name, but NVidia likes
| marketing so they call it TF32 instead.
| bcatanzaro wrote:
| It's a 32-bit format in memory and the additions are done with
| 32 bits.
| dragontamer wrote:
| I admit that I don't have the hardware to test your claims. But
| pretty much all the whitepapers I can find on TF32 explicitly
| state the 10-bit mantissa, suggesting that this is, at best, a
| 19-bit format: 1 sign bit + 8-bit exponent + 10-bit mantissa.
|
| Yes, the system will read/write the 32-bit value to RAM. But if
| there are only 10 bits of mantissa in the circuits, you're only
| going to get 10 bits of precision (best case). The 10-bit
| mantissa makes sense because these systems have FP16 circuits (1
| sign + 5-bit exponent + 10-bit mantissa) and BFloat16 circuits
| (1 sign + 8-bit exponent + 7-bit mantissa). So the 8-bit
| exponent circuit + 10-bit mantissa circuit exists physically on
| those NVidia cores.
|
| But the 'Tensor Cores' do not support 32-bit (aka 23-bit
| mantissa) or higher.
| my123 wrote:
| Yup, and in a semi-related field, NVIDIA has 3xTF32 for cases
| needing higher precision:
| https://github.com/NVIDIA/cutlass/discussions/361
| touisteur wrote:
| There's a paper on getting FP32 accuracy using TF32 tensor cores
| while losing 3x efficiency. Can't wait to try it with cutlass...
| once I figure out how to use cutlass, woof.
| peter303 wrote:
| DP Linpack flops are what counts in supercomputer rankings.
| Stuck at .44 exaflops in 2021.
| aninteger wrote:
| Given that it's Nvidia, no Linux support. That's the catch.
| jamesfmilne wrote:
| All the AI software running on these data-centre chips is almost
| exclusively running on Linux.
|
| I wish people would stop talking rubbish about NVIDIA's Linux
| support.
| chockchocschoir wrote:
| That's because Nvidia's Linux support for consumers is indeed
| trash, while their creators/business/creatives software (e.g.
| CUDA) is not trash, but you mostly hear consumers trashing
| Nvidia.
| pjmlp wrote:
| Only FOSS zealots, actually; the rest of us are quite OK with
| their binary drivers.
| oblio wrote:
| They don't make (relevant) money from consumer hardware on
| Linux.
| ScaleneTriangle wrote:
| I thought that only applied to their consumer products.
| jsheard wrote:
| Their consumer products have Linux support too; the catch is
| just that the drivers are proprietary binary blobs.
| TheRealSteel wrote:
| Don't they provide Linux drivers for their gaming graphics cards
| too, just not open source?
| gpm wrote:
| Yes
| AHTERIX5000 wrote:
| No Linux support? Guess I'll have to keep using Solaris with my
| A4000!
| simulate-me wrote:
| Nvidia provides Linux drivers for their server chips.
| TheRealSteel wrote:
| Don't they provide them for their consumer cards too, just that
| it's a closed-source binary blob?
| throw0101a wrote:
| And not just Linux: FreeBSD.
|
| * https://www.nvidia.com/en-us/drivers/unix/freebsd-x64-archiv...
|
| * https://www.freshports.org/x11/nvidia-driver
|
| Heck, _Solaris_:
|
| * https://www.nvidia.com/en-us/drivers/unix/solaris-display-ar...
|
| * https://www.nvidia.com/en-us/drivers/unix/
| jxy wrote:
| CUDA and related software/libraries only work on Linux or
| Windows.
| lostmsu wrote:
| Some are even Linux-only, like NCCL (AFAIK required to fully use
| NVLink).
| kcb wrote:
| That's a strange statement. The vast, vast majority of these
| cards will be in systems running Linux.
| savant_penguin wrote:
| I for one suffer deeply when I try to install the Nvidia drivers
| on Linux. The website binaries _always_ break my system.
|
| Only the PPAs from graphics-drivers work properly.
|
| My experience on Windows is much more automatic and it never
| breaks anything. But I'd rather pay the price (installing on
| Linux) to avoid Windows at all costs.
| riotnrrd wrote:
| If you installed the drivers using the PPAs, you can't then
| update using the NVIDIA-provided binaries without doing a very
| thorough purge, including deleting all dependent installs
| (cuDNN, cuBLAS, etc.).
|
| I highly recommend sticking with one technique or the other;
| never intermix them.
| kcb wrote:
| Yeah, it's not ideal, but really no option is. Built into Linux
| would be a problem too, given the rate of GPU driver
| development. Most Linux installs in the corporate world are
| stuck on the major version of the kernel and system packages
| they shipped with.
| hughrr wrote:
| 700 watts, so being NVidia it'll blow up in 6 months and you'll
| need to wait in a queue for 6 months to RMA it because all the
| miners have bought up the entire supply chain.
| touisteur wrote:
| Those datacenter/HPC GPUs don't seem to get bought so much by
| the mining community? I don't have problems sourcing some
| through the usual channels (HPE, Dell, ...). But you need
| somewhat deep pockets.
| p1esk wrote:
| The catch is that it's only for TF32 computations (Nvidia's
| proprietary 19-bit floating point format).
| cjbgkagh wrote:
| I missed that; to me that makes the '32' in the name misleading.
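For reference, the format being debated here stores 32 bits in
memory, but only 19 of them carry information: 1 sign bit, 8
exponent bits, and the top 10 mantissa bits of the FP32 layout, with
the low 13 mantissa bits ignored. A minimal Python sketch of that
effective precision (truncation is used here for illustration; real
tensor cores round rather than truncate):

    import struct

    def tf32_truncate(x: float) -> float:
        """Emulate TF32's effective precision by zeroing the low
        13 mantissa bits of an FP32 value (an approximation)."""
        (bits,) = struct.unpack("<I", struct.pack("<f", x))
        bits &= ~0x1FFF   # keep sign(1) + exponent(8) + mantissa(10)
        (y,) = struct.unpack("<f", struct.pack("<I", bits))
        return y

    print(tf32_truncate(1.0000001))  # -> 1.0: roughly 3 significant
    # decimal digits survive, despite the 32-bit storage format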
| p1esk wrote:
| TF32 = FP32 range + FP16 precision
| cjbgkagh wrote:
| Why not call it TF19, then?
| 37ef_ced3 wrote:
| Because it's 32 bits wide in memory.
|
| The effective mantissa is like FP16's, but it's padded out to be
| the same size as FP32.
|
| In other words, there's 1 sign bit, 8 exponent bits, 10 mantissa
| bits that are USED, and 13 mantissa bits that are IGNORED.
|
| 1 + 8 + 10 + 13 = 32
|
| The 13 ignored mantissa bits are part of the memory image: they
| pad the number out to 32-bit alignment.
| cjbgkagh wrote:
| But the user never sees that memory, right? Doesn't it go in
| FP32 and come out FP32? I still think it's deceptive marketing.
| bcatanzaro wrote:
| The user does see 32 bits, and all bits are used, because all
| the additions (and other operations besides the multiply in
| matrix ops) are in FP32. So the bottom bits are populated with
| useful information.
| p1esk wrote:
| Because your existing FP32 models should run fine when converted
| to TF32, so TF32 is equivalent to FP32 as far as DL
| practitioners are concerned.
| cjbgkagh wrote:
| There is a lot of redundancy in DL that forgives all manner of
| sins; I still think it's sneaky.
| fancyfredbot wrote:
| The tensor cores will be great for machine learning and the
| FP32/FP64 fantastic for HPC, but I'd be surprised if there were
| a lot of applications using both of these features at once. I
| wonder if there's room for a competitor to come in and sell
| another huge accelerator with only one of these two features,
| either at a lower price or with more performance? Perhaps the
| power density would be too high if everything was in use at
| once?
___________________________________________________________________
(page generated 2022-03-22 23:00 UTC)