[HN Gopher] ROCm is AMD's priority, executive says ___________________________________________________________________ ROCm is AMD's priority, executive says Author : mindcrime Score : 215 points Date : 2023-09-26 17:54 UTC (5 hours ago) (HTM) web link (www.eetimes.com) (TXT) w3m dump (www.eetimes.com) | halJordan wrote: | The first step is admitting there's a problem. So... that's nice. | ethbr1 wrote: | Exactly. People might trust AMD if they continue to invest in | this for the next 10 years. | | It's clear it wasn't a corporate priority. Convince people it | is via sustained action and investment, and _eventually_ they | might change their minds. | clhodapp wrote: | If they were serious, they would start something like drm/mesa | but for compute, and it would just work out of the box with a | stock Linux kernel. | HideousKojima wrote: | Only 16 years after Nvidia released CUDA | grubbs wrote: | I remember chatting with some Nvidia rep at CES 2008. He showed | me how CUDA could be used to accelerate video upscaling and | encoding. I was 19 at the time and just a hobbyist. I thought | that was the coolest thing in the world. | | (And yes, I "snuck" into CES using a fake business card to get | my badge) | gdiamos wrote: | Back in the day, using CUDA was really hard. It got better as | more people built on it and it got battle-tested. | hyperbovine wrote: | It's still not exactly easy, and the API has not changed | much since the aughts except to become richer and more | complicated. But almost nobody writes raw CUDA anymore. | It's abstracted away beneath many layers of libraries, e.g. | Flax -> Jax -> lax -> XLA -> CUDA. | Dah00n wrote: | You remind me of one of those kinds of people who are part of | "team green" or an Apple fan. People who wish nothing more | than to see "the others" fail. A win for their team is good, but | a failure of the other team is the best thing ever and makes them | feel all giddy inside. | jacquesm wrote: | What a useless comment. It is you who is fanning the flames; I | would be more than happy with a bit more competition. The sad | reality is that right now, if you want to focus on your job | and not on the intermediary layers, NV is pretty much the | only game in town. The 'Team Green' bs came out of the gaming | world, where people with zero qualifications were facing off | with other people with zero qualifications about whose HW was | 'the best', when 'the best' meant: I can play games. But this | is entirely different; it is about long and deep support of a | complex hardware/software combo where whole empires are built | upon that support. Those are not decisions made lightly, and | unfortunately AMD has done very poorly so far. This | announcement is great, but the proof of the pudding will be in | the eating, so let's see how many engineers they dedicate to | delivering top-notch software. | HideousKojima wrote: | The hilarious thing is I'm actually an AMD fanboy; I've made | a point to only get their GPUs (and CPUs) for the last decade | or so. But I'm still annoyed and frustrated that it's taken | them so long to get their act together on this. | Havoc wrote: | I've concluded they're just allergic to money. | | Even after it became very clear that this is going to be big, | they're still slow off the block, as if they're not even trying. | | e.g. Why not make a list of the top 500 people in the AI field and | send them cards, no strings attached, plus the best low-level | documentation you can muster.
Insignificant cost to AMD, but | could move the mindshare needle if even 20 of the 500 experiment | and make some noise about it in their circles. | | The Icewhale guys did exactly that, as best as I can tell. A 350k USD | hardware Kickstarter, so really lean. Yet all the YouTubers even | vaguely in their niche seem to have one of their boards. It's a | good board, don't get me wrong, but there is no way that was | organic. Some sharp marketeer made sure the right people have the | gear to influence mindshare. | | https://www.youtube.com/results?search_query=zimaboard | [deleted] | treprinum wrote: | I suspect it's because they don't want to pay for software | engineers, as hardware engineers are much cheaper. I was | contacted by their recruiter last year and it turned out the | principal engineer salary was at the level of an entry FAANG | salary, so I suspect they can't really source the best people. | jjoonathan wrote: | My suspicion is that the GPGPU hardware in shipped cards has | known problems / severe limitations due to neglect of that side | of the architecture for the last ~10 years. Shipping a bunch of | cards only to burn the next generation of AMD compute fans as | badly as they burned the last generation of AMD compute fans | would _not_ be wise. It's painful to wait, but it may well be | for the best. | freeone3000 wrote: | ROCm on Vega only works on certain motherboards because the | card lacks a synchronization clock over the PCI bus. They | added it on _some_ later cards. It's absurd how much is | lacking and inconsistent. | gdiamos wrote: | Instinct has much better SW support today than Radeon, so you | would need to send MI210s/etc. | | I think it's at the point where, if you are comfortable with | GEMM kernels, setting up SLURM, etc., it is usable. But if you | want to stay at the huggingface layer or higher, you will run | into issues. | | Many AI researchers are higher level than that these days, | but some of us are still willing to go lower level. | spacecadet wrote: | Yeah, this. I tried to do some computing with AMD server-grade | cards 2 years ago and found all of the APIs so out of | date and the documentation equally out of date... Went CUDA | and didn't look back. Sad, 'cause I'm an AMD fanboy of old. | tysam_and wrote: | It seems like Hotz and co are able to move pretty well on it, | so maybe there's some low-level stuff they're using (or maybe | they're forced to for a few reasons) w.r.t. the tinybox, but | it is impressive how much they've been able to do so far I | think. :3 <3 :')))) :') | simfree wrote: | The Radeon MI series seems to perform fine if you follow | their software stack happy path. Same for using modified | versions of ROCm on APUs; it's just that no one has been willing | to invest in paying a few developers to work on broader | hardware support full-time, thus any bugs outside enterprise | Linux distros on Radeon MI series cards do not get triaged. | roenxi wrote: | > e.g. Why not... | | A key part of progress is choosing the direction to progress | in. Flashy knee-jerk moves like that sound good, but they aren't | the fastest way to move forward. The first step (which I think | they've taken) is for the executives to align on what the | market wants. The second is to work out how to achieve it, the | third to do it. Handing out freebies would probably help, but | it'll take a sustained long-term strategy for AMD to make money. | | AMD's problem isn't low-level developer interest.
The George | Hotz video rant on AMD was enlightening - the interest is there | and the official drivers just don't work. A few years ago I | made an effort to get into reinforcement learning as a hobby | and was blocked by AMD crashes. At the time I assumed I'd done | something wrong. I still believe that, but I'm less certain | now. It is possible that the reason AMD is doing so poorly is | just that their code to do BLAS is buggy. | | People get very excited about CUDA, and maybe everything there | is necessary, but on AMD the problem seems to be that the card | can't reliably multiply matrices together. I got some early | nights using Stable Diffusion because everything worked great | for an hour then the kernel panicked. I didn't give AMD any | feedback because I run an unsupported card and OS - effectively | all cards and OSs are unsupported - but if that is widespread | behaviour it would be a grave blocker. | | I think they are serious now though. The ROCm documentation | dropped a lot of infuriating corporate waffle recently, and that | is a sign that good people are involved. Still going to wait | and see before getting too hopeful that it works out well. | jacquesm wrote: | > Flashy knee-jerk moves like that sound good, but they aren't | the fastest way to move forward. | | NVidia: | | - Games -> we're on it | | - Machine learning -> we're on it | | - Crypto -> we're on it | | - LLM / AI -> we're on it | | Compare the growth rate of NVidia vs AMD and you get the | picture. Flashy knee-jerk moves are bad; identifying growth | segments in your industry and running with them is | _excellent_ strategy. | | People get excited about CUDA _because it works_, and AMD | could have had a very large slice of that pie. | | > on AMD the problem seems to be that the card can't reliably | multiply matrices together. I got some early nights using | Stable Diffusion because everything worked great for an hour | then the kernel panicked. I didn't give AMD any feedback | because I run an unsupported card and OS - effectively all | cards and OSs are unsupported - but if that is widespread | behaviour[sic] it would be a grave blocker. | | Exactly. And with NVIDIA you'd be working on your problem | instead. And that's what makes the difference. AMD should do | exactly what the OP wrote: gain mindshare by getting at least | some researchers on board with their product, assuming they | haven't burned their brand completely by now. | seunosewa wrote: | NVIDIA is focused on graphic cards. AMD has the tough CPU | market to worry about. | jacquesm wrote: | That's AMD's problem to solve; they made that choice. | | NV doesn't have to worry about resource allocation, | branding, etc. AMD could copy that by spinning out its | GPU division. Note that 'graphic cards' is no longer a | proper identifier either; they just happen to have | display connectors on them (and not even all of them). | They're more like co-processors that you may also use to | generate graphics. But I'm not even sure if that's the | bulk of the applications. | TheCleric wrote: | Never half-ass two things when you can whole-ass one | thing. | gravypod wrote: | If this turns around it will be amazing, but ROCm isn't the only | issue. The entire driver stack is important. If they came out | with virtualization support for their GPUs (even if everyone paid | a 10% perf hit) they'd take over the cheap hosted GPU space, which | is a huge market.
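An aside to make roenxi's "can't reliably multiply matrices" point above concrete: the instability described tends to appear only under sustained load, so the usual smoke test is a long-running matmul loop with a known answer. Below is a minimal, purely illustrative sketch in CUDA syntax; an equivalent HIP version should be a near-verbatim port via hipify, and none of the names or numbers come from the thread. With all-ones inputs, every output element should equal n; a driver or hardware fault shows up as an API error or a wrong value.

    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    // Naive NxN matrix multiply: one output element per thread.
    __global__ void matmul(const float* a, const float* b, float* c, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k) acc += a[row * n + k] * b[k * n + col];
            c[row * n + col] = acc;
        }
    }

    int main() {
        const int n = 1024;
        const size_t bytes = size_t(n) * n * sizeof(float);
        std::vector<float> host(size_t(n) * n, 1.0f);       // all-ones input
        float *a, *b, *c;
        cudaMalloc(&a, bytes); cudaMalloc(&b, bytes); cudaMalloc(&c, bytes);
        cudaMemcpy(a, host.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(b, host.data(), bytes, cudaMemcpyHostToDevice);
        dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
        for (int iter = 0; iter < 10000; ++iter) {           // sustained load, not a one-shot run
            matmul<<<grid, block>>>(a, b, c, n);
            cudaError_t err = cudaDeviceSynchronize();       // surfaces kernel faults / device resets
            if (err != cudaSuccess) {
                printf("iteration %d failed: %s\n", iter, cudaGetErrorString(err));
                return 1;
            }
        }
        cudaMemcpy(host.data(), c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %.1f (expected %d)\n", host[0], n);   // every element should equal n
        return 0;
    }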
| mindcrime wrote: | Getting proper (and official) ROCm support across their | consumer GPU line will be big as well. Hobbyists aren't buying | MI300s and their ilk. And surely AMD is better off if a would-be | hobbyist (or low-budget academic/industrial researcher) | chooses a Radeon card over something from NVIDIA! | | I'm about to buy a high-end Radeon card myself, gambling that | AMD is serious about this and will get it right, and that it | won't be a wasted purchase. So yeah, if I seem like an AMD | fanboy (I am, somewhat), at least I'm putting my money where my | mouth is. :-) | | _AMD's software stacks for each class of product are separate: | ROCm (short for Radeon Open Compute platform) targets its | Instinct data center GPU lines (and, soon, its Radeon consumer | GPUs),_ | | They've been saying this for a while, and I'm encouraged by | reports that people "out there" in the wild have actually | gotten this to work with some cards, even in advance of the | official support shipping. So here's hoping they are really | serious about this point and make this real. | jauntywundrkind wrote: | Apologies for the snark, but maybe it's better that _so far_ | AMD has had terrible consumer card support. What little | hardware they have targeted seems to be barely stable & | barely working for the very limited workloads that are | supported. If regular consumers were told their GPUs would | work for GPGPU, they might be rotten pissed when they found | out what the real state of affairs is. | | But if AMD really wants a market impact - which is what this | submission is about - getting good support across a decent | range of consumer GPUs is absolutely required. They cannot | win this ecosystem battle with only datacenter mindshare. | auggierose wrote: | Yeah, don't. Buy an Nvidia and get shit done. | bryanlarsen wrote: | Easier said than done, at least for the H100. | dotnet00 wrote: | They're talking about consumer cards, which is the point. | You can learn CUDA off any consumer Nvidia card and have | it translate to the fancier gear; that's part of why | Nvidia has so much mindshare. | | E.g. I can write my CUDA code with my 3090s, my boss can | test it on his laptop's discrete graphics, and then after | that we can take the time to bring it to our V100s and | A100s, and nothing really has to change. | iforgotpassword wrote: | A bit harsh, but I agree in that I'll only believe it when I | see it. Have been burned by empty promises from AMD before. | capableweb wrote: | For some people, it's not just about getting results or | "get shit done" but about the journey and learning on the | way there. Also, AMD's approach to openness tends to be a | bit better than NVIDIA's, so there's that too. And since | we're on _Hacker_ News after all, an AMD GPU for the hacker | betting on the future seems pretty fitting. | bravetraveler wrote: | For someone using Linux, an AMD card may be even better | suited for 'getting things done'. | | Wayland and many things _outside of GPGPU_ are much | better; i.e. power control/gating/monitoring are all | available over _sysfs_. You can over/underclock a fleet | of systems with traditional config management. | | GPGPU surely deserves some weight given the context of | the thread, but let's not ignore the warts Nvidia shows | elsewhere. | mindcrime wrote: | I get where you're coming from, and in fact I am planning | to also build an NVIDIA-based ML box as well.
But I | pointedly want to support AMD here for a variety of | reasons, including an ideological bias towards Open Source | Software, and a historical affinity for AMD that dates back | to the mid 90's. | Conscat wrote: | AMD's debuggers and profilers let you disassemble | kernel/shader machine code and introspect registers and | instruction latency. That's something at least that Nvidia | doesn't do with Nsight tools. | jauntywundrkind wrote: | Virtualization is such a key ability. I really really lament | that it's been tucked away, in a couple specific products (The | last MxGPU is, what, half a decade old? More? Oh I guess they | finally spun off a new one, an RDNA2 V620!). | | I keep close & cherish a small hope that for some use-cases we | might get a soft virtualization-alike that just works. I don't | know enough to say how likely this is to adequately work, but | in automotive & some other places there are nested Waylands, | designed to share hardware. You still need a shared OS layer, a | shared kernel, and a compositor that manages all the | subdesktops - this isn't full virtualization - but | hypothetically you get something very similar to | virtualized/VDI gpus, if you can handle the constraints. | | This is really a huge huge huge shift that Wayland has | potentially enabled, by actually using kernel resources like | DMA-BUFs and what not, where apps can just allocate whatever & | pass the compositor filehandles to the bufs. Wayland is ground | up, unlike X's top down. So it's just a matter of writing | compositors smart enough to push what data from whom needs to | get rendered and sent out where. | | I would love to know more about what hardware virtualization | really buys, know more about the limitations of what VDI is | possible in software. But my hope is, in not too long, there's | good enough VDI infrastructure that it's basically moot whether | a gpu has hardware support. There will be some use cases where | yes every users needs to run their own kernel & OS, and that | won't be supported (albeit virtio might workaround even that | quite effectively), but for 95% of use cases the more modern | software stack might make this a non-issue. And at that point, | these companies might stop having such expensive-ass product | segmentation, charging 3x as much to have a couple hardware | virtual devices, since in fact it costs them essentially | nothing & the software virtualization is so competitive. | 01100011 wrote: | As far as I understand it, AMD basically has to do this because | games are going to increasingly rely on LLMs & generative AI | operating simultaneously with the graphics pipeline. | imbusy111 wrote: | It has nothing to do with games. The market outside of games | for compute is much bigger at the moment with the AI hype, and | AMD is positioned to take a good slice of it, if they get their | software stack in order. | alex21212 wrote: | Rocm and amd drives me nuts. The lack of support for consumer | cards and the hassle of getting basic things in pytorch to just | work was too much. | | I was burned by support that never came for my 6800xt. Recently | went back to NVIDIA with a 4070 for pytorch. | | I hope amd gets their act together with rocm but I'm not going to | buy an AMD GPU until they do fix it rather than just vaguely | promise to add support some day ... | zucker42 wrote: | Exactly. I recently started a NN side project. The process for | setting up PyTorch was to run `pacman -S cuda` and `pip install | torch`. I was using a GTX 1060. 
If it was a project with a | bigger budget, I could have rented servers from AWS with all | the software preinstalled in no time. I don't even know if it | would have been possible for me to do it with AMD, even if I | owned an AMD graphics card. | | People like me are small potatoes to AMD, but surely it's hard | to make significant inroads when it's impossible for anyone to | learn or do small projects on ROCm, and big projects can't rely | on ROCm just working. | jacquesm wrote: | People like you are small potatoes until you have some | measure of success, and then suddenly you're burning up GPU | hours by the truckload, and whatever you're used to you will | continue using. | Tsiklon wrote: | I think AMD need to do something BIG in the enterprise space. It | seems Nvidia have the lion's share of the market, but Intel have | been making good strides there with their DC GPUs. | | The software stack is the key here. If the drivers aren't there, | it doesn't matter what paper capabilities your product has, | because you can't use them. | | AMD have on paper done well with performance in recent | generations of consumer cards, but their drivers universally seem | to be the letdown that keeps them from making the most of their | architecture. | therealmarv wrote: | They have! At one of their keynotes this past summer they announced a | direct competitor to Nvidia's AI chips for | enterprises: the MI300X. | | https://www.anandtech.com/show/18915/amd-expands-mi300-famil... | | The software stack is crucial of course, but if you buy this kind of | chip (meaning you have a lot of money) you can probably also | optimise your stack for it for some extra bucks, to not rely on | Nvidia's supply. | vegabook wrote: | With all due respect, this is an insult to those of us who have | loyally purchased AMD for numerous years, trying our very best to | do compute with days, nay weeks, of attempts. | | Now, 5 years too late, we get told it's suddenly their number one | priority. | | Too late. Not only has all goodwill gone, but it's in deep | negative territory. Even 50% lower performance stacks like Intel | / Apple are much more appealing than AMD will ever be at this | stage. | capableweb wrote: | The "senior VP of the AI group at AMD" said at an "AI Hardware | Summit" that "My area is AMD's No. 1 Priority". | | Tell me when the rest of the company aligns with you and has | started to show any results in providing a good experience for | people to do machine learning with AMD. As it stands right now, | there is so much tooling missing, and the tooling that's there is | severely lacking. | | But I have faith. They've reinvented themselves with CPUs, | multiple times, so why not with GPUs, again? | mindcrime wrote: | _Tell me when the rest of the company aligns with you_ | | More or less the same message has been promulgated[1][2] by no | less than Lisa Su[3], FWIW. | | [1]: https://www.phoronix.com/news/Lisa-Su-ROCm-Commitment | | [2]: https://www.forbes.com/sites/iainmartin/2023/05/31/lisa-su-s... | | [3]: https://en.wikipedia.org/wiki/Lisa_Su | no_wizard wrote: | The inevitable fight here is between ROCm, which may have 100s of | AMD engineers working on it and related verticals (at best, | without significant changes at the company), plus whatever | contributions they can muster from the community, and CUDA. | | I think, on a headcount check at least, CUDA had _thousands_ of | engineers working on it and related verticals.
| | I know there's a philosophy that states that, eventually, open source | eats everything; however, this one seems like there is so much | catching up to do that AMD will need to spend big and fast to get off the | ground competitively. | [deleted] | martinald wrote: | It's absolutely mind-boggling to me that AMD is still struggling | so badly on this. | | There is an absolutely enormous market for AMD GPUs for this, but | they seem to be completely stuck on how to build a developer | ecosystem. | | Why aren't AMD throwing as many developers as possible at submitting | PRs adding ROCm support to the open source LLM effort, for | example? | | It would give AMD real-world insight into the problems with their | drivers and SDKs as well, which are incredibly numerous. | | People would be willing to overlook a huge amount of jank for | cheap(er) cards with large VRAM configurations. I don't think | they even need to be particularly fast, just have the VRAM | needed, which I'm sure AMD could put specialist cards together | for. | hedgehog wrote: | Historically they believed that "the community" would address | broader ML software support. I think the idea was they could | assign dedicated engineers for bigger customers, and together | that was a sort of Pareto-goodish solution given their | constraints as a company. Even in retrospect I'm not sure if | that was a good call or not. | Almondsetat wrote: | I mean, they _would_ be right if all their cards, both | consumer and enterprise, supported the same programming | interface. | | You cannot trust the community to do the work for you but | then only make the software available for $Xk cards. | ryukoposting wrote: | s/OpenCL/ROCm/g | pixelpoet wrote: | Oh man, this is exactly what I want to see on the HN frontpage! | | I commented on another article, about an AMD chip that had no | OpenCL support, that this made it dead in the water for me, and was | downvoted; surely everyone understands how important CUDA is, and | everyone should understand how important open standards are (e.g. | FreeSync vs Nvidia's GSync), so I can't understand why more | people don't share my zeal for OpenCL. | | I've shipped two commercial products based on it which still | work perfectly today on all 3 desktop platforms from all GPU | vendors... what's not to love? | tysam_and wrote: | If they can make a 288 GB $4.4-6.8k prosumer, home-computer-friendly | graphics card, I will be extremely happy. Might be a | pipe dream (today at least, lol, and standard in like...what, 5 | years?), but if they can pull that off, then I think things | would really change a lot. | | I don't care if it's slow, bottom-of-the-barrel GDDR6, or | whatever; just being able to enter the high-end model | finetuning & training regime for ML models on a budget | _without_ dilly-dallying with multiple graphics cards (a | monstrous pain-in-the-neck from a software, engineering, & | experimentation perspective) would enable so much large-scale | development work to happen. | | The compute is extremely important, and in most day-to-day | use cases, the memory bandwidth even more so, but boy oh boy | would I love to enter the world offered by a large unified card | architecture. | | (Basically, in my experience, parallelizing a model across | multiple GPUs is like compiling from code to a binary -- | technically you can 'edit' it, but it's like directly hex | editing strings in a binary blob, extremely limited.
Hence why | I try to stick with models that take only a few seconds | (minutes at most) to train on highly-representative tasks, | distill first principles, and then expand and exploit that to | other modalities from there). | Conscat wrote: | OpenCL isn't very useful now that we have Vulkan. Its biggest | advantage is that there exist C++ compilers for its kernels. | But AMD's OpenCL runtime inserts excessive memory barriers not | required by the spec (they won't fix this due to Hyrum's Law) | and Vulkan gives you more control over the memory allocation | and synchronization anyways. If we had better Vulkan shader | compilers, OpenCL would serve basically no purpose, at least | for AMD hardware. | cpill wrote: | AI libs could use it and we'd break the bonds in CUDA. Also | Rust might get an implementation which would give it they | non-intervention to overtake C++ | pjmlp wrote: | No it wouldn't, until it provides the same polyglot support | and graphical tooling as CUDA. | | At least Intel is trying with oneAPI into that direction. | raphlinus wrote: | Yeah, that's a big if. In theory there's nothing preventing | good compilation to Vulkan compute shaders, in practice | people just aren't doing it, as CUDA actually works today. | | I also agree that Vulkan is more promising than OpenCL. With | recent extensions, it has real pointers (buffer device | address), cooperative matrix multiplication (also known as | tensor cores or WMMA), scalar types other than 32 bits, | proper barrier (including device-scoped, needed for single | pass scan), and other important features. | 20k wrote: | Its not that they're supporting buggy code, they just | downgraded the quality of their implementation significantly. | They made the compiler a lot worse when they swapped to rocm | | https://github.com/RadeonOpenCompute/ROCm-OpenCL- | Runtime/iss... is the tracking issue for it filed a year ago, | which appears to be wontfix largely because its a lot of work | | OpenCL still unfortunately supports quite a few things that | vulkan doesn't, which makes swapping away very difficult for | some use cases | parl_match wrote: | > I can't understand why more people don't share my zeal for | OpenCL. | | When I last worked with it, it was difficult, unstable, and | performed poorly. CUDA, on the other hand, has been nothing but | good (at least). Well, nvidia pricing aside ;) | | OpenCL might be a lot better now, but for a lot of us, we | remember when it was actively a bad choice. | Vvector wrote: | But is this just more BS from AMD? | | https://www.bit-tech.net/reviews/tech/cpus/amd-betting-every... | AMD Betting Everything on OpenCL (2011) | jjoonathan wrote: | I'm pretty sure the NVDA pump finally convinced the AMD board | / C-Suite to prioritize this, but it takes time to steer a | big ship. I'm hopeful, but there are still bad incentives to | jump the gun on announcements so I'll let others take the | plunge first. | kldx wrote: | > I've shipped two commercial products based on it which still | works perfectly today on all 3 desktop platforms from all GPU | vendors... what's not to love? | | In my experience, if commercial products involved any sort of | hand-optimized, proprietary OpenCL, one would be shocked by the | lack of documentation and zero consistency across AMD's GPUs. | Intel has SPIRV and Nvidia has PTX and this works pretty well. | But some AMD cards support SPIR or SPIRV, and some don't and | this support matrix keeps changing over time without a single | source of truth. 
| | Throw in random segfaults inside AMD's OpenCL implementation | and you have a fun day debugging! | | Dockerizing OpenCL on AMD is another nightmare I don't want to | get into. Intel is literally installing the compute runtime and | mapping `/dev/dri` inside the container. On paper, AMD has the | same process but in reality I had to run `LD_DEBUG=binding` so | many times just to figure out why AMD runtime breaks inside | docker. | | There may be great upsides to AMD's hardware in other domains | though | jjoonathan wrote: | For a long time, AMD promoted OpenCL as viable without it | actually being viable. This leaves scars and resentment. Mine | come from about 10 years ago. They run deep. | | I'm glad to hear your experience was better, but I'm fresh out | of trust. This time, I need to see major projects in my | application areas working on AMD _before_ I buy, because AMD | has taught me that "trust us" and "just around the corner" can | mean "10 years later and it still hasn't happened." I'm pretty | sure that this time _is_ different, but the green tax is dirt | cheap compared to learning this lesson the hard way, so I 'm | letting others jump first this time. | gdiamos wrote: | Relevant, we deployed Lamini on hundreds of MI200 GPUs. | | Lisa tweet: https://x.com/LisaSu/status/1706707561809105331?s=20 | | Lamini tweet: | https://x.com/realSharonZhou/status/1706701693684154766?s=20 | | Blog: https://www.lamini.ai/blog/lamini-amd-paving-the-road-to- | gpu... | | Register: | https://www.theregister.com/2023/09/26/amd_instinct_ai_lamin... | CRN: https://www.crn.com/news/components-peripherals/llm- | startup-... | | The hard part about using any AI Chips other than NVIDIA has been | software. ROCm is finally at the point where it can train and | deploy LLMs like Llama 2 in production. | | If you want to try this out, one big issue is that software | support is hugely different on Instinct vs Radeon. I think AMD | will fix this eventually, but today you need to use Instinct. | | We will post more information explaining how this works in the | next few weeks. | | The middle section of the blog post above includes some details | including GEMM/memcpy performance, and some of the software | layers that we needed to write to run on AMD. | mardifoufs wrote: | What's the cost benefit vs. Nvidia? Is it cheaper? | light_hue_1 wrote: | You simply cannot buy nvidia GPUs at scale at the moment. | We're getting quotes that are many months out, sometimes even | a year+ out. | gdiamos wrote: | We kept hearing 52 weeks for new shipments. | gdiamos wrote: | Available in orders of up to 10,000 GPUs today - no shortage | | More than 10x cheaper than allocating machines on a tier 1 | cloud - AWS, Azure, GCP, Oracle, etc | | More memory - 128GB HBM per GPU - means bigger models fit for | training/inference without the nightmare of model parallelism | over MPI/infiniband/etc | | Longer term - finetuning optimizations | mardifoufs wrote: | Ah! The memory sounds interesting. How would that compare | to similar Nvidia hardware w.r.t cost assuming the hardware | was available? | | Does AMD provide something similar to nvlink, and even | libraries like cudnn? | | Also, last I checked none of the public clouds offered any | of the latest gens MI GPUs, so I wasn't aware that it had | good availability! Azure had a preview but I'll look more | into it now. | | Thank you for your answer btw! | gdiamos wrote: | Yeah getting around the no public cloud thing was really | annoying. We had to build our own datacenter. 
| | On the plus side, it was drastically cheaper and now we | can just slot in machines. | | I would prefer that a tier 1 cloud made MI GPUs available | though. It would make it so much more accessible. | gdiamos wrote: | See the memory size comparison (GB) in this table: https: | //en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces... | tbihl wrote: | It blows my mind that A100 and H100 are each safely below | 1000W power draw. | gardnr wrote: | The classic economic benefits of competition: | | * Drives down price | | * Enhances product features (I see them competing on VRAM | first) | | * Helps to insulates buyers from supply issues | | Nvidia has kneecapped their consumer grade hardware to ensure | the gaming market still has scraps to buy in spite of crypto | mining and the AI gold rush. All AMD would have to do to eat | into Nvidia marketshare is remove the hardware locks in low- | end cards and ship one with 64GB+ of VRAM. | | This of course would only work if they have comparable/usable | software support. Any improvements to ROCm will be a boon for | any company that doesn't already have or can't afford huge | farms of high-end Nvidia chips. | jauntywundrkind wrote: | > _If you want to try this out, one big issue is that software | support is hugely different on Instinct vs Radeon. I think AMD | will fix this eventually, but today you need to use Instinct._ | | I'm really really worried about AMD, and whether they're going | to care about anyone else. They might just care about Instinct, | where margins are so high, and ignore consumer cards or making | more friction and segmentation for consumer cards. | | Part of what made CUDA so successful was that the low hardware | barrier to entry created such a popular offering. Everyone used | it. I really hope AMD realizes that, and really hope AMD | invests in consumer card software too. Just making it work on | the high end doesn't seem enough to get the kind of mass- | movement ecosystem success AMD really needs. I'm afraid they | might go for a smaller win, try to compete only at the top. | dotnet00 wrote: | It's nice to hear that there are actual results to show, since | AMD execs simply saying that ROCm is a priority isn't really | convincing anymore given their track record on claims regarding | support on the consumer side. | viewtransform wrote: | The difference this time is that the executive is from | Xilinx. Xilinx has had an AI software development team for a | while in the FPGA space. | | AMD has had poor management in the GPU computing space since | Raja Koduri's time (he put the best engineering resources on | VR during his tenure and ignored deep learning). Subsequent | directors have not had a long term vision and left within a | few years. | | Looks like Lisa Su has corrected this now - they seem to have | moved AMD software engineers en masse to work under Xilinx | management on AI. Remains to be seen if this new management | hierarchy will have a better vision and customer focus. | varelse wrote: | [dead] | tbruckner wrote: | I would really hope you could get decent utilization on ops as | fundamental as GEMM/memcpy on a single device. Translating that | to MFU is a completely different story. | gdiamos wrote: | We get good utilization at scale as well. Typically 30-40% of | peak at the full application level for training and | inference. | | Perf isn't the biggest problem though, many AI chips can do | this or a bit better on benchmarks, if you invest the | engineering time to tune the benchmark. 
| | The really hard part is getting a complete software stack | running. | | It took us over 3 years because many of the layers just | didn't exist, e.g. scale out LLM inference service that | supports multiple requests with fine-grained batching across | models distributed over multiple GPUs. | | On Instinct, ROCm gets you the ability to run most pytorch | models on one GPU assuming you get the right drivers, | compilers, framework builds, etc. | | That's a good start, but you need more to serve a real | application. | mgaunard wrote: | People have been using their GPGPUs for decades on a | variety of scientific applications, and there are all kinds | of hybrid and multi-device frameworks that exist (often | supporting multiple backends). | | The difference is that it didn't get a lot of love as part | of the overhyped python LLM movement. | gdiamos wrote: | Completely agree, I'd love to see some of the innovations | from HPC move over into their LLM stack. | | We are working on it, but it takes time. | | Contributions to foundational layers like ROCBlas, | pytorch, slurm, Tensile, huggingface, etc would help. | dauertewigkeit wrote: | With all this hype about CUDA, I have recently started looking | into programming CUDA as a job as I love that kind of challenge, | but to my dismay I found that these tasks are very niche. So it | is not even that people are routinely writing new CUDA code. It's | just that the current corpus is too big and comprehensive for | alternatives to compete with. | jacquesm wrote: | That and a massive amount of experience already out there on | how to optimize for that particular architecture. NVidia has | done well for itself on the back of four sequential very good | bets coupled with dedication unmatched by any other vendor, | both on the hardware and on the software side. It also was one | of the few times that I didn't care if I ran the vendor | supplied closed source stuff because it seemed to work just | fine and I never had the feeling they would suddenly drop | support for my platform. | coder543 wrote: | Specialized skills can have a fairly small job market | sometimes. I think a lot of CUDA code ends up being | foundational as part of popular libraries, supporting tons of | applications that never need to write a single line of CUDA | themselves. | ckastner wrote: | The Debian ROCm Team [1] has made quite a bit of progress in | getting the ROCm stack into the official Archive. | | Most components are already packaged, the next big target is | adding support to the PyTorch package. | | Many of the packages are older versions; this is because getting | broad coverage was prioritized. The other next big target that is | currently being worked on is getting full ROCm 5.7 support. | | I fully expect Debian 13 (trixie) to come with full ROCm support | out-of-the-box, and as a consequence, also derivatives to have | support (Ubuntu above all). In fact, there will almost certainly | be backports of ROCm 5.7 to Debian 12 (bookworm) within the next | few months, so one will be able to just $ sudo | apt-get install pytorch-rocm | | One current obstacle is infrastructure: the Debian build and CI | infrastructures (both hardware and software) were not designed | with GPUs in mind. This is also being worked on. | | Edit: forgot to say that the CI infra that the Team is setting up | here tests all of these packages on consumer cards, too. 
So while | there may not be _official_ support for most of these, upstream | tests passing on the cards within the infra should be a good | indication for _practical_ support. | | [1] https://salsa.debian.org/rocm-team/ | avcxz wrote: | I'd also like to point out that ROCm has been packaged for Arch | Linux since the beginning of 2023, with efforts starting since | March 2020 [1]. | | Currently on Arch Linux you can run the following successfully: | $ sudo pacman -S python-pytorch-rocm | | Arch Linux even has ROCm support with blender. | | [1] https://github.com/rocm-arch | mgaunard wrote: | AMD has a history of providing sub-par software, and their | strategy of (partially) opening up their specifications and have | other people write it for free didn't work either. | | Nvidia has huge software teams, and so does Intel. | mindcrime wrote: | I don't know if they'll ultimately succeed or not, but they at | least seem to be putting genuine effort into this. ROCm | releases are coming out at a relatively nice clip[1], including | a new release just a week or two ago[2]. | | [1]: https://github.com/RadeonOpenCompute/ROCm/releases | | [2]: https://www.phoronix.com/news/AMD-ROCm-5.7-Released | Vvector wrote: | Yeah, AMD is doing more with ROCm. But are they catching up | to Nvidia, or just not falling behind as fast as before? Only | time will tell | dagw wrote: | Not only sub-par software, but sub-par software that they drop | support for after a couple of years. People can work around the | problems with sub-par software if they believe that it will | benefit them long term. They will absolutely not put in the | effort if they fear it will be completely useless in 2 years | time. | raphlinus wrote: | ROCm makes me sad, as it reminds me of how much better GPUs could | be than they are today. | | I've lately been exploring the idea of a "Good Parallel | Computer," which combines most of the agility of a CPU with the | efficient parallel throughput of a GPU. The central concept is | that the decision to launch a workgroup is made by a programmable | controller, rather than just being a cube of (x, y, z) or | downstream of triangles. A particular workload it would likely | excel at is sparse matrix multiplication, including multiple | quantization levels like SpQR[1]. I'm hopeful that it could be an | advance in execution model, but also a simplification, as I | believe a lot of the complexity of the current GPU model is | because of lots of workarounds for the weak execution model. | | I'm not optimistic about this being built any time soon, as it | requires rethinking the software stack. But it's fun to think | about. I might blog about it at some point, but I'm also | interested in connecting with people who have been thinking along | similar lines. | | [1]: https://arxiv.org/abs/2306.03078 | johncolanduoni wrote: | How does this differ from CUDA's dynamic parallelism, which | lets you launch kernels from within a kernel? | raphlinus wrote: | There are a lot of similarities, but the granularity is | finer. The idea is that you make a decision to launch one | workgroup (typically 1024 threads) when the input is | available, which would typically be driven by queues, and | potentially with joins as well, which is something the new | work graph stuff can't quite do. Otherwise the idea of stages | running in parallel, connected by queues, is similar. But I | did an analysis of work graphs and came to the conclusion | that it wouldn't help with the Vello (2d vector graphics) | workload at all. 
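For readers who haven't used the CUDA feature referenced above, here is a minimal, purely illustrative sketch of dynamic parallelism (device-side launch); it assumes compilation with relocatable device code (nvcc -rdc=true) on a compute capability 3.5+ GPU, and all names are hypothetical rather than from the thread. The contrast with the scheme raphlinus describes is that this still launches a whole grid decided inside kernel code, rather than a single workgroup dispatched by a programmable controller when queue input becomes available.

    // Child kernel: does the actual work on one chunk of data.
    __global__ void child(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;
    }

    // Parent kernel: each block inspects its input and, only if needed,
    // launches more work from the device, with no round trip to the host.
    __global__ void parent(const float* in, float* out, const int* work_needed, int n) {
        if (threadIdx.x == 0 && work_needed[blockIdx.x]) {
            child<<<(n + 255) / 256, 256>>>(in, out, n);   // device-side (nested) launch
        }
    }

    // Host side (illustrative): parent<<<num_blocks, 256>>>(in, out, work_needed, n);
    // Build: nvcc -rdc=true -arch=sm_70 dynamic_parallelism.cu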
| JonChesterfield wrote: | A workgroup/kernel can launch other ones without talking to the | host. Like cuda's dynamic thing except with no nested lifetime | restrictions. This is somewhat documented under the name HSA. | | Involves getting a pointer to a HSA queue and writing a | dispatch packet to it. Same interface the host has for | launching kernels - easier in some ways (you've got the kernel | descriptor as a symbol, not as a name to dlsym) and harder in | others (dynamic memory allocation is a pain). | raphlinus wrote: | Yeah, dynamic memory allocation from GPU space seems to be | the real sticking point. I'll look into HSA queues, that | looks very interesting, thanks. ___________________________________________________________________ (page generated 2023-09-26 23:00 UTC)