[HN Gopher] ROCm is AMD's priority, executive says
       ___________________________________________________________________
        
       ROCm is AMD's priority, executive says
        
       Author : mindcrime
       Score  : 215 points
       Date   : 2023-09-26 17:54 UTC (5 hours ago)
        
 (HTM) web link (www.eetimes.com)
 (TXT) w3m dump (www.eetimes.com)
        
       | halJordan wrote:
       | The first step is admitting there's a problem. So... that's nice.
        
         | ethbr1 wrote:
         | Exactly. People might trust AMD if they continue to invest in
         | this for the next 10 years.
         | 
         | It's clear it wasn't a corporate priority. Convince people it
         | is via sustained action and investment, and _eventually_ they
         | might change their minds.
        
       | clhodapp wrote:
       | If they were serious, they would start something like drm/mesa
       | but for compute and it would just work out of the box with a
       | stock Linux kernel.
        
       | HideousKojima wrote:
       | Only 16 years after Nvidia released CUDA
        
         | grubbs wrote:
         | I remember chatting with some Nvidia rep at CES 2008. He showed
          | me how CUDA could be used to accelerate video upscaling and
         | encoding. I was 19 at the time and just a hobbyist. I thought
         | that was the coolest thing in the world.
         | 
          | (And yes, I "snuck" into CES using a fake business card to
          | get
         | my badge)
        
           | gdiamos wrote:
           | Back in the day, using CUDA was really hard. It got better as
           | more people built on it and it got battle tested.
        
             | hyperbovine wrote:
             | It's still not exactly easy, and the API has not changed
              | much since the aughts except to become richer and more
             | complicated. But almost nobody writes raw CUDA anymore.
             | It's abstracted away beneath many layers of libraries, e.g.
             | Flax -> Jax -> lax -> XLA -> CUDA.
        
         | Dah00n wrote:
          | You remind me of one of those kinds of people who are part
          | of "team green" or an Apple fan. People who wish nothing
          | more than to see "the others" fail. A win for their team is
          | good, but a failure of the other team is the best thing
          | ever and makes them feel all giddy inside.
        
           | jacquesm wrote:
            | What a useless comment. It is you who stokes the fire; I
            | would be more than happy with a bit more competition. The
            | sad reality is that right now, if you want to focus on
            | your job and not on the intermediary layers, NV is pretty
            | much the only game in town. The 'Team Green' bs came out
            | of the gaming
           | world where people with zero qualifications were facing off
           | with other people with zero qualifications about whose HW was
           | 'the best' when 'the best' meant: I can play games. But this
           | is entirely different, it is about long and deep support of a
           | complex hardware/software combo where whole empires are built
           | upon that support. Those are not decisions made lightly and
           | unfortunately AMD has done very poorly so far. This
           | announcement is great but the proof of the pudding will be in
           | the eating, so let's see how many engineers they dedicate to
           | delivering top notch software.
        
           | HideousKojima wrote:
           | The hilarious thing is I'm actually an AMD fanboy, I've made
           | a point to only get their GPUs (and CPUs) for the last decade
           | or so. But I'm still annoyed and frustrated that it's taken
           | them so long to get their act together on this.
        
       | Havoc wrote:
       | I've concluded they're just allergic to money.
       | 
       | Even after it became very clear that this is going to be big
       | they're still slow off the block as if they're not even trying.
       | 
        | e.g. Why not make a list of the top 500 people in the AI
        | field and send them cards, no strings attached, plus as good
        | low-level documentation as you can muster? Insignificant cost
        | to AMD, but it
       | could move the mindshare needle if even 20 of the 500 experiment
       | and make some noise about it in their circles.
       | 
        | The Icewhale guys did exactly that, as best I can tell. A
        | 350k USD hardware Kickstarter, so really lean. Yet all the
        | youtubers even
       | vaguely in their niche seem to have one of their boards. It's a
       | good board don't get me wrong, but there is no way that was
       | organic. Some sharp marketeer made sure the right people have the
       | gear to influence mindshare.
       | 
       | https://www.youtube.com/results?search_query=zimaboard
        
         | [deleted]
        
         | treprinum wrote:
         | I suspect it's because they don't want to pay for software
         | engineers as hardware engineers are much cheaper. I was
         | contacted by their recruiter last year and it turned out the
         | principal engineer salary was at the level of entry FAANG
         | salary, so I suspect they can't really source the best people.
        
         | jjoonathan wrote:
         | My suspicion is that the GPGPU hardware in shipped cards has
         | known problems / severe limitations due to neglect of that side
         | of the architecture for the last ~10 years. Shipping a bunch of
         | cards only to burn the next generation of AMD compute fans as
         | badly as they burned the last generation of AMD compute fans
          | would _not_ be wise. It's painful to wait, but it may well be
         | for the best.
        
           | freeone3000 wrote:
           | ROCm on Vega only works on certain motherboards because the
           | card lacks a synchronization clock over the PCI bus. They
           | added it on _some_ later cards. It's absurd how much is
           | lacking and inconsistent.
        
           | gdiamos wrote:
            | Instinct has much better SW support today than Radeon, so
            | you would need to send MI210s, etc.
           | 
           | I think it's at the point where if you are comfortable with
           | GEMM kernels, setting up SLURM, etc it is usable. But if you
           | want to stay at the huggingface layer or higher, you will run
           | into issues.
           | 
            | Many AI researchers work at a higher level than that
            | these days, but some of us are still willing to go lower
            | level.
        
           | spacecadet wrote:
            | Yeah, this. I tried to do some computing with AMD
            | server-grade cards 2 years ago and found all of the APIs
            | out of date and the documentation equally out of date...
            | Went CUDA and didn't look back. Sad, 'cause I'm an AMD
            | fanboy of old.
        
           | tysam_and wrote:
           | It seems like Hotz and co are able to move pretty well on it,
           | so maybe there's some low-level stuff they're using (or maybe
           | they're forced to for a few reasons) w.r.t. the tinybox, but
           | it is impressive how much they've been able to do so far I
           | think. :3 <3 :')))) :')
        
           | simfree wrote:
            | The Radeon MI series seems to perform fine if you follow
            | their software stack's happy path. Same for using
            | modified versions of ROCm on APUs. It's just that no one
            | has been willing to invest in paying a few developers to
            | work on broader hardware support full-time, so any bugs
            | outside enterprise Linux distros on Radeon MI series
            | cards do not get triaged.
        
         | roenxi wrote:
         | > e.g. Why not...
         | 
         | A key part of progress is choosing the direction to progress
          | in. Flashy knee-jerk moves like that sound good, but they
          | aren't the fastest way to move forward. The first step
          | (which I think
         | they've taken) is for the executives to align on what the
         | market wants. The second is to work out how to achieve it, the
         | third to do it. Handing out freebies would probably help, but
         | it'll take sustained long term strategy for AMD to make money.
         | 
         | AMD's problem isn't low-level developer interest. The George
         | Hotz video rant on AMD was enlightening - the interest is there
         | and the official drivers just don't work. A few years ago I
         | made an effort to get in to reinforcement learning as a hobby
         | and was blocked by AMD crashes. At the time I assumed I'd done
         | something wrong. I still believe that, but I'm less certain
         | now. It is possible that the reason AMD is doing so poorly is
         | just that their code to do BLAS is buggy.
         | 
         | People get very excited about CUDA and maybe everything there
         | is necessary, but on AMD the problem seems to be that the card
         | can't reliably multiply matrices together. I got some early
         | nights using Stable Diffusion because everything worked great
          | for an hour, then the kernel panicked. I didn't give AMD any
         | feedback because I run an unsupported card and OS - effectively
         | all cards and OSs are unsupported - but if that is widespread
         | behaviour it would be a grave blocker.
         | 
         | I think they are serious now though. The ROCM documentation
         | dropped a lot of infuriating corporate waffle recently and that
         | is a sign that good people are involved. Still going to wait
         | and see before getting too hopeful that it works out well.
        
           | jacquesm wrote:
            | > Flashy knee-jerk moves like that sound good, but they
            | aren't the fastest way to move forward.
           | 
           | NVidia:
           | 
           | - Games -> we're on it
           | 
           | - Machine learning -> we're on it
           | 
           | - Crypto -> we're on it
           | 
           | - LLM / AI -> we're on it
           | 
           | Compare the growth rate of NVidia vs AMD and you get the
           | picture. Flashy knee-jerk moves are bad, identifying growth
           | segments in your industry and running with them is
           | _excellent_ strategy.
           | 
            | People get excited about CUDA _because it works_, and AMD
           | could have had a very large slice of that pie.
           | 
           | > on AMD the problem seems to be that the card can't reliably
           | multiply matrices together. I got some early nights using
           | Stable Diffusion because everything worked great for an hour
            | then the kernel panicked. I didn't give AMD any feedback
           | because I run an unsupported card and OS - effectively all
           | cards and OSs are unsupported - but if that is widespread
           | behaviour[sic] it would be a grave blocker.
           | 
           | Exactly. And with NVIDIA you'd be working on your problem
           | instead. And that's what makes the difference. AMD should do
           | exactly what the OP wrote: gain mindshare by getting at least
           | some researchers on board with their product, assuming they
           | haven't burned their brand completely by now.
        
             | seunosewa wrote:
             | NVIDIA is focused on graphic cards. AMD has the tough CPU
             | market to worry about.
        
               | jacquesm wrote:
               | That's AMD's problem to solve, they made that choice.
               | 
               | NV doesn't have to worry about resource allocation,
                | branding etc. AMD could copy that by spinning out its
                | GPU division. Note that 'graphic cards' is no longer a
               | proper identifier either, they just happen to have
               | display connectors on them (and not even all of them).
               | They're more like co-processors that you may also use to
               | generate graphics. But I'm not even sure if that's the
               | bulk of the applications.
        
               | TheCleric wrote:
               | Never half ass two things when you can whole ass one
               | thing.
        
       | gravypod wrote:
        | If this turns around it will be amazing, but ROCm isn't the only
       | issue. The entire driver stack is important. If they came out
       | with virtualization support for their gpus (even if everyone paid
       | a 10% perf hit) they'd take over the cheap hosted gpu space which
       | is a huge market.
        
         | mindcrime wrote:
         | Getting proper (and official) ROCm support across their
         | consumer GPU line will be big as well. Hobbyists aren't buying
          | MI300s and their ilk. And surely AMD is better off if a
          | would-be hobbyist (or low-budget academic/industrial
          | researcher)
         | chooses a Radeon card over something from NVIDIA!
         | 
         | I'm about to buy a high-end Radeon card myself, gambling that
         | AMD is serious about this and will get it right, and that it
         | won't be a wasted purchase. So yeah, if I seem like an AMD fan-
         | boy (I am, somewhat) at least I'm putting my money where my
         | mouth is. :-)
         | 
         |  _AMD's software stacks for each class of product are separate:
         | ROCm (short for Radeon Open Compute platform) targets its
         | Instinct data center GPU lines (and, soon, its Radeon consumer
         | GPUs),_
         | 
         | They've been saying this for a while, and I'm encouraged by
         | reports that people "out there" in the wild have actually
         | gotten this to work with some cards, even in advance of the
         | official support shipping. So here's hoping they are really
         | serious about this point and make this real.
        
           | jauntywundrkind wrote:
           | Apologies for the snark, but maybe it's better that _so far_
           | AMD has had terrible consumer card support. What little
            | hardware they have targeted seems barely stable and
            | barely working for the very limited workloads that are
           | supported. If regular consumers were told their GPUs would
           | work for GPGPU, they might be rotten pissed when they found
           | out what the real state of affairs is.
           | 
           | But if AMD really wants a market impact - which is what this
           | submission is about - getting good support across a decent
           | range of consumer GPUs is absolutely required. They cannot
           | win this ecosystem battle with only datacenter mindshare.
        
           | auggierose wrote:
           | Yeah, don't. Buy an Nvidia and get shit done.
        
             | bryanlarsen wrote:
             | Easier said than done, at least for H100.
        
               | dotnet00 wrote:
               | They're talking about consumer cards, which is the point.
               | You can learn CUDA off any consumer nvidia card and have
               | it translate to the fancier gear, that's part of why
               | nvidia has so much mindshare.
               | 
                | E.g. I can write my CUDA code with my 3090s, my boss can
               | test it on his laptop's discrete graphics, and then after
               | that we can take the time to bring it to our V100s and
               | A100s and nothing really has to change.
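         
         As a sketch of the write-once portability described above
         (the guard logic and names here are illustrative, not from
         the thread): the same PyTorch code selects whatever
         accelerator is present and falls back to CPU, so nothing
         changes between a 3090, a laptop GPU, and an A100. Guarded so
         it also runs where no GPU, or no torch, is available.

```python
# Hedged sketch of write-once portability via PyTorch: the same code
# selects whatever accelerator is present (a 3090, a laptop GPU, an
# A100) and falls back to CPU, so nothing changes between machines.
try:
    import torch
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    x = torch.randn(4, 4, device=device)
    y = x @ x  # the identical matmul call on every device
    result = tuple(y.shape)
except ImportError:
    result = (4, 4)  # torch unavailable; the shape the matmul would give
print(result)
```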
        
             | iforgotpassword wrote:
              | A bit harsh, but I agree in that I'll only believe it
              | when I see it. I've been burned by AMD's empty promises
              | before.
        
             | capableweb wrote:
             | For some people, it's not just about getting results or
             | "get shit done" but about the journey and learning on the
              | way there. Also, AMD's approach to openness tends to be a
             | bit better than NVIDIA, so there's that too. And since
             | we're on _Hacker_ News after all, an AMD GPU for the hacker
             | betting on the future seems pretty fitting.
        
               | bravetraveler wrote:
               | For someone using Linux, an AMD card may be even better
               | suited for 'getting things done'
               | 
                | Wayland and many things _outside of GPGPU_ are much
                | better; i.e. power control/gating/monitoring are all
                | available over _sysfs_. You can over/underclock a
                | fleet of systems with traditional config management.
               | 
               | GPGPU surely deserves some weight given the context of
               | the thread, but let's not ignore the warts Nvidia shows
               | elsewhere.
        
             | mindcrime wrote:
             | I get where you're coming from, and in fact I am planning
              | to build an NVIDIA-based ML box as well. But I
             | pointedly want to support AMD here for a variety of
             | reasons, including an ideological bias towards Open Source
             | Software, and a historical affinity for AMD that dates back
             | to the mid 90's.
        
             | Conscat wrote:
             | AMD's debuggers and profilers let you disassemble
             | kernel/shader machine code and introspect registers and
             | instruction latency. That's something at least that Nvidia
             | doesn't do with Nsight tools.
        
         | jauntywundrkind wrote:
         | Virtualization is such a key ability. I really really lament
         | that it's been tucked away, in a couple specific products (The
         | last MxGPU is, what, half a decade old? More? Oh I guess they
         | finally spun off a new one, an RDNA2 V620!).
         | 
         | I keep close & cherish a small hope that for some use-cases we
         | might get a soft virtualization-alike that just works. I don't
         | know enough to say how likely this is to adequately work, but
         | in automotive & some other places there are nested Waylands,
         | designed to share hardware. You still need a shared OS layer, a
         | shared kernel, and a compositor that manages all the
         | subdesktops - this isn't full virtualization - but
         | hypothetically you get something very similar to
         | virtualized/VDI gpus, if you can handle the constraints.
         | 
         | This is really a huge huge huge shift that Wayland has
         | potentially enabled, by actually using kernel resources like
         | DMA-BUFs and what not, where apps can just allocate whatever &
         | pass the compositor filehandles to the bufs. Wayland is ground
         | up, unlike X's top down. So it's just a matter of writing
         | compositors smart enough to push what data from whom needs to
         | get rendered and sent out where.
         | 
         | I would love to know more about what hardware virtualization
         | really buys, know more about the limitations of what VDI is
         | possible in software. But my hope is, in not too long, there's
         | good enough VDI infrastructure that it's basically moot whether
         | a gpu has hardware support. There will be some use cases where
          | yes every user needs to run their own kernel & OS, and that
         | won't be supported (albeit virtio might workaround even that
         | quite effectively), but for 95% of use cases the more modern
         | software stack might make this a non-issue. And at that point,
         | these companies might stop having such expensive-ass product
         | segmentation, charging 3x as much to have a couple hardware
         | virtual devices, since in fact it costs them essentially
         | nothing & the software virtualization is so competitive.
        
       | 01100011 wrote:
       | As far as I understand it, AMD basically has to do this because
       | games are going to increasingly rely on LLMs & generative AI
       | operating simultaneously with the graphics pipeline.
        
         | imbusy111 wrote:
         | It has nothing to do with games. The market outside of games
         | for compute is much bigger at the moment with the AI hype, and
         | AMD is positioned to take a good slice of it, if they get their
         | software stack in order.
        
       | alex21212 wrote:
        | ROCm and AMD drive me nuts. The lack of support for consumer
       | cards and the hassle of getting basic things in pytorch to just
       | work was too much.
       | 
       | I was burned by support that never came for my 6800xt. Recently
       | went back to NVIDIA with a 4070 for pytorch.
       | 
        | I hope AMD gets their act together with ROCm, but I'm not
        | going to buy an AMD GPU until they actually fix it rather
        | than just vaguely
       | promise to add support some day ...
        
         | zucker42 wrote:
         | Exactly. I recently started a NN side project. The process for
         | setting up PyTorch was to run `pacman -S cuda` and `pip install
         | torch`. I was using a GTX 1060. If it was a project with a
         | bigger budget, I could have rented servers from AWS with all
         | the software preinstalled in no time. I don't even know if it
         | would have been possible for me to do it with AMD, even if I
         | owned an AMD graphics card.
         | 
         | People like me are small potatoes to AMD, but surely it's hard
         | to make significant inroads when it's impossible for anyone to
         | learn or do small projects on ROCM, and big projects can't rely
         | on ROCM just working.
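         
         One wrinkle worth knowing here (a hedged sketch, not from the
         thread): PyTorch's ROCm builds reuse the `torch.cuda`
         namespace, so user code written against the setup above
         translates almost directly; the backends can be told apart
         through `torch.version`.

```python
# Sketch (hedged): PyTorch ships separate CUDA and ROCm wheels, but the
# ROCm build reuses the torch.cuda API, so user code stays unchanged.
# torch.version.hip is set only on ROCm builds, torch.version.cuda only
# on CUDA builds. Guarded so this also runs where torch isn't installed.
def backend_name():
    try:
        import torch
    except ImportError:
        return "no-torch"
    if getattr(torch.version, "hip", None):
        return "rocm"
    if getattr(torch.version, "cuda", None):
        return "cuda"
    return "cpu-only"

print(backend_name())
```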
        
           | jacquesm wrote:
           | People like you are small potatoes until you have some
           | measure of success and then suddenly you're burning up GPU
           | hours by the truckload and whatever you're used to you will
           | continue using.
        
       | Tsiklon wrote:
        | I think AMD need to do something BIG in the enterprise space.
        | It seems Nvidia have the lion's share of the market, but
        | Intel have been making good strides there with their DC GPUs.
       | 
        | The software stack is the key here. If the drivers aren't
        | there, it doesn't matter what capabilities your product has
        | on paper; you can't use them.
       | 
       | AMD have on paper done well with performance in recent
       | generations of consumer cards but their drivers universally seem
       | to be the let down to making the most of their architecture.
        
         | therealmarv wrote:
          | They have! At one of their keynotes this summer, they
          | announced a direct competitor to Nvidia's AI chips for
          | enterprises: the MI300X
         | 
         | https://www.anandtech.com/show/18915/amd-expands-mi300-famil...
         | 
          | The software stack is crucial, of course, but if you buy
          | this kind of chip (meaning you have a lot of money) you can
          | probably also optimise your stack for it for some extra
          | bucks, so as not to rely on Nvidia's supply.
        
       | vegabook wrote:
       | With all due respect this is an insult to those of us who have
       | loyally purchased AMD for numerous years, trying our very best to
       | do compute with days, nay weeks, of attempts.
       | 
        | Now, 5 years too late, we get told it's suddenly their number
        | one priority.
       | 
       | Too late. Not only has all goodwill gone, but it's in deep
       | negative territory. Even 50% lower performance stacks like Intel
       | / Apple are much more appealing than AMD will ever be at this
       | stage.
        
       | capableweb wrote:
        | The "senior VP of the AI group at AMD" said at an "AI
        | Hardware Summit" that "my area is AMD's No. 1 priority".
       | 
       | Tell me when the rest of the company aligns with you and has
       | started to show any results in providing a good experience for
       | people to do machine learning with AMD. As it stands right now,
       | there is so much tooling missing, and the tooling that's there is
       | severely lacking.
       | 
        | But I have faith. They've reinvented themselves with CPUs,
       | multiple times, so why not with GPUs, again?
        
         | mindcrime wrote:
         | _Tell me when the rest of the company aligns with you_
         | 
         | More or less the same message has been promulgated[1][2] by no
         | less than Lisa Su[3], FWIW.
         | 
         | [1]: https://www.phoronix.com/news/Lisa-Su-ROCm-Commitment
         | 
         | [2]: https://www.forbes.com/sites/iainmartin/2023/05/31/lisa-
         | su-s...
         | 
         | [3]: https://en.wikipedia.org/wiki/Lisa_Su
        
       | no_wizard wrote:
        | The inevitable fight here is between ROCm, which may have
        | hundreds of AMD engineers working on it and related verticals
        | at best (absent significant changes at the company), plus
        | whatever contributions they can muster from the community,
        | and CUDA.
        | 
        | At last headcount check, I think CUDA had _thousands_ of
        | engineers working on it and related verticals.
       | 
       | I know there's a philosophy that states, eventually, open source
       | eats everything, however, this one seems like there is so much
       | catch up that AMD will need to spend big and fast to get off the
       | ground competitively.
        
         | [deleted]
        
       | martinald wrote:
       | It's absolutely mindboggling to me that AMD is still struggling
       | so badly on this.
       | 
       | There is an absolutely enormous market for AMD GPUs for this, but
       | they seem to be completely stuck on how to build a developer
       | ecosystem.
       | 
        | Why aren't AMD throwing as many developers as possible at
        | submitting PRs for the open source LLM efforts, adding ROCm
        | support, for example?
       | 
        | It would give AMD real-world insight into the problems with
        | their drivers and SDKs as well, which are incredibly numerous.
       | 
       | People would be willing to overlook a huge amount of jank for
       | cheap(er) cards with large VRAM configurations. I don't think
        | they even need to be particularly fast, just have the VRAM
       | needed, which I'm sure AMD could put specialist cards together
       | for.
        
         | hedgehog wrote:
         | Historically they believed that "the community" would address
         | broader ML software support. I think the idea was they could
         | assign dedicated engineers for bigger customers and together
         | that was a sort of Pareto-goodish solution given their
         | constraints as a company. Even in retrospect I'm not sure if
         | that was a good call or not.
        
           | Almondsetat wrote:
            | I mean, they _would_ be right if all their cards, both
            | consumer and enterprise, supported the same programming
            | interface.
            | 
            | You cannot trust the community to do the work for you but
            | then only make the software available for $Xk cards.
        
       | ryukoposting wrote:
       | s/OpenCL/ROCm/g
        
       | pixelpoet wrote:
       | Oh man, this is exactly what I want to see on HN frontpage!
       | 
        | I commented on another article, about an AMD chip that had no
        | OpenCL support, that this made it dead in the water for me,
        | and was downvoted. Surely everyone understands how important
        | CUDA is, and everyone should understand how important open
        | standards are (e.g. FreeSync vs Nvidia's GSync), so I can't
        | understand why more people don't share my zeal for OpenCL.
       | 
        | I've shipped two commercial products based on it which still
        | work perfectly today on all 3 desktop platforms from all GPU
       | vendors... what's not to love?
        
         | tysam_and wrote:
         | If they can make a 288 GB $4.4-6.8k prosumer, home-computer-
         | friendly graphics card, I will be extremely happy. Might be a
         | pipe dream (today at least, lol, and standard in like...what, 5
         | years?), but if they can pull that off, then I think things
         | would really change a lot.
         | 
         | I don't care if it's slow, bottom-of-the-barrel GDDR6, or
         | whatever, just being able to enter the high-end model
         | finetuning & training regime for ML models on a budget
         | _without_ dilly-dallying with multiple graphics cards (a
         | monstrous pain-in-the-neck from a software, engineering, &
          | experimentation perspective) would enable so much large-scale
         | development work to happen.
         | 
         | The compute is extremely important, and in most day-to-day
         | usecases, the memory bandwidth even moreso, but boy oh boy
         | would I love to enter the world offered by a large unified card
         | architecture.
         | 
         | (Basically, in my experience, parallelizing a model across
         | multiple GPUs is like compiling from code to a binary --
         | technically you can 'edit' it, but it's like directly hex
         | editing strings in a binary blob, extremely limited. Hence why
         | I try to stick with models that take only a few seconds
         | (minutes at most) to train on highly-representative tasks,
         | distill first principles, and then expand and exploit that to
         | other modalities from there).
        
         | Conscat wrote:
         | OpenCL isn't very useful now that we have Vulkan. Its biggest
         | advantage is that there exist C++ compilers for its kernels.
         | But AMD's OpenCL runtime inserts excessive memory barriers not
         | required by the spec (they won't fix this due to Hyrum's Law)
         | and Vulkan gives you more control over the memory allocation
         | and synchronization anyways. If we had better Vulkan shader
         | compilers, OpenCL would serve basically no purpose, at least
         | for AMD hardware.
        
           | cpill wrote:
           | AI libs could use it and we'd break the bonds in CUDA. Also
           | Rust might get an implementation which would give it they
           | non-intervention to overtake C++
        
             | pjmlp wrote:
             | No it wouldn't, until it provides the same polyglot support
             | and graphical tooling as CUDA.
             | 
              | At least Intel is trying in that direction with oneAPI.
        
           | raphlinus wrote:
            | Yeah, that's a big if. In theory there's nothing
            | preventing good compilation to Vulkan compute shaders; in
            | practice, people just aren't doing it, as CUDA actually
            | works today.
           | 
           | I also agree that Vulkan is more promising than OpenCL. With
           | recent extensions, it has real pointers (buffer device
           | address), cooperative matrix multiplication (also known as
           | tensor cores or WMMA), scalar types other than 32 bits,
           | proper barrier (including device-scoped, needed for single
           | pass scan), and other important features.
        
           | 20k wrote:
            | It's not that they're supporting buggy code; they just
            | downgraded the quality of their implementation
            | significantly. They made the compiler a lot worse when
            | they swapped to ROCm.
           | 
           | https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/iss...
           | is the tracking issue for it, filed a year ago, which appears
           | to be wontfix, largely because it's a lot of work.
           | 
           | OpenCL unfortunately still supports quite a few things that
           | Vulkan doesn't, which makes swapping away very difficult for
           | some use cases.
        
         | parl_match wrote:
         | > I can't understand why more people don't share my zeal for
         | OpenCL.
         | 
         | When I last worked with it, it was difficult, unstable, and
         | performed poorly. CUDA, on the other hand, has been nothing but
         | good. Well, Nvidia pricing aside ;)
         | 
         | OpenCL might be a lot better now, but for a lot of us, we
         | remember when it was actively a bad choice.
        
         | Vvector wrote:
         | But is this just more BS from AMD?
         | 
         | https://www.bit-tech.net/reviews/tech/cpus/amd-betting-every...
         | AMD Betting Everything on OpenCL (2011)
        
           | jjoonathan wrote:
           | I'm pretty sure the NVDA pump finally convinced the AMD board
           | / C-Suite to prioritize this, but it takes time to steer a
           | big ship. I'm hopeful, but there are still bad incentives to
           | jump the gun on announcements so I'll let others take the
           | plunge first.
        
         | kldx wrote:
         | > I've shipped two commercial products based on it which still
         | works perfectly today on all 3 desktop platforms from all GPU
         | vendors... what's not to love?
         | 
         | In my experience, if commercial products involved any sort of
         | hand-optimized, proprietary OpenCL, one would be shocked by the
         | lack of documentation and zero consistency across AMD's GPUs.
         | Intel has SPIR-V and Nvidia has PTX, and this works pretty
         | well. But some AMD cards support SPIR or SPIR-V, some don't,
         | and this support matrix keeps changing over time without a
         | single source of truth.
         | 
         | Throw in random segfaults inside AMD's OpenCL implementation
         | and you have a fun day debugging!
         | 
         | Dockerizing OpenCL on AMD is another nightmare I don't want to
         | get into. With Intel, it's literally just installing the
         | compute runtime and mapping `/dev/dri` into the container. On
         | paper, AMD has the same process, but in reality I had to run
         | `LD_DEBUG=binding` so many times just to figure out why AMD's
         | runtime breaks inside Docker.
         | 
         | There may be great upsides to AMD's hardware in other domains,
         | though.
        
         | jjoonathan wrote:
         | For a long time, AMD promoted OpenCL as viable without it
         | actually being viable. This leaves scars and resentment. Mine
         | come from about 10 years ago. They run deep.
         | 
         | I'm glad to hear your experience was better, but I'm fresh out
         | of trust. This time, I need to see major projects in my
         | application areas working on AMD _before_ I buy, because AMD
         | has taught me that "trust us" and "just around the corner" can
         | mean "10 years later and it still hasn't happened." I'm pretty
         | sure that this time _is_ different, but the green tax is dirt
         | cheap compared to learning this lesson the hard way, so I'm
         | letting others jump first this time.
        
       | gdiamos wrote:
       | Relevant, we deployed Lamini on hundreds of MI200 GPUs.
       | 
       | Lisa tweet: https://x.com/LisaSu/status/1706707561809105331?s=20
       | 
       | Lamini tweet:
       | https://x.com/realSharonZhou/status/1706701693684154766?s=20
       | 
       | Blog:
       | https://www.lamini.ai/blog/lamini-amd-paving-the-road-to-gpu...
       | 
       | Register:
       | https://www.theregister.com/2023/09/26/amd_instinct_ai_lamin...
       | 
       | CRN:
       | https://www.crn.com/news/components-peripherals/llm-startup-...
       | 
       | The hard part about using any AI chips other than Nvidia's has been
       | software. ROCm is finally at the point where it can train and
       | deploy LLMs like Llama 2 in production.
       | 
       | If you want to try this out, one big issue is that software
       | support is hugely different on Instinct vs Radeon. I think AMD
       | will fix this eventually, but today you need to use Instinct.
       | 
       | We will post more information explaining how this works in the
       | next few weeks.
       | 
       | The middle section of the blog post above includes some details
       | including GEMM/memcpy performance, and some of the software
       | layers that we needed to write to run on AMD.
        
         | mardifoufs wrote:
         | What's the cost benefit vs. Nvidia? Is it cheaper?
        
           | light_hue_1 wrote:
           | You simply cannot buy nvidia GPUs at scale at the moment.
           | We're getting quotes that are many months out, sometimes even
           | a year+ out.
        
             | gdiamos wrote:
             | We kept hearing 52 weeks for new shipments.
        
           | gdiamos wrote:
           | Available in orders of up to 10,000 GPUs today - no shortage
           | 
           | More than 10x cheaper than allocating machines on a tier 1
           | cloud - AWS, Azure, GCP, Oracle, etc
           | 
           | More memory - 128GB HBM per GPU - means bigger models fit for
           | training/inference without the nightmare of model parallelism
           | over MPI/infiniband/etc
           | 
           | Longer term - finetuning optimizations
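A back-of-envelope sketch of why the extra HBM matters (the figures below are illustrative assumptions, not numbers from the thread): a 70B-parameter model at fp16 needs roughly 140 GB for the weights alone, so per-GPU memory determines how many devices the model must be sharded across before you even touch activations or optimizer state.

```python
import math

# Rough memory needed just to hold model weights at a given precision.
def weight_gb(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 1e9

llama2_70b = 70e9   # parameter count
fp16 = 2            # bytes per parameter

need = weight_gb(llama2_70b, fp16)
print(f"70B fp16 weights: {need:.0f} GB")  # 140 GB

# Minimum GPUs just to hold the weights (ignoring activations, KV cache,
# and optimizer state, which add substantially more for training):
for name, hbm in [("128 GB HBM GPU", 128), ("80 GB HBM GPU", 80)]:
    print(name, "->", math.ceil(need / hbm), "GPU(s)")
```

For this size, both configurations still need two devices for the weights, but the larger-memory part leaves far more headroom for activations and batching before full model parallelism becomes necessary.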
        
             | mardifoufs wrote:
              | Ah! The memory sounds interesting. How would that compare
              | to similar Nvidia hardware w.r.t. cost, assuming the
              | hardware were available?
             | 
             | Does AMD provide something similar to nvlink, and even
             | libraries like cudnn?
             | 
             | Also, last I checked none of the public clouds offered any
             | of the latest gens MI GPUs, so I wasn't aware that it had
             | good availability! Azure had a preview but I'll look more
             | into it now.
             | 
             | Thank you for your answer btw!
        
               | gdiamos wrote:
               | Yeah getting around the no public cloud thing was really
               | annoying. We had to build our own datacenter.
               | 
               | On the plus side, it was drastically cheaper and now we
               | can just slot in machines.
               | 
               | I would prefer that a tier 1 cloud made MI GPUs available
               | though. It would make it so much more accessible.
        
               | gdiamos wrote:
                | See the memory size comparison (GB) in this table:
                | https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...
        
               | tbihl wrote:
               | It blows my mind that A100 and H100 are each safely below
               | 1000W power draw.
        
           | gardnr wrote:
           | The classic economic benefits of competition:
           | 
           | * Drives down price
           | 
           | * Enhances product features (I see them competing on VRAM
           | first)
           | 
            | * Helps to insulate buyers from supply issues
           | 
           | Nvidia has kneecapped their consumer grade hardware to ensure
           | the gaming market still has scraps to buy in spite of crypto
           | mining and the AI gold rush. All AMD would have to do to eat
           | into Nvidia marketshare is remove the hardware locks in low-
           | end cards and ship one with 64GB+ of VRAM.
           | 
           | This of course would only work if they have comparable/usable
           | software support. Any improvements to ROCm will be a boon for
           | any company that doesn't already have or can't afford huge
           | farms of high-end Nvidia chips.
        
         | jauntywundrkind wrote:
         | > _If you want to try this out, one big issue is that software
         | support is hugely different on Instinct vs Radeon. I think AMD
         | will fix this eventually, but today you need to use Instinct._
         | 
         | I'm really, really worried about AMD and whether they're going
         | to care about anyone else. They might just care about Instinct,
         | where margins are so high, and ignore consumer cards, or add
         | more friction and segmentation for them.
         | 
         | Part of what made CUDA so successful was that the low hardware
         | barrier to entry created such a popular offering. Everyone used
         | it. I really hope AMD realizes that, and really hope AMD
         | invests in consumer card software too. Just making it work on
         | the high end doesn't seem enough to get the kind of mass-
         | movement ecosystem success AMD really needs. I'm afraid they
         | might go for a smaller win, try to compete only at the top.
        
         | dotnet00 wrote:
         | It's nice to hear that there are actual results to show, since
         | AMD execs simply saying that ROCm is a priority isn't really
         | convincing anymore given their track record on claims regarding
         | support on the consumer side.
        
           | viewtransform wrote:
           | The difference this time is that the executive is from
           | Xilinx. Xilinx has had an AI software development team for a
           | while in the FPGA space.
           | 
           | AMD has had poor management in the GPU computing space since
           | Raja Koduri's time (he put the best engineering resources on
           | VR during his tenure and ignored deep learning). Subsequent
           | directors have not had a long term vision and left within a
           | few years.
           | 
           | Looks like Lisa Su has corrected this now - they seem to have
           | moved AMD software engineers en masse to work under Xilinx
           | management on AI. Remains to be seen if this new management
           | hierarchy will have a better vision and customer focus.
        
             | varelse wrote:
             | [dead]
        
         | tbruckner wrote:
         | I would really hope you could get decent utilization on ops as
         | fundamental as GEMM/memcpy on a single device. Translating that
         | to MFU is a completely different story.
        
           | gdiamos wrote:
           | We get good utilization at scale as well. Typically 30-40% of
           | peak at the full application level for training and
           | inference.
           | 
           | Perf isn't the biggest problem though, many AI chips can do
           | this or a bit better on benchmarks, if you invest the
           | engineering time to tune the benchmark.
           | 
           | The really hard part is getting a complete software stack
           | running.
           | 
           | It took us over 3 years because many of the layers just
           | didn't exist, e.g. scale out LLM inference service that
           | supports multiple requests with fine-grained batching across
           | models distributed over multiple GPUs.
           | 
           | On Instinct, ROCm gets you the ability to run most pytorch
           | models on one GPU assuming you get the right drivers,
           | compilers, framework builds, etc.
           | 
           | That's a good start, but you need more to serve a real
           | application.
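As a sketch of what "30-40% of peak" means in practice, this is the standard utilization arithmetic for a GEMM: count the FLOPs the operation requires, divide by measured time, and compare to the device's peak throughput. The device and timing numbers below are hypothetical, not Lamini's.

```python
# FLOPs for an M x K by K x N matrix multiply:
# one multiply + one add per inner-product term.
def gemm_flops(m: int, n: int, k: int) -> int:
    return 2 * m * n * k

def utilization(m: int, n: int, k: int, seconds: float, peak_flops: float) -> float:
    return gemm_flops(m, n, k) / seconds / peak_flops

# Hypothetical: an 8192^3 GEMM finishing in 10 ms on a 180 TFLOP/s device.
u = utilization(8192, 8192, 8192, 10e-3, 180e12)
print(f"{u:.0%} of peak")  # ~61% of peak
```

Kernel-level numbers like this are usually well above the whole-application figure, since end-to-end training also spends time on memory traffic, communication, and non-GEMM ops.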
        
             | mgaunard wrote:
             | People have been using their GPGPUs for decades on a
             | variety of scientific applications, and there are all kinds
             | of hybrid and multi-device frameworks that exist (often
             | supporting multiple backends).
             | 
             | The difference is that it didn't get a lot of love as part
             | of the overhyped python LLM movement.
        
               | gdiamos wrote:
               | Completely agree, I'd love to see some of the innovations
               | from HPC move over into their LLM stack.
               | 
               | We are working on it, but it takes time.
               | 
                | Contributions to foundational layers like rocBLAS,
                | PyTorch, Slurm, Tensile, Hugging Face, etc. would help.
        
       | dauertewigkeit wrote:
       | With all this hype about CUDA, I have recently started looking
       | into programming CUDA as a job as I love that kind of challenge,
       | but to my dismay I found that these tasks are very niche. So it
       | is not even that people are routinely writing new CUDA code. It's
       | just that the current corpus is too big and comprehensive for
       | alternatives to compete with.
        
         | jacquesm wrote:
         | That and a massive amount of experience already out there on
         | how to optimize for that particular architecture. NVidia has
         | done well for itself on the back of four sequential very good
         | bets coupled with dedication unmatched by any other vendor,
         | both on the hardware and on the software side. It also was one
         | of the few times that I didn't care if I ran the vendor
         | supplied closed source stuff because it seemed to work just
         | fine and I never had the feeling they would suddenly drop
         | support for my platform.
        
         | coder543 wrote:
         | Specialized skills can have a fairly small job market
         | sometimes. I think a lot of CUDA code ends up being
         | foundational as part of popular libraries, supporting tons of
         | applications that never need to write a single line of CUDA
         | themselves.
        
       | ckastner wrote:
       | The Debian ROCm Team [1] has made quite a bit of progress in
       | getting the ROCm stack into the official Archive.
       | 
       | Most components are already packaged, the next big target is
       | adding support to the PyTorch package.
       | 
       | Many of the packages are older versions; this is because getting
       | broad coverage was prioritized. The other next big target that is
       | currently being worked on is getting full ROCm 5.7 support.
       | 
       | I fully expect Debian 13 (trixie) to come with full ROCm support
       | out-of-the-box, and as a consequence, also derivatives to have
       | support (Ubuntu above all). In fact, there will almost certainly
       | be backports of ROCm 5.7 to Debian 12 (bookworm) within the next
       | few months, so one will be able to just
       | 
       |     $ sudo apt-get install pytorch-rocm
       | 
       | One current obstacle is infrastructure: the Debian build and CI
       | infrastructures (both hardware and software) were not designed
       | with GPUs in mind. This is also being worked on.
       | 
       | Edit: forgot to say that the CI infra that the Team is setting up
       | here tests all of these packages on consumer cards, too. So while
       | there may not be _official_ support for most of these, upstream
       | tests passing on the cards within the infra should be a good
       | indication for _practical_ support.
       | 
       | [1] https://salsa.debian.org/rocm-team/
        
         | avcxz wrote:
         | I'd also like to point out that ROCm has been packaged for Arch
         | Linux since the beginning of 2023, with efforts dating back to
         | March 2020 [1].
         | 
         | Currently on Arch Linux you can run the following successfully:
         | $ sudo pacman -S python-pytorch-rocm
         | 
         | Arch Linux even has ROCm support with blender.
         | 
         | [1] https://github.com/rocm-arch
        
       | mgaunard wrote:
       | AMD has a history of providing sub-par software, and their
       | strategy of (partially) opening up their specifications and
       | having other people write the software for free didn't work
       | either.
       | 
       | Nvidia has huge software teams, and so does Intel.
        
         | mindcrime wrote:
         | I don't know if they'll ultimately succeed or not, but they at
         | least seem to be putting genuine effort into this. ROCm
         | releases are coming out at a relatively nice clip[1], including
         | a new release just a week or two ago[2].
         | 
         | [1]: https://github.com/RadeonOpenCompute/ROCm/releases
         | 
         | [2]: https://www.phoronix.com/news/AMD-ROCm-5.7-Released
        
           | Vvector wrote:
           | Yeah, AMD is doing more with ROCm. But are they catching up
           | to Nvidia, or just not falling behind as fast as before? Only
           | time will tell
        
         | dagw wrote:
         | Not only sub-par software, but sub-par software that they drop
         | support for after a couple of years. People can work around the
         | problems with sub-par software if they believe that it will
         | benefit them long term. They will absolutely not put in the
         | effort if they fear it will be completely useless in 2 years
         | time.
        
       | raphlinus wrote:
       | ROCm makes me sad, as it reminds me of how much better GPUs could
       | be than they are today.
       | 
       | I've lately been exploring the idea of a "Good Parallel
       | Computer," which combines most of the agility of a CPU with the
       | efficient parallel throughput of a GPU. The central concept is
       | that the decision to launch a workgroup is made by a programmable
       | controller, rather than just being a cube of (x, y, z) or
       | downstream of triangles. A particular workload it would likely
       | excel at is sparse matrix multiplication, including multiple
       | quantization levels like SpQR[1]. I'm hopeful that it could be an
       | advance in execution model, but also a simplification, as I
       | believe a lot of the complexity of the current GPU model is
       | because of lots of workarounds for the weak execution model.
       | 
       | I'm not optimistic about this being built any time soon, as it
       | requires rethinking the software stack. But it's fun to think
       | about. I might blog about it at some point, but I'm also
       | interested in connecting with people who have been thinking along
       | similar lines.
       | 
       | [1]: https://arxiv.org/abs/2306.03078
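To make the sparse-matrix-multiplication workload concrete, here is a minimal CSR (compressed sparse row) matrix-vector multiply in plain Python. This is illustrative only; SpQR's actual quantized kernels are far more involved. The irregular, data-dependent inner loop is exactly what makes this workload awkward for a fixed-grid GPU dispatch model.

```python
# Sparse (CSR) matrix times dense vector.
# indptr[row] .. indptr[row+1] delimits the nonzeros of each row;
# indices holds their column positions, data their values.
def csr_matvec(indptr, indices, data, x):
    y = []
    for row in range(len(indptr) - 1):
        acc = 0.0
        for j in range(indptr[row], indptr[row + 1]):
            acc += data[j] * x[indices[j]]
        y.append(acc)
    return y

# 3x3 matrix [[1,0,2],[0,3,0],[4,0,5]] in CSR form:
indptr  = [0, 2, 3, 5]
indices = [0, 2, 1, 0, 2]
data    = [1.0, 2.0, 3.0, 4.0, 5.0]
print(csr_matvec(indptr, indices, data, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```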
        
         | johncolanduoni wrote:
         | How does this differ from CUDA's dynamic parallelism, which
         | lets you launch kernels from within a kernel?
        
           | raphlinus wrote:
           | There are a lot of similarities, but the granularity is
           | finer. The idea is that you make a decision to launch one
           | workgroup (typically 1024 threads) when the input is
           | available, which would typically be driven by queues, and
           | potentially with joins as well, which is something the new
           | work graph stuff can't quite do. Otherwise the idea of stages
           | running in parallel, connected by queues, is similar. But I
           | did an analysis of work graphs and came to the conclusion
           | that it wouldn't help with the Vello (2d vector graphics)
           | workload at all.
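The queue-driven launch idea can be sketched, very loosely, as a CPU-side toy: a controller pulls work items from an input queue and "launches" a workgroup as soon as an item is available, rather than dispatching a fixed (x, y, z) grid up front. All names here are illustrative, not any real GPU API.

```python
from queue import Queue
from threading import Thread

def workgroup(item):
    # Stand-in for ~1024 GPU threads processing one tile of work.
    return item * item

def controller(inputs: Queue, outputs: Queue, workers: int = 4):
    # Each worker launches a workgroup whenever an input is ready.
    def run():
        while True:
            item = inputs.get()
            if item is None:      # sentinel: no more work for this worker
                break
            outputs.put(workgroup(item))
    threads = [Thread(target=run) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                  # join point: wait for the queue to drain

inputs, outputs = Queue(), Queue()
for i in range(8):
    inputs.put(i)
for _ in range(4):                # one sentinel per worker
    inputs.put(None)
controller(inputs, outputs)
print(sorted(outputs.queue))      # squares of 0..7
```

A real hardware version would of course make the controller itself programmable on-device; the point of the toy is only the shape of the execution model: launch-on-data-ready plus an explicit join, rather than a monolithic grid.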
        
         | JonChesterfield wrote:
         | A workgroup/kernel can launch other ones without talking to the
         | host, like CUDA's dynamic parallelism except with no nested
         | lifetime restrictions. This is somewhat documented under the
         | name HSA.
         | 
         | Involves getting a pointer to a HSA queue and writing a
         | dispatch packet to it. Same interface the host has for
         | launching kernels - easier in some ways (you've got the kernel
         | descriptor as a symbol, not as a name to dlsym) and harder in
         | others (dynamic memory allocation is a pain).
        
           | raphlinus wrote:
           | Yeah, dynamic memory allocation from GPU space seems to be
           | the real sticking point. I'll look into HSA queues, that
           | looks very interesting, thanks.
        
       ___________________________________________________________________
       (page generated 2023-09-26 23:00 UTC)