[HN Gopher] Metal shader converter and the missing device-scoped...
___________________________________________________________________
Metal shader converter and the missing device-scoped barrier

Author : raphlinus
Score  : 25 points
Date   : 2023-06-12 18:40 UTC (1 day ago)

(HTM) web link (raphlinus.github.io)
(TXT) w3m dump (raphlinus.github.io)

| tedunangst wrote:
| So how [well] does MoltenVK work? The prevailing attitude I've
| seen is basically "just target vulkan for everything because it
| just works" but I'm not sure how much experience is reflected in
| such claims.

| raphlinus wrote:
| If you're doing advanced compute work (including lock-free data
| structures), then it's best effort.
|
| https://github.com/linebender/vello/issues/42 is an issue from
| when Vello (then piet-gpu) had a single-pass prefix sum
| algorithm. Looking back, I'm fairly confident that it's a
| shader translation issue and that it wouldn't work with
| MoltenVK either, but we stopped investigating when we moved to
| a more robustly portable approach.

| bronxbomber92 wrote:
| I believe this post is referring to device-scoped _memory_
| barriers - also sometimes called fences - as opposed to
| _execution_ barriers.
|
| The former is a mechanism to ensure memory accesses follow a
| well-defined order (e.g. it'd be bad if the memory accesses
| executed inside a critical section could be reordered before or
| after the lock and unlock calls).
|
| The latter is a mechanism that ensures all threads (within
| some scope, perhaps all threads running on the "device") reach
| the same point in the program before any are allowed to proceed.

| raphlinus wrote:
| That's correct, it's the _memory scope_ that I expect to be
| device-scoped. GPUs tend not to have execution barriers in the
| shader language beyond workgroup scope; generally the next
| coarser granularity for synchronization is a separate dispatch.
| However, single-pass prefix sum algorithms, including decoupled
| look-back, can function just fine with device-scoped memory
| barriers, and do not require execution barriers with coarser
| scope than workgroup.

| Animats wrote:
| Apple having to Think Different means we need about two more
| layers in portable games.

| richdodd wrote:
| Does the M1/M2 use ARM designs in the GPU as well as the CPU? If
| so, it might be possible to work out what could be implemented by
| looking at the [arm docs](https://developer.arm.com/documentation/102203/0100/Valhall-...).

| richdodd wrote:
| Hmm OK, according to the documentation they designed the GPU
| themselves, so there's no public information on it.

| nicoburns wrote:
| No, they have a custom GPU design originally derived from
| Imagination Technologies PowerVR GPUs.

| raphlinus wrote:
| The most complete documentation is in the applegpu repo[1] by
| dougallj, which shows a great deal of recent activity (including
| by alyssarosenzweig). Last I checked, the documentation of barrier
| instructions wasn't complete enough to tell whether these
| device-scoped barriers are possible. (Note: on RDNA2, they're
| accomplished by DLC and GLC flags on memory accesses, combined
| with cache flush instructions such as S_GL1_INV).
|
| There's also a lot of great material, accessibly written, on
| Alyssa's blog[2], see in particular the posts titled
| "Dissecting the Apple M1 GPU, part ${I}".
|
| [1]: https://github.com/dougallj/applegpu
|
| [2]: https://rosenzweig.io/

| DeRock wrote:
| Apple doesn't use ARM IP for either, and hasn't for many years.
___________________________________________________________________
(page generated 2023-06-13 23:01 UTC)
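
The distinction drawn in the thread between memory barriers and execution barriers can be illustrated with a CPU-side analogy. The sketch below is not GPU code and is not taken from Vello or from the linked post. It is a simple chained scan, a plainer relative of decoupled look-back, in which each thread plays the role of one workgroup. The names `Slot` and `chained_scan` are hypothetical. Each thread publishes its running total behind a release fence and reads its predecessor's total after an acquire fence. No thread ever waits for all the others to reach a common point, so there is no execution barrier across partitions, only memory ordering.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU32, Ordering};
use std::thread;

/// Per-partition state (hypothetical name): the inclusive prefix of
/// everything up to and including this partition, plus a "ready" flag.
struct Slot {
    prefix: AtomicU32,
    ready: AtomicBool,
}

/// Inclusive prefix sum computed by one thread per partition.
/// Threads only exchange data through `Slot`s ordered by memory fences.
fn chained_scan(input: &[u32], part_len: usize) -> Vec<u32> {
    assert!(part_len > 0, "partition length must be non-zero");
    let parts: Vec<&[u32]> = input.chunks(part_len).collect();
    let slots: Vec<Slot> = parts
        .iter()
        .map(|_| Slot {
            prefix: AtomicU32::new(0),
            ready: AtomicBool::new(false),
        })
        .collect();
    let mut out = vec![0u32; input.len()];
    let (parts, slots) = (&parts, &slots);

    thread::scope(|s| {
        let work = parts.iter().zip(out.chunks_mut(part_len)).enumerate();
        for (i, (part, out_part)) in work {
            s.spawn(move || {
                // Local reduction; needs no communication at all.
                let local = part.iter().fold(0u32, |a, &x| a.wrapping_add(x));

                // Look back at the predecessor: spin on its flag, then use
                // an acquire fence so the prefix read below cannot be
                // reordered ahead of the flag observation.
                let base = if i == 0 {
                    0
                } else {
                    while !slots[i - 1].ready.load(Ordering::Relaxed) {
                        std::hint::spin_loop();
                    }
                    fence(Ordering::Acquire);
                    slots[i - 1].prefix.load(Ordering::Relaxed)
                };

                // Publish: write the data, fence, then raise the flag. The
                // release fence keeps the data write ahead of the flag write.
                slots[i].prefix.store(base.wrapping_add(local), Ordering::Relaxed);
                fence(Ordering::Release);
                slots[i].ready.store(true, Ordering::Relaxed);

                // Final local scan, offset by the predecessor's prefix.
                let mut acc = base;
                for (o, &x) in out_part.iter_mut().zip(part.iter()) {
                    acc = acc.wrapping_add(x);
                    *o = acc;
                }
            });
        }
    });
    out
}
```

Calling `chained_scan(&[1, 2, 3, 4, 5, 6, 7], 3)` splits the input into three partitions and returns the inclusive prefix sums. The spin loop is safe here only because the operating system schedules every thread eventually; a GPU version additionally depends on the forward-progress behavior of the hardware, and on the fences actually having device scope, which is the property the thread is discussing.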