[HN Gopher] C-for-Metal: High Performance SIMD Programming on In...
___________________________________________________________________

C-for-Metal: High Performance SIMD Programming on Intel GPUs

Author : lelf
Score  : 53 points
Date   : 2021-01-29 09:27 UTC (13 hours ago)

(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)

| raphlinus wrote:
| Another interesting reference from a few years ago: http://www.joshbarczak.com/blog/?p=1028
|
| Also read the followups (1120 and 1197), as they go into considerably more detail about the SPMD programming model and some use cases.
|
| The author is now at Intel working on ray tracing.

| astrange wrote:
| Intel has a previous SPMD compiler here: https://ispc.github.io
|
| Although the author seems to have fled Intel soon after releasing it, and apparently spent the whole development process terrified that corporate politics would make him cancel it.

| einpoklum wrote:
| > The SIMT execution model is commonly used for general GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications as the underlying ISA is SIMD and important hardware capabilities cannot be fully utilized.
|
| What? That makes no sense.
|
| GPU processor cores are basically just SIMD with a different color hat. The SASS assembly simply has _only_ SIMD instructions - and with the full instruction set being SIMD'ized, it can drop the mention of "this is SIMD" and just pretend individual lanes are instruction-locked threads.
|
| So an OpenCL compiler would do very similar parallelization on a GPU and on an Intel CPU. (It's obviously not exactly the same, since the instruction sets do differ, the widths are not the same, and Intel CPUs have different widths which could all be active at the same time, etc.)
|
| So the hardware capabilities can be utilized just fine.

| 37ef_ced3 wrote:
| Domain-specific compilers that generate explicit SIMD code from a high-level specification are even nicer. These can fully exploit the capabilities of the instruction set (e.g., fast permutes, masking, reduced-precision floats, large register file, etc.) for a particular domain.
|
| For example, generating AVX-512 code for convnet inference: https://NN-512.com
|
| NN-512 does four simultaneous 8x8 Winograd tiles (forward and backward) in the large AVX-512 register file, accelerates strided convolutions by interleaving Fourier transforms (again, with knowledge of the large register file), makes heavy use of the two-input VPERMI2PS permutation instruction, generates simplified code with precomputed masks around tensor edges, uses irregular/arbitrary tiling patterns, etc. It generates code like this:
|
| https://nn-512.com/example/11
|
| This kind of compiler can be written for any important domain.

| the_optimist wrote:
| This is great, but it doesn't address GPUs. If you built it for GPUs, from what I understand, the outcome would basically look like TensorFlow, or maybe TensorFlow XLA. Is that right?

| 37ef_ced3 wrote:
| My point is that a less general compiler can yield better SIMD code for a particular domain, and be easier to use for a particular domain. And I gave a concrete illustration (NN-512) to support that claim.
|
| Consider NN-512 (less general) versus Halide (more general). Imagine how hard it would be to make Halide generate the programs that NN-512 generates. It would be a very challenging problem.

| the_optimist wrote:
| Understood: NN-512 is a local optimum in an optimization over hardware and problem structure.
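To make the SIMT-versus-explicit-SIMD contrast in the comments above concrete, here is a minimal sketch (not taken from the paper or from NN-512; function and variable names are illustrative) of the same loop written both ways: first as the scalar per-element code a SIMT/SPMD compiler parallelizes implicitly, then with explicit AVX-512 intrinsics and a precomputed lane mask at the array edge, the style a domain-specific generator emits. Compile with -mavx512f.

    #include <immintrin.h>
    #include <cstddef>

    // SIMT/SPMD style: scalar per-element body; the compiler and hardware
    // decide how elements map onto SIMD lanes.
    void saxpy_scalar(float *y, const float *x, float a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // Explicit SIMD style: 16-wide AVX-512 FMA with a masked tail.
    void saxpy_avx512(float *y, const float *x, float a, std::size_t n) {
        const __m512 va = _mm512_set1_ps(a);
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512 vx = _mm512_loadu_ps(x + i);
            __m512 vy = _mm512_loadu_ps(y + i);
            _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));
        }
        if (i < n) {  // edge handled with a lane mask instead of a scalar loop
            __mmask16 m = (__mmask16)((1u << (n - i)) - 1u);
            __m512 vx = _mm512_maskz_loadu_ps(m, x + i);
            __m512 vy = _mm512_maskz_loadu_ps(m, y + i);
            _mm512_mask_storeu_ps(y + i, m, _mm512_fmadd_ps(va, vx, vy));
        }
    }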
| the_optimist wrote:
| Compiling from a high-level language to a GPU is a huge problem, and we greatly appreciate efforts to solve it.
|
| If I understand correctly, this (CM) allows for C-style fine-level control over a GPU device as though it were a CPU.
|
| However, it does not appear to address data transit (critical for performance). Compilation and operator fusion to minimize transit are possibly more important. See Graphcore Poplar, TensorFlow XLA, ArrayFire, PyTorch Glow, etc.
|
| Further, this obviously only applies to Intel GPUs, so investing time in exploiting low-level control is possibly a hardware dead-end.
|
| The dream world for programmers is one where data transit and hardware architecture are taken into account without living inside a proprietary DSL. Conversely, it is obviously against hardware manufacturers' interests to create this.
|
| Is MLIR / LLVM going to solve this? This list has been interesting to consider:
|
| https://github.com/merrymercy/awesome-tensor-compilers

| baybal2 wrote:
| > is possibly a hardware dead-end.
|
| I'm thinking the opposite: there has been an unending succession of accelerators for this kind of thing, which were eventually obsoleted and forgotten when general-purpose CPUs caught up to them in performance, or when comp-sci learned how to do the calculations more efficiently on mainstream hardware.
|
| Just by seeing how morbid the sales of the new "NPUs" are, I can guess it's already happening.
|
| A number of cellphone brands experimented with them to run selfie filters or do speech recognition, but later found that those work no worse on CPUs if competent programmers are hired, and then threw the NPU hardware out or stopped using it.

| banachtarski wrote:
| I'm not a hardware engineer, but I am a GPU-focused graphics engineer.
|
| > C-style fine-level control over a GPU device as though it were a CPU.
|
| Personally, I think this is a fool's errand, and this has nothing to do with my desire for job security or anything. When I look at how code in the ML world is written for a GPU, for example, it's really easy to see why it's so slow. The CPU and GPU architectures are fundamentally different: different pipelining, scalar instead of vector, 32/64-wide instruction dispatches, etc. HLSL/GLSL and other such shader languages are perfectly "high level", with the intrinsics needed to perform warp-level barriers, wave broadcasts/ballots/queries, LDS storage, device-level barriers, etc. This isn't to say that high-level shader language improvements aren't welcome, but trying to emulate a CPU is an unfortunate goal.

| mpweiher wrote:
| What kinds of improvements would you like to see?

| moonbug wrote:
| ain't no one gonna use that.

| skavi wrote:
| Why are Intel GPUs designed in such a way that typical GPU languages don't fully exploit them? Is the new Xe architecture still SIMD?

| dragontamer wrote:
| OpenCL works on Intel GPUs, while CUDA doesn't, because CUDA is an NVidia technology.
|
| > Is the new Xe architecture still SIMD?
|
| SIMD is... pretty much all GPUs do. There are a few scalar bits here and there to speed up if-statements and the like, but the entire point of a GPU is to build a machine for SIMD.

| astrange wrote:
| GPUs don't need to have SIMD instructions; if you give one a fully scalar program it just needs to run a lot of copies of it at once. Every architecture is different here, including within the same vendor.

| dragontamer wrote:
| > GPUs don't need to have SIMD instructions;
|
| Except NVidia Ampere (RTX 3xxx series) and AMD RDNA2 (Navi / 6xxx series) are both SIMD architectures with SIMD instructions.
|
| And the #3 company, Intel, also has SIMD instructions. I know that some GPUs out there are VLIW or other weird architectures, but... the "big 3" are SIMD-based.
|
| > if you give one a fully scalar program it just needs to run a lot of copies of it at once.
|
| It's emulated on a SIMD processor. That SIMD processor will suffer branch divergence as you traverse if-statements and while-statements, because it's physically SIMD.
|
| The compiler / programming model is scalar, but the assembly instructions are themselves vector. Yeah, NVidia now has per-SIMD-core instruction pointers, but that doesn't mean the hardware can physically execute different instructions: they're still all locked together SIMD-style at the physical level.
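A toy sketch of the execution model described above: the programmer writes a scalar if/else per "thread", but a SIMD machine evaluates the condition for every lane at once and then runs both sides under an execution mask, which is where the branch divergence cost comes from. The width and function name are made up for illustration; real hardware uses hardware execution masks, not loops.

    #include <array>

    constexpr int WIDTH = 8;  // illustrative SIMD width

    // Scalar source each "thread" would run: if (v < 0) v = -v; else v = v * 2;
    void simd_abs_or_double(std::array<float, WIDTH>& v) {
        std::array<bool, WIDTH> mask{};
        for (int lane = 0; lane < WIDTH; ++lane)   // vector compare
            mask[lane] = v[lane] < 0.0f;
        for (int lane = 0; lane < WIDTH; ++lane)   // "then" side, masked
            if (mask[lane]) v[lane] = -v[lane];
        for (int lane = 0; lane < WIDTH; ++lane)   // "else" side, masked
            if (!mask[lane]) v[lane] = v[lane] * 2.0f;
        // Both sides consume execution time whenever any lane needs them;
        // that is the divergence cost of running scalar-looking code on
        // physically SIMD hardware.
    }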
| raphlinus wrote:
| That's partly true, but there are exceptions, of which the subgroup operations are the most obvious. These are roughly similar to broadcast and permute SIMD instructions, and in some cases can lead to dramatic speedups.

| oivey wrote:
| OpenCL is basically dead at this point, too. The de facto standard is CUDA and there aren't currently any real challengers. Maybe eventually AMD's ROCm or Intel's oneAPI will get traction.

| pjmlp wrote:
| For them to get traction, they need to invest in debugger tooling that allows the same productivity as on CPUs, and to help language communities other than C and C++ to target GPGPUs.
|
| NVidia started doing both around CUDA 3.0, whereas Khronos, AMD and Intel only noticed that not everyone wanted to do printf()-style debugging with a C dialect when it was already too late to get people's attention back.

| profquail wrote:
| oneAPI uses DPC++ (Data-Parallel C++), which is pretty much just SYCL, which itself is a C++ library on top of OpenCL.
|
| From my understanding, the Khronos group realized OpenCL 2.x was much too complicated, so vendors just weren't implementing it, or were only implementing parts of it, so they came up with OpenCL 3.0, which is slimmed down and much more modular. It's hard to say how much adoption it'll get, but with Intel focused on DPC++ and oneAPI now, there will definitely be more numerical software coming out in the next few years that compiles down to and runs on OpenCL.
|
| For example, Intel engineers are building a numpy clone on top of DPC++, so unlike regular numpy it'll take advantage of multiple CPU cores: https://github.com/IntelPython/dpnp
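For reference, the DPC++/SYCL programming model mentioned above is the scalar, per-work-item style discussed earlier in the thread: the kernel body is written for one element and the compiler/runtime maps work-items onto the hardware's SIMD lanes. This is a minimal sketch assuming a SYCL 2020 / DPC++ toolchain and its unified shared memory API; it is not taken from the oneAPI documentation or from dpnp.

    #include <sycl/sycl.hpp>

    int main() {
        sycl::queue q;                                // default device (a GPU if available)
        const size_t n = 1 << 20;
        float *x = sycl::malloc_shared<float>(n, q);  // memory visible to host and device
        float *y = sycl::malloc_shared<float>(n, q);
        for (size_t i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        const float a = 3.0f;
        // Scalar per-work-item kernel; the implementation vectorizes it across SIMD lanes.
        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
            y[i] = a * x[i] + y[i];
        }).wait();

        sycl::free(x, q);
        sycl::free(y, q);
        return 0;
    }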
| astrange wrote:
| > From my understanding, the Khronos group realized OpenCL 2.x was much too complicated, so vendors just weren't implementing it, or were only implementing parts of it, so they came up with OpenCL 3.0, which is slimmed down and much more modular.
|
| Something like this also happened to OpenGL 4.3. It added a compute shader extension which was essentially all of OpenCL again, except different, so you had 2x the implementation work. This is about when some people stopped implementing OpenGL.

| TazeTSchnitzel wrote:
| OpenGL compute shaders are a natural step if you have unified programmable shader cores, and less complicated than adding new pipeline stages for everything (tessellation shaders, geometry shaders, ...).
|
| Khronos could have chosen only to add OpenCL integration, but OpenCL C is a very different language from GLSL, the memory model (among other things) is different, and so on. I don't see why video game developers should be forced to use OpenCL when they want to work with the outputs of OpenGL rendering passes, produce inputs to OpenGL rendering passes, be scheduled in OpenGL, and do things that don't fit neatly into vertex or fragment shaders.

| pjmlp wrote:
| Kind of right.
|
| DPC++ has more stuff than just SYCL; some of it might find its way back into SYCL standardization, some of it might remain Intel-only.
|
| OpenCL 3.0 is basically OpenCL 1.2 with a new name.
|
| Meanwhile people are busy waiting for Vulkan compute to take off. Got to love Khronos standards.

| TazeTSchnitzel wrote:
| Some people are working on being able to run OpenCL kernels on Vulkan: https://github.com/google/clspv

| pjmlp wrote:
| Sure, but will it take off enough to actually matter?
|
| So far I am only aware of Adobe using it to port their shaders to Vulkan on Android.

___________________________________________________________________
(page generated 2021-01-29 23:00 UTC)