[HN Gopher] C-for-Metal: High Performance SIMD Programming on In...
       ___________________________________________________________________
        
       C-for-Metal: High Performance SIMD Programming on Intel GPUs
        
       Author : lelf
       Score  : 53 points
       Date   : 2021-01-29 09:27 UTC (13 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | raphlinus wrote:
       | Another interesting reference from a few years ago:
       | http://www.joshbarczak.com/blog/?p=1028
       | 
       | Also read the followups (1120 and 1197), as they go into
       | considerably more detail about the SPMD programming model and
       | some use cases.
       | 
       | The author is now at Intel working on ray tracing.
        
         | astrange wrote:
         | Intel has a previous SPMD compiler here: https://ispc.github.io
         | 
          | The author seems to have fled Intel soon after releasing it,
          | though, and apparently spent the whole development process
          | terrified that corporate politics would force him to cancel
          | it.
        
       | einpoklum wrote:
       | > The SIMT execution model is commonly used for general GPU
       | development. CUDA and OpenCL developers write scalar code that is
       | implicitly parallelized by compiler and hardware. On Intel GPUs,
       | however, this abstraction has profound performance implications
       | as the underlying ISA is SIMD and important hardware capabilities
       | cannot be fully utilized
       | 
       | What? That makes no sense.
       | 
        | GPU processor cores are basically just SIMD wearing a
        | different-colored hat. The SASS assembly simply has _only_ SIMD
        | instructions - and with the full instruction set being
        | SIMD'ized, it can drop the mention of "this is SIMD" and just
        | pretend individual lanes are instruction-locked threads.
        | 
        | So, an OpenCL compiler would do very similar parallelization on
        | a GPU and on an Intel CPU. (It's obviously not exactly the same,
        | since the instruction sets differ, the widths are not the same,
        | and Intel CPUs support several different widths which can all
        | be in use at the same time, etc.)
       | 
       | So, the hardware capabilities can be utilized just fine.
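        | 
        | As a minimal sketch of that mapping (a hypothetical kernel,
        | with AVX2 standing in for an 8-wide SIMD unit):
        | 
        |   #include <immintrin.h>
        |   
        |   // What a CUDA/OpenCL programmer writes, per "thread" i:
        |   //   c[i] = a[i] * b[i] + k;
        |   // What the hardware runs: one instruction stream, 8 lanes.
        |   void fma_kernel(const float* a, const float* b, float* c,
        |                   float k, int n) {  // assumes n % 8 == 0
        |       __m256 vk = _mm256_set1_ps(k);  // broadcast k to lanes
        |       for (int i = 0; i < n; i += 8) {  // 8 "threads"/step
        |           __m256 va = _mm256_loadu_ps(a + i);
        |           __m256 vb = _mm256_loadu_ps(b + i);
        |           __m256 vc = _mm256_fmadd_ps(va, vb, vk);
        |           _mm256_storeu_ps(c + i, vc);  // all lanes commit
        |       }
        |   }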
        
       | 37ef_ced3 wrote:
       | Domain-specific compilers that generate explicit SIMD code from a
       | high-level specification are even nicer. These can fully exploit
       | the capabilities of the instruction set (e.g., fast permutes,
        | masking, reduced-precision floats, large register file, etc.)
        | for a particular domain.
       | 
       | For example, generating AVX-512 code for convnet inference:
       | https://NN-512.com
       | 
       | NN-512 does four simultaneous 8x8 Winograd tiles (forward and
       | backward) in the large AVX-512 register file, accelerates strided
       | convolutions by interleaving Fourier transforms (again, with
        | knowledge of the large register file), makes heavy use of the
        | two-input VPERMI2PS permutation instruction, generates simplified
       | code with precomputed masks around tensor edges, uses
       | irregular/arbitrary tiling patterns, etc. It generates code like
       | this:
       | 
       | https://nn-512.com/example/11
       | 
        | This kind of compiler can be written for any important domain.
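        | 
        | To illustrate (a hand-written sketch, not actual NN-512
        | output): a two-input permute plus a masked store at a ragged
        | tensor edge looks roughly like this:
        | 
        |   #include <immintrin.h>
        |   
        |   // Interleave the low 8 lanes of two registers, then store
        |   // only the first `remaining` lanes (0..16) at a tensor edge.
        |   void edge_store(const float* lo, const float* hi,
        |                   float* dst, unsigned remaining) {
        |       __m512 a = _mm512_loadu_ps(lo);
        |       __m512 b = _mm512_loadu_ps(hi);
        |       // Index < 16 picks a lane of a, index >= 16 one of b;
        |       // this compiles to vpermi2ps/vpermt2ps.
        |       __m512i idx = _mm512_setr_epi32(
        |           0, 16, 1, 17, 2, 18, 3, 19,
        |           4, 20, 5, 21, 6, 22, 7, 23);
        |       __m512 mixed = _mm512_permutex2var_ps(a, idx, b);
        |       // Precomputed-style edge mask: write only live lanes.
        |       __mmask16 k = (__mmask16)((1u << remaining) - 1u);
        |       _mm512_mask_storeu_ps(dst, k, mixed);
        |   }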
        
         | the_optimist wrote:
         | This is great, but it doesn't address GPUs. If you built it for
         | GPUs, from what I understand, that outcome would basically look
          | like TensorFlow, or maybe TensorFlow XLA. Is that right?
        
           | 37ef_ced3 wrote:
           | My point is that a less general compiler can yield better
           | SIMD code for a particular domain, and be easier to use for a
           | particular domain. And I gave a concrete illustration
            | (NN-512) to support that claim.
           | 
           | Consider NN-512 (less general) versus Halide (more general).
           | Imagine how hard it would be to make Halide generate the
           | programs that NN-512 generates. It would be a very
            | challenging problem.
        
             | the_optimist wrote:
             | Understood: NN-512 is a local optimum in an optimization of
             | hardware and problem structure.
        
       | the_optimist wrote:
        | Compiling from a high-level language to a GPU is a huge
        | problem, and we greatly appreciate efforts to solve it.
       | 
       | If I understand correctly, this (CM) allows for C-style fine-
       | level control over a GPU device as though it were a CPU.
       | 
       | However, it does not appear to address data transit (critical for
        | performance). Compilation and operator fusion to minimize
        | transit are possibly more important. See Graphcore Poplar,
        | TensorFlow XLA, ArrayFire, PyTorch Glow, etc.
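        | 
        | (A toy sketch of why fusion minimizes transit, with
        | illustrative function names: unfused, an intermediate makes a
        | full round trip through memory - on a GPU, two kernel launches;
        | fused, it stays in a register.)
        | 
        |   #include <cmath>
        |   #include <cstddef>
        |   
        |   // Unfused: two passes, t[] written out and read back.
        |   void scale_then_relu(const float* x, float* t, float* y,
        |                        float s, std::size_t n) {
        |       for (std::size_t i = 0; i < n; ++i)
        |           t[i] = s * x[i];               // n floats out
        |       for (std::size_t i = 0; i < n; ++i)
        |           y[i] = std::fmax(t[i], 0.0f);  // n floats back in
        |   }
        |   
        |   // Fused: one pass, the intermediate lives in a register.
        |   void scale_relu_fused(const float* x, float* y,
        |                         float s, std::size_t n) {
        |       for (std::size_t i = 0; i < n; ++i)
        |           y[i] = std::fmax(s * x[i], 0.0f);
        |   }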
       | 
       | Further, this obviously only applies to Intel GPUs, so investing
       | time in utilizing low-level control is possibly a hardware dead-
       | end.
       | 
        | The dream world for programmers is one where data transit and
        | hardware architecture are taken into account without living
        | inside a proprietary DSL. Conversely, it is obviously against
        | hardware manufacturers' interests to create this.
       | 
       | Is MLIR / LLVM going to solve this? This list has been
       | interesting to consider:
       | 
       | https://github.com/merrymercy/awesome-tensor-compilers
        
         | baybal2 wrote:
         | > is possibly a hardware dead-end.
         | 
          | I'm thinking the opposite: there has been an unending
          | succession of accelerators for doing this, each of which was
          | eventually obsoleted and forgotten when general-purpose CPUs
          | caught up to them in performance, or when comp-sci learned
          | how to do the calculations more efficiently on mainstream
          | hardware.
          | 
          | Just from seeing how moribund sales of the new "NPUs" are, I
          | can guess it's already happening.
          | 
          | A number of cellphone brands experimented with them to run
          | selfie filters or do speech recognition, but later found that
          | those workloads run no worse on CPUs when competent
          | programmers are hired, and then threw the NPU hardware out or
          | stopped using it.
        
         | banachtarski wrote:
         | I'm not a hardware engineer, but I am a GPU-focused graphics
         | engineer.
         | 
         | > C-style fine-level control over a GPU device as though it
         | were a CPU.
         | 
         | Personally, I think this is a fool's errand, and this has
         | nothing to do with my desire for job security or anything. When
         | I look at how code in the ML world is written for a GPU for
         | example, it's really easy to see why it's so slow. The CPU and
         | GPU architectures are fundamentally different. Different
         | pipelining architecture, scalar instead of vector, 32/64-wide
         | instruction dispatches, etc. HLSL/GLSL and other such shader
         | languages are perfectly "high level" with other needed
         | intrinsics needed to perform relevant warp level barriers, wave
         | broadcasts/ballots/queries, use LDS storage, execute device
         | level barriers, etc. This isn't to say that high level shader
         | language improvements aren't welcome, but that trying to
         | emulate a CPU is an unfortunate goal.
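          | 
          | (For anyone unfamiliar with those wave ops: here's what a
          | ballot and a broadcast compute, emulated on the CPU for a
          | hypothetical 8-wide wave. A real GPU does each in one
          | instruction.)
          | 
          |   #include <cstddef>
          |   #include <cstdint>
          |   
          |   constexpr std::size_t kWave = 8;  // assumed wave width
          |   
          |   // Ballot: one bit per lane whose predicate is true.
          |   std::uint32_t ballot(const bool (&pred)[kWave]) {
          |       std::uint32_t bits = 0;
          |       for (std::size_t lane = 0; lane < kWave; ++lane)
          |           if (pred[lane]) bits |= 1u << lane;
          |       return bits;
          |   }
          |   
          |   // Broadcast: every lane reads lane `src`'s value.
          |   float broadcast(const float (&val)[kWave],
          |                   std::size_t src) {
          |       return val[src];
          |   }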
        
           | mpweiher wrote:
           | What kinds of improvements would you like to see?
        
       | moonbug wrote:
       | ain't no one gonna use that.
        
       | skavi wrote:
       | Why are Intel GPUs designed in such a way that typical GPU
        | languages don't fully exploit them? Is the new Xe architecture
       | still SIMD?
        
         | dragontamer wrote:
         | OpenCL works on Intel GPUs, while CUDA doesn't because CUDA is
         | an NVidia technology.
         | 
         | > Is the new Xe architecture still SIMD?
         | 
          | SIMD is... pretty much all GPUs do. There are a few scalar
          | bits here and there to speed up if-statements and the like,
          | but the entire point of a GPU is to build a machine for SIMD.
        
           | astrange wrote:
           | GPUs don't need to have SIMD instructions; if you give one a
           | fully scalar program it just needs to run a lot of copies of
           | it at once. Every architecture is different here, including
           | within the same vendor.
        
             | dragontamer wrote:
             | > GPUs don't need to have SIMD instructions;
             | 
              | Except NVidia Ampere (RTX 3xxx series) and AMD RDNA2
              | (Navi / 6xxx series) are both SIMD architectures with
              | SIMD instructions.
              | 
              | And the #3 company, Intel, also has SIMD instructions. I
              | know that some GPUs out there are VLIW or other weird
              | architectures, but... the "big 3" are SIMD-based.
             | 
             | > if you give one a fully scalar program it just needs to
             | run a lot of copies of it at once.
             | 
              | It's emulated on a SIMD processor. That SIMD processor
              | will suffer branch divergence as you traverse
              | if-statements and while-statements, because it's
              | physically SIMD.
              | 
              | The compiler / programming model is scalar, but the
              | assembly instructions are themselves vector. Yeah, NVidia
              | now has per-lane instruction pointers. But that doesn't
              | mean the hardware can physically execute different
              | instructions: the lanes are still all locked together,
              | SIMD-style, at the physical level.
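              | 
              | (Branch divergence in a nutshell, emulated on the CPU
              | for a hypothetical 8-wide group: both sides of the
              | branch execute, and the mask decides which lanes
              | commit.)
              | 
              |   #include <cstddef>
              |   
              |   constexpr std::size_t kW = 8;  // assumed width
              |   
              |   // Lockstep form of:
              |   //   if (x[i] < 0) x[i] = -x[i]; else x[i] *= 2;
              |   void diverge(float (&x)[kW]) {
              |       bool m[kW];
              |       for (std::size_t i = 0; i < kW; ++i)
              |           m[i] = x[i] < 0.0f;
              |       // "then" path: unmasked lanes sit idle
              |       for (std::size_t i = 0; i < kW; ++i)
              |           if (m[i]) x[i] = -x[i];
              |       // "else" path: the other lanes sit idle
              |       for (std::size_t i = 0; i < kW; ++i)
              |           if (!m[i]) x[i] *= 2.0f;
              |   }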
        
             | raphlinus wrote:
             | That's partly true, but there are exceptions, of which the
             | subgroup operations are the most obvious. These are roughly
             | similar to broadcast and permute SIMD instructions, and in
             | some cases can lead to dramatic speedups.
        
           | oivey wrote:
           | OpenCL is basically dead at this point, too. The de facto
           | standard is CUDA and there aren't currently any real
           | challengers. Maybe eventually AMD's ROCm or Intel's oneAPI
           | will get traction.
        
             | pjmlp wrote:
              | For them to get traction, they need to invest in debugger
              | tooling that allows the same productivity as on CPUs, and
              | to help language communities other than C and C++ to
              | target GPGPUs.
              | 
              | NVidia started doing both around CUDA 3.0, whereas
              | Khronos, AMD and Intel only noticed that not everyone
              | wants to do printf()-style debugging in a C dialect once
              | it was too late to win people's attention back.
        
             | profquail wrote:
             | oneAPI uses DPC++ (Data-Parallel C++), which is pretty much
             | just SYCL, which itself is a C++ library on top of OpenCL.
             | 
              | From my understanding, the Khronos group realized OpenCL
              | 2.x was much too complicated, so vendors just weren't
              | implementing it, or were only implementing parts of it.
              | So they came up with OpenCL 3.0, which is slimmed down
              | and much more modular. It's hard to say how much adoption
              | it'll get, but
             | with Intel focused on DPC++ and oneAPI now, there will
             | definitely be more numerical software coming out in the
             | next few years that compiles down to and runs on OpenCL.
             | 
             | For example, Intel engineers are building a numpy clone on
             | top of DPC++, so unlike regular numpy it'll take advantage
             | of multiple CPU cores: https://github.com/IntelPython/dpnp
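              | 
              | (For flavor, a minimal SYCL 2020-style kernel of the
              | kind DPC++ compiles - a hypothetical example, not taken
              | from dpnp.)
              | 
              |   #include <sycl/sycl.hpp>
              |   #include <cstddef>
              |   
              |   int main() {
              |       sycl::queue q;  // default device, GPU if present
              |       const std::size_t n = 1024;
              |       // USM allocation, visible to host and device
              |       float* x = sycl::malloc_shared<float>(n, q);
              |       for (std::size_t i = 0; i < n; ++i)
              |           x[i] = float(i);
              |       // Scalar per-work-item body; the compiler
              |       // vectorizes it for the target device.
              |       q.parallel_for(sycl::range<1>(n),
              |                      [=](sycl::id<1> i) {
              |           x[i] = 2.0f * x[i];
              |       }).wait();
              |       sycl::free(x, q);
              |   }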
        
               | astrange wrote:
               | > From my understanding, the Khronos group realized
               | OpenCL 2.x was much too complicated so vendors just
               | weren't implementing it, or only implementing parts of
               | it, so they came up with OpenCL 3.0 which is slimmed-down
               | and much more modular.
               | 
               | Something like this also happened to OpenGL 4.3. It added
               | a compute shader extension which was essentially all of
               | OpenCL again, except different, so you had 2x the
               | implementation work. This is about when some people
               | stopped implementing OpenGL.
        
               | TazeTSchnitzel wrote:
               | OpenGL compute shaders are a natural step if you have
               | unified programmable shader cores, and less complicated
               | than adding new pipeline stages for everything
               | (tessellation shaders, geometry shaders, ...).
               | 
               | Khronos could have chosen only to add OpenCL integration,
                | but OpenCL C is a very different language from GLSL,
                | the memory model (among other things) is different, and
                | so on. I don't see why video game developers should be
                | forced to use OpenCL when they want to work with the
                | outputs of OpenGL rendering passes, produce inputs to
                | OpenGL rendering passes, schedule work in OpenGL, and
                | do things that don't fit neatly into vertex or fragment
                | shaders.
        
               | pjmlp wrote:
               | Kind of right.
               | 
                | DPC++ has more stuff than just SYCL; some of it might
                | find its way back into SYCL standardization, some of it
                | might remain Intel-only.
               | 
               | OpenCL 3.0 is basically OpenCL 1.2 with a new name.
               | 
                | Meanwhile, people are busy waiting for Vulkan compute
                | to take off; got to love Khronos standards.
        
               | TazeTSchnitzel wrote:
               | Some people are working on being able to run OpenCL
               | kernels on Vulkan: https://github.com/google/clspv
        
               | pjmlp wrote:
                | Sure, but will it take off enough to actually matter?
               | 
               | So far I am only aware of Adobe using it to port their
               | shaders to Vulkan on Android.
        
       ___________________________________________________________________
       (page generated 2021-01-29 23:00 UTC)