[HN Gopher] Triton: Open-Source GPU Programming for Neural Networks
       ___________________________________________________________________
        
       Triton: Open-Source GPU Programming for Neural Networks
        
       Author : ag8
       Score  : 143 points
       Date   : 2021-07-28 16:18 UTC (2 hours ago)
        
 (HTM) web link (www.openai.com)
 (TXT) w3m dump (www.openai.com)
        
       | thebruce87m wrote:
        | Unfortunate name clash with NVIDIA's Triton Inference Server:
       | https://developer.nvidia.com/nvidia-triton-inference-server
        
         | [deleted]
        
         | polynomial wrote:
         | My first thought exactly. This will cause nothing but confusion
         | and Triton (the inference server) is well integrated into the
         | space.
         | 
         | So it's especially weird to see it coming from OpenAI, and not
         | a more random startup. It honestly makes no sense they would
         | deliberately do this, unless there is some secret cult of
         | Triton that is going on in the Bay Area world of AI/ML.
        
         | 6gvONxR4sf7o wrote:
          | The author commented on reddit about that:
          | https://www.reddit.com/r/MachineLearning/comments/otdpkx/n_i...
         | 
         | > PS: The name Triton was coined in mid-2019 when I released my
         | PhD paper on the subject
          | (http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tille...).
          | I chose not to rename the project when the Triton
         | inference server came out a year later since it's the only
         | thing that ties my helpful PhD advisors to the project.
        
       | shubuZ wrote:
        | I have found that writing CUDA code is much simpler than
        | writing correct multi-threaded AVX2/AVX-512 code.
        
         | dragontamer wrote:
         | If you need CPU-side SIMD, then try ispc:
         | https://ispc.github.io/
         | 
          | It's pretty much the OpenCL model, except it compiles into
          | AVX2 / AVX-512 code. Very similar to CUDA / OpenCL style
          | programming. It's not single-source like CUDA, but it
          | largely accomplishes the same programming model IMO.
        
           | gnufx wrote:
            | Why not a standard? OpenMP covers pretty much all of
            | C(++) and Fortran, and has offload support inspired by
            | the needs of the Sierra supercomputer.
        
       | mvanaltvorst wrote:
       | As a sidenote, these SVG graphs are absolutely beautiful. The
       | flow diagram is even responsive on mobile!
        
       | riyadparvez wrote:
       | I am confused. Is it another competitor of Tensorflow, JAX, and
       | Pytorch? Or something else?
        
         | peytoncasper wrote:
         | I believe this is more of an optimization layer to be utilized
         | by libraries like Tensorflow and JAX. More of a simplification
         | of the interaction with traditional CUDA instructions.
         | 
         | I imagine these libraries and possibly some users would
         | implement libraries on top of this language and reap some of
         | the optimization benefit without having to maintain low-level
         | CUDA specific code.
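The blocked programming model the comment alludes to can be sketched in plain Python. This is purely illustrative: the names below are made up, and Triton's real kernels use the `triton.language` API and execute on the GPU, but the structure (one program instance per block, with masking at the edges) is the same.

```python
# Plain-Python sketch of a blocked ("tiled") kernel launch, in the
# style Triton exposes. Each "program instance" handles one block of
# elements; out-of-range indices are masked off. Hypothetical names.
BLOCK = 4

def add_kernel(x, y, out, pid, n):
    # pid plays the role of the program id; the range plays the role
    # of the block's offsets, and the bounds check is the mask
    start = pid * BLOCK
    for i in range(start, start + BLOCK):
        if i < n:  # mask: skip offsets past the end of the array
            out[i] = x[i] + y[i]

def launch(x, y):
    n = len(x)
    out = [0] * n
    grid = (n + BLOCK - 1) // BLOCK   # number of program instances
    for pid in range(grid):           # on a GPU these run in parallel
        add_kernel(x, y, out, pid, n)
    return out
```

The point of the abstraction is that the user reasons about blocks and masks, while the compiler handles the per-thread details that raw CUDA would force you to write out.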
        
           | blueblisters wrote:
           | So is this similar to XLA?
        
             | jpf0 wrote:
             | XLA is domain-specific compiler for linear algebra. Triton
             | generates and compiles an intermediate representation for
             | tiled computation. This IR allows more general functions
             | and also claims higher performance.
             | 
             | obligatory reference to the family of work:
             | https://github.com/merrymercy/awesome-tensor-compilers
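To make "tiled computation" concrete, here is a toy pure-Python matrix multiply expressed tile-by-tile, the granularity at which an IR like Triton's reasons about programs. This is a generic illustration of the technique, not Triton's IR or code.

```python
# Tiled matrix multiply: C is accumulated one T x T tile at a time,
# which is the unit a tiling compiler schedules onto GPU hardware.
T = 2  # tile size (illustrative)

def matmul_tiled(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, m, T):
            for k0 in range(0, k, T):
                # accumulate one tile of C from tiles of A and B;
                # min() masks the ragged edges when sizes aren't
                # multiples of T
                for i in range(i0, min(i0 + T, n)):
                    for j in range(j0, min(j0 + T, m)):
                        for kk in range(k0, min(k0 + T, k)):
                            C[i][j] += A[i][kk] * B[kk][j]
    return C
```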
        
       | giacaglia wrote:
       | OpenAI keeps innovating. Amazing to see the speed of execution of
       | the team
        
         | ipsum2 wrote:
         | This guy developed Triton for his PhD thesis, and OpenAI hired
         | him to continue working on it. Doesn't really seem fair to give
         | all the innovation credit to OpenAI.
         | 
         | See:
         | https://www.reddit.com/r/MachineLearning/comments/otdpkx/n_i...
        
       | notthedroids wrote:
       | Does Triton support automatic differentiation? I don't see that
       | feature in a quick poke through the docs.
       | 
       | If it does compile to LLVM, I suppose it can use Enzyme
       | https://enzyme.mit.edu/
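For readers unfamiliar with what automatic differentiation would buy here, a minimal forward-mode sketch using dual numbers shows the idea. This is a generic textbook construction, not Triton's or Enzyme's machinery.

```python
# Forward-mode automatic differentiation with dual numbers: carry the
# value and its derivative through each arithmetic operation.
class Dual:
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps  # eps holds the derivative

    def __add__(self, other):
        return Dual(self.val + other.val, self.eps + other.eps)

    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.eps * other.val + self.val * other.eps)

def f(x):
    # f(x) = x^2 + x, so f'(x) = 2x + 1
    return x * x + x

d = f(Dual(3.0, 1.0))  # seed dx/dx = 1
# d.val is f(3) = 12.0; d.eps is f'(3) = 7.0
```

Enzyme does this at the LLVM IR level instead of via operator overloading, which is why compiling through LLVM would make it applicable.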
        
       | croes wrote:
        | Too bad it's CUDA. Sooner or later this will become a
        | problem, because you are depending on the benevolence of a
        | single manufacturer.
        
         | shubuZ wrote:
          | Which other hardware vendor provides the level of
          | performance that Nvidia's GPUs provide? Wasn't that
          | dependence on a single (or a couple of) manufacturer(s)
          | just as true from the '90s through the 2020s?
        
           | hhh wrote:
           | Tenstorrent.
        
           | croes wrote:
            | It's not about performance but open standards. Remember
            | Oracle vs. Google; at some point in the future NVidia
            | could decide to extract money from CUDA.
        
           | dragontamer wrote:
           | AMD's MI100 is slightly faster than NVidia A100 for double-
           | precision FLOPs at slightly lower costs. Good enough for Oak
           | Ridge National Labs (Frontier Supercomputer), to say the
           | least.
           | 
           | NVidia is faster at 4x4 16-bit matrix multiplications (common
           | in Tensor / Deep Learning stuff), but MI100 still has 4x4
           | 16-bit matrix multiplication instructions and acceleration.
            | It's not far behind, and the greater 64-bit FLOPs are
            | enough to win in scientific fields.
        
         | dragontamer wrote:
         | AMD's ROCm 4.0 now supports cooperative groups, which is
         | probably one of the last major holdouts for CUDA compatibility.
         | 
         | There's still the 64-wavefront (for AMD CDNA cards) instead of
         | 32-wavefronts (for CUDA). But AMD even has 4x4 half-float
         | matrix multiplication instructions in ROCm (for MI100, the only
         | card that supports the matrix-multiplication / tensor
         | instructions)
         | 
         | ---------
         | 
         | I think CUDA vs OpenCL is over. ROCm from AMD has its
         | restrictions, but... it really is easier to program than
          | OpenCL. It's a superior model: having a single language that
         | supports both CPU and GPU code is just easier than switching
         | between C++ and OpenCL (where data-structures can't be shared
         | as easily).
         | 
         | -----------
         | 
         | The main issue with AMD is that they're cutting support for
         | their older cards. The cheapest card you can get that supports
         | ROCm is Vega56 now... otherwise you're basically expected to go
         | for the expensive MI-line (MI50, MI100).
        
           | meragrin_ wrote:
           | > The main issue with AMD is that they're cutting support for
           | their older cards.
           | 
           | No, their main issue is not properly supporting ROCm in
           | general. No Windows support at all? It still feels like they
           | don't know whether they want to continue investing in ROCm
           | long term.
        
             | dragontamer wrote:
             | I'd assume that they're gonna support ROCm as long as the
              | Frontier deployment at Oak Ridge National Labs is up.
              | ORNL isn't exactly a customer you want to piss off.
        
           | zozbot234 wrote:
           | The new contest is not CUDA vs. OpenCL but CUDA vs. Vulkan
           | Compute. As support for Vulkan in hardware becomes more
           | widespread, it makes more and more sense to just standardize
           | on it for all workloads. The programming model is quite
           | different between the two (kernels vs. shaders) and OpenCL
           | 2.x has quite a few features that are not in Vulkan, but the
           | latest version of OpenCL has downgraded many of these to
           | extensions.
        
             | dragontamer wrote:
             | CUDA programmers choose CUDA because when you make a struct
             | FooBar{}; in CUDA, it works on both CPU-side and GPU-side.
             | 
              | Vulkan / OpenCL / etc. don't have any data-structure
              | sharing like that with the host code. It's a point of
              | contention that makes anything more complicated than a
              | 3-dimensional array hard to share.
              | 
              | Yeah, Vulkan / OpenCL have all sorts of pointer-sharing
              | arrangements (Shared Virtual Memory) or whatnot. But
              | it's difficult to use in practice, because they keep
              | the concepts of "GPU" code separate from "CPU" code.
             | 
             | ---------
             | 
              | When you look at things such as Triton, you see that
              | people want to unify the CPU and GPU code into a
              | single code base. Look at these Triton examples:
              | they're just Python code, inside the rest of the
              | CPU-side Python code.
              | 
              | I think people are realizing that high-level code can
              | flow between the two execution units (CPU or GPU)
              | without changing the high-level language. The compiler
              | works hard to generate code for both systems, but it's
              | better for the compiler to do that work than for the
              | programmer to do the integration by hand.
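The single-source point can be sketched in one short Python file: the same data structure is visible to the "host" driver code and to the "kernel" function, with no separate shader language and no duplicated struct definition. All names here are hypothetical, and in CUDA or Triton the kernel body would actually be compiled for the device.

```python
# Sketch of single-source CPU/GPU programming: one shared type
# definition, used by both the driver loop and the kernel function.
from dataclasses import dataclass

@dataclass
class Particle:      # defined exactly once, visible to both "sides"
    x: float         # position
    v: float         # velocity

def step_kernel(p: Particle, dt: float) -> Particle:
    # in a single-source model this body is compiled for the device,
    # yet it reads the very same Particle definition as the host
    return Particle(p.x + p.v * dt, p.v)

def host_driver(particles, dt):
    # host-side loop; on a GPU this would be a parallel kernel launch
    return [step_kernel(p, dt) for p in particles]
```

In split-source models (OpenCL C strings, GLSL shaders), `Particle` would have to be declared twice and kept in sync by hand, which is the friction the comment describes.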
        
         | yumraj wrote:
         | Sure, but if this abstraction layer becomes popular, then it
         | becomes much easier to support other GPUs without requiring
         | client libraries to change, which is a much harder problem.
        
           | dragontamer wrote:
           | A big reason why CUDA is popular with compilers is that the
           | PTX assembly-ish language is well documented and reasonable.
           | 
           | Compilers generate PTX, then the rest of the CUDA
           | infrastructure turns PTX into Turing machine code, or Ampere
           | machine code, or Pascal machine code.
           | 
            | In theory, SPIR-V should do the same job, but it's just
            | not as usable right now. In the meantime, getting it to
            | work on PTX is easier, and then there's probably hope
            | (in the far future) to move to SPIR-V if that ever
            | actually takes off.
           | 
           | I'm not a developer on Triton, but that'd be my expectation.
        
       | boulos wrote:
       | Folks might find the author's research paper [1] while at Harvard
       | more informative. This is a great high-level description, but if
       | you want more detail, I recommend the paper.
       | 
       | [1] https://dl.acm.org/doi/abs/10.1145/3315508.3329973
        
       | xmaayy wrote:
        | I wonder if this can be used for graphics programming.
        | Shaders are notoriously hard to write correctly, and this
        | seems like it might provide an easier gateway than GLSL.
        
       ___________________________________________________________________
       (page generated 2021-07-28 19:00 UTC)