[HN Gopher] Triton: Open-Source GPU Programming for Neural Networks
___________________________________________________________________
 
 Triton: Open-Source GPU Programming for Neural Networks
 
 Author : ag8
 Score  : 143 points
 Date   : 2021-07-28 16:18 UTC (2 hours ago)
 
 (HTM) web link (www.openai.com)
 (TXT) w3m dump (www.openai.com)
 
 | thebruce87m wrote:
 | Unfortunate name clash with NVIDIA's Triton Inference Server:
 | https://developer.nvidia.com/nvidia-triton-inference-server
 
 | [deleted]
 
 | polynomial wrote:
 | My first thought exactly. This will cause nothing but confusion,
 | and Triton (the inference server) is well integrated into the
 | space.
 |
 | So it's especially weird to see it coming from OpenAI rather than
 | some random startup. It honestly makes no sense that they would
 | deliberately do this, unless there is some secret cult of Triton
 | going on in the Bay Area AI/ML world.
 
 | 6gvONxR4sf7o wrote:
 | The author commented on Reddit about that
 | (https://www.reddit.com/r/MachineLearning/comments/otdpkx/n_i...)
 |
 | > PS: The name Triton was coined in mid-2019 when I released my
 | PhD paper on the subject
 | (http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tille...).
 | I chose not to rename the project when the Triton inference
 | server came out a year later, since it's the only thing that ties
 | my helpful PhD advisors to the project.
 
 | shubuZ wrote:
 | I have found that writing CUDA code is much simpler than writing
 | correct multi-threaded AVX2/AVX-512 code.
 
 | dragontamer wrote:
 | If you need CPU-side SIMD, try ispc: https://ispc.github.io/
 |
 | It's pretty much the OpenCL model, except it compiles into AVX2 /
 | AVX-512 code. Very similar to CUDA / OpenCL style programming.
 | It's not single-source like CUDA, but it largely delivers the
 | same programming model IMO.
 
 | gnufx wrote:
 | Why not a standard? OpenMP covers pretty much all of C(++) and
 | Fortran, and has offload support inspired by the needs of the
 | Sierra supercomputer.
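The SPMD ("single program, multiple data") model that dragontamer credits to CUDA, OpenCL, and ispc can be sketched in plain Python: the programmer writes scalar code for a single lane, and a launcher maps it over the whole index space. This is a hedged illustration of the model only, not ispc itself; all names below are made up for the sketch.

```python
# SPMD sketch: write the body for ONE lane; a launcher runs every lane.
# In ispc or CUDA the "launcher" is the compiler/runtime, which packs
# lanes into AVX registers or GPU warps instead of looping in Python.

def saxpy_lane(i, a, x, y, out):
    """Scalar kernel body, written as if for a single lane/thread."""
    out[i] = a * x[i] + y[i]

def launch(kernel, n, *args):
    """Stand-in for the vectorizing compiler: run the body per lane."""
    for i in range(n):
        kernel(i, *args)

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * 4
launch(saxpy_lane, 4, 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

The appeal shubuZ alludes to is exactly this: the per-lane body contains no explicit vector widths or intrinsics, which is what makes it simpler to get right than hand-written AVX2/AVX-512 code.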
 | mvanaltvorst wrote:
 | As a side note, these SVG graphs are absolutely beautiful. The
 | flow diagram is even responsive on mobile!
 
 | riyadparvez wrote:
 | I am confused. Is it another competitor to Tensorflow, JAX, and
 | Pytorch? Or something else?
 
 | peytoncasper wrote:
 | I believe this is more of an optimization layer to be utilized by
 | libraries like Tensorflow and JAX: a simplification of how you
 | interact with traditional CUDA instructions.
 |
 | I imagine these libraries, and possibly some users, would build
 | on top of this language and reap some of the optimization
 | benefits without having to maintain low-level, CUDA-specific
 | code.
 
 | blueblisters wrote:
 | So is this similar to XLA?
 
 | jpf0 wrote:
 | XLA is a domain-specific compiler for linear algebra. Triton
 | generates and compiles an intermediate representation for tiled
 | computation. This IR allows more general functions and also
 | claims higher performance.
 |
 | Obligatory reference to the family of work:
 | https://github.com/merrymercy/awesome-tensor-compilers
 
 | giacaglia wrote:
 | OpenAI keeps innovating. Amazing to see the speed of execution of
 | the team.
 
 | ipsum2 wrote:
 | This guy developed Triton for his PhD thesis, and OpenAI hired
 | him to continue working on it. Doesn't really seem fair to give
 | all the innovation credit to OpenAI.
 |
 | See:
 | https://www.reddit.com/r/MachineLearning/comments/otdpkx/n_i...
 
 | notthedroids wrote:
 | Does Triton support automatic differentiation? I don't see that
 | feature in a quick poke through the docs.
 |
 | If it does compile to LLVM, I suppose it could use Enzyme:
 | https://enzyme.mit.edu/
 
 | croes wrote:
 | Too bad it's CUDA. Sooner or later this will become a problem,
 | because you are depending on the benevolence of a single
 | manufacturer.
 
 | shubuZ wrote:
 | Which other hardware vendor provides the level of performance
 | that Nvidia's GPUs provide? Wasn't dependence on the benevolence
 | of a single manufacturer (or a couple of them) just as true in
 | the 90s, and in the 2020s?
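The "tiled computation" jpf0 mentions can be illustrated in pure Python: the work is decomposed into small blocks (tiles) of the operands, which is what lets a compiler map each tile onto fast on-chip memory and tensor-core-style instructions. This is a sketch of the tiling concept only; a real Triton kernel would express just the per-tile program and leave the loops over tiles to the launch grid.

```python
# Blocked (tiled) matrix multiply: C = A @ B, computed one tile at a time.
# The three outer loops walk over tiles; the three inner loops do the work
# for a single tile. On a GPU, each (i0, j0) tile would be one "program".

def matmul_tiled(A, B, n, tile=2):
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):              # output-tile rows
        for j0 in range(0, n, tile):          # output-tile cols
            for k0 in range(0, n, tile):      # accumulate tile products
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        for k in range(k0, min(k0 + tile, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_tiled(A, B, 2))  # [[19.0, 22.0], [43.0, 50.0]]
```

The result is identical to an untiled multiply; only the iteration order changes, which is exactly the degree of freedom a tile-based IR gives the compiler.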
 | hhh wrote:
 | Tenstorrent.
 
 | croes wrote:
 | It's not about performance but open standards. Remember Oracle
 | vs. Google: at some point in the future, Nvidia could decide to
 | extract money from CUDA.
 
 | dragontamer wrote:
 | AMD's MI100 is slightly faster than Nvidia's A100 for double-
 | precision FLOPs, at a slightly lower cost. Good enough for Oak
 | Ridge National Labs (the Frontier supercomputer), to say the
 | least.
 |
 | Nvidia is faster at 4x4 16-bit matrix multiplications (common in
 | Tensor / deep learning stuff), but the MI100 still has 4x4 16-bit
 | matrix multiplication instructions and acceleration. It's not far
 | behind, and the greater 64-bit FLOPs are enough to win in
 | scientific fields.
 
 | dragontamer wrote:
 | AMD's ROCm 4.0 now supports cooperative groups, which was
 | probably one of the last major holdouts for CUDA compatibility.
 |
 | There's still the 64-wide wavefront (for AMD CDNA cards) instead
 | of the 32-wide warp (for CUDA). But AMD even has 4x4 half-float
 | matrix multiplication instructions in ROCm (for the MI100, the
 | only card that supports the matrix-multiplication / tensor
 | instructions).
 |
 | ---------
 |
 | I think CUDA vs. OpenCL is over. ROCm from AMD has its
 | restrictions, but... it really is easier to program than OpenCL.
 | It's a superior model: having a single language that supports
 | both CPU and GPU code is just easier than switching between C++
 | and OpenCL (where data structures can't be shared as easily).
 |
 | -----------
 |
 | The main issue with AMD is that they're cutting support for their
 | older cards. The cheapest card you can get that supports ROCm is
 | the Vega 56 now... otherwise you're basically expected to go for
 | the expensive MI line (MI50, MI100).
 
 | meragrin_ wrote:
 | > The main issue with AMD is that they're cutting support for
 | their older cards.
 |
 | No, their main issue is not properly supporting ROCm in general.
 | No Windows support at all? It still feels like they don't know
 | whether they want to continue investing in ROCm long term.
 | dragontamer wrote:
 | I'd assume that they're gonna support ROCm as long as the
 | Frontier deployment at Oak Ridge National Labs is up. ORNL isn't
 | exactly a customer you want to piss off.
 
 | zozbot234 wrote:
 | The new contest is not CUDA vs. OpenCL but CUDA vs. Vulkan
 | Compute. As hardware support for Vulkan becomes more widespread,
 | it makes more and more sense to just standardize on it for all
 | workloads. The programming model is quite different between the
 | two (kernels vs. shaders), and OpenCL 2.x has quite a few
 | features that are not in Vulkan, but the latest version of OpenCL
 | has downgraded many of these to extensions.
 
 | dragontamer wrote:
 | CUDA programmers choose CUDA because when you write a struct
 | FooBar{}; in CUDA, it works on both the CPU side and the GPU
 | side.
 |
 | Vulkan / OpenCL / etc. don't have any data-structure sharing like
 | that with the host code. It's a point of contention that makes
 | anything more complicated than a 3-dimensional array hard to
 | share.
 |
 | Yeah, Vulkan / OpenCL have all sorts of pointer-sharing
 | arrangements (Shared Virtual Memory) or whatnot. But they're
 | difficult to use in practice, because they keep the concepts of
 | "GPU" code and "CPU" code separate.
 |
 | ---------
 |
 | When you look at things such as Triton, you see that people want
 | to unify the CPU and GPU code into a single code base. Look at
 | and read these Triton examples: they're just Python code, inside
 | the rest of the CPU-side Python code.
 |
 | I think people are realizing that high-level code can flow
 | between the two execution units (CPU or GPU) without changing the
 | high-level language. The compiler works hard to generate code for
 | both systems, but it's better for the compiler to do that work
 | than for the programmer to do the integration by hand.
 
 | yumraj wrote:
 | Sure, but if this abstraction layer becomes popular, then it
 | becomes much easier to support other GPUs without requiring
 | client libraries to change, which is a much harder problem.
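The "single source" property dragontamer describes can be sketched in plain Python: one data structure and one kernel body are visible to both the host code and the device code, and only the launcher decides where the body runs. The "gpu" path below is simulated, and all names are illustrative; this is a sketch of the idea, not Triton's actual API.

```python
# Single-source sketch: the same dataclass and the same kernel body are
# used by host code and by "device" code -- no separate shader language.
from dataclasses import dataclass

@dataclass
class Params:               # shared between host setup and kernel code
    scale: float
    offset: float

def kernel(i, p, data, out):
    """One kernel body, written once, usable on either execution unit."""
    out[i] = p.scale * data[i] + p.offset

def launch(kernel, n, backend, *args):
    if backend == "cpu":
        for i in range(n):
            kernel(i, *args)
    else:
        # A real runtime (CUDA, or Triton's JIT) would compile this same
        # body for the GPU; here we just simulate device execution.
        for i in range(n):
            kernel(i, *args)

p = Params(scale=3.0, offset=1.0)
data = [0.0, 1.0, 2.0]
out = [0.0] * 3
launch(kernel, 3, "cpu", p, data, out)
print(out)  # [1.0, 4.0, 7.0]
```

The point of the sketch is that `Params` is defined exactly once: in OpenCL or Vulkan, the host-side struct and the kernel-side struct are written in two languages and must be kept in sync by hand.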
 | dragontamer wrote:
 | A big reason why CUDA is popular with compiler writers is that
 | the PTX assembly-ish language is well documented and reasonable.
 |
 | Compilers generate PTX, and then the rest of the CUDA
 | infrastructure turns the PTX into Turing machine code, or Ampere
 | machine code, or Pascal machine code.
 |
 | In theory, SPIR-V should do the same job, but it's just not as
 | usable right now. In the meantime, getting things to work on PTX
 | is easier, and there's hope (in the far future) of moving to
 | SPIR-V if that ever actually takes off.
 |
 | I'm not a developer on Triton, but that'd be my expectation.
 
 | boulos wrote:
 | Folks might find the author's research paper [1], written while
 | at Harvard, more informative. This is a great high-level
 | description, but if you want more detail, I recommend the paper.
 |
 | [1] https://dl.acm.org/doi/abs/10.1145/3315508.3329973
 
 | xmaayy wrote:
 | I wonder if this can be used for graphics programming. Shaders
 | are notoriously hard to write correctly, and this seems like it
 | might provide an easier gateway than GLSL.
___________________________________________________________________
(page generated 2021-07-28 19:00 UTC)