[HN Gopher] Unifying the CUDA Python Ecosystem
       Unifying the CUDA Python Ecosystem
       Author : pjmlp
       Score  : 108 points
       Date   : 2021-04-16 14:44 UTC (7 hours ago)
 (HTM) web link (developer.nvidia.com)
 (TXT) w3m dump (developer.nvidia.com)
       | andrew_v4 wrote:
       | Just for contrast it's interesting to look at an example of
       | writing a similar kernel in Julia:
       | https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/
       | I don't think it's possible to achieve something like this in
       | python because of how it's interpreted (but it sounds a bit like
       | what another comment mentioned where the python was compiled to
       | C)
         | rrss wrote:
         | I think the contrast is probably less about the language, and
         | more about the scope and objective of the projects. the blog is
         | describing low-level interfaces in python - probably more
         | comparable is the old CUDAdrv.jl package (now merged into
         | CUDA.jl):
         | https://github.com/JuliaGPU/CUDAdrv.jl/blob/master/examples/...
         | here is writing a similar kernel in python with numba:
         | https://github.com/ContinuumIO/gtc2017-numba/blob/master/4%2...
           | jjoonathan wrote:
           | I gave numba CUDA a spin in late 2018 and was severely
           | disappointed. It didn't work out of the box, I had to tweak
           | the source to remove a reference to an API that had been
           | removed from CUDA more than a year prior (and deprecated long
           | ago). Then I ran into a bug when converting a float array to
           | a double array -- I had to declare the types three different
           | times and it still did a naive byte-copy rather than a
           | conversion. Thanks to a background in numerics, the symptoms
           | were obvious, but yikes. The problem that finally did us in
           | was an inability to get buffers to correctly pass between
           | kernels without a CPU copy, which was absolutely critical for
           | our perf. I think this was supported in theory but just
           | didn't work.
           | In any case, we did a complete rewrite in CUDA proper in less
           | time than we spent banging our heads against that last numba-
           | CUDA issue.
           | Under every language bridge there are trolls and numba-CUDA
           | had some mean ones. Hopefully things have gotten better but
           | I'm definitely still inside the "once bitten twice shy"
           | period.
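             | The float-to-double symptom described above (a byte copy where
             | a value conversion was expected) can be reproduced on the CPU.
             | A minimal sketch using only Python's standard-library array
             | module, not numba itself; the variable names are illustrative:

```python
from array import array

# Four single-precision values, 4 bytes each (16 bytes total).
floats = array("f", [1.0, 2.0, 3.0, 4.0])

# Naive byte copy: reinterpret the same 16 bytes as 8-byte doubles.
# The element count halves and the values are no longer the originals.
reinterpreted = array("d")
reinterpreted.frombytes(floats.tobytes())

# Proper conversion: each float32 value is widened to a float64 value.
converted = array("d", floats.tolist())

print(len(reinterpreted), list(reinterpreted))
print(len(converted), list(converted))
```

             | The halved length and the changed values are the kind of
             | symptom that points to a reinterpretation rather than a
             | conversion.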
         | machineko wrote:
         | Every time there is a topic about Python, there is this one
         | Julia guy who spams a Julia "alternative" to the Python solution
         | in every topic. Can you guys just stop? It kinda feels like
         | watching a cult.
         | anon_tor_12345 wrote:
         | i mentioned this in the response to the other comment but
         | straight compilation is exactly what numba does for CUDA
         | support because, just like Julia, numba uses llvm as a
         | middleend (and llvm has a ptx backend).
         | albertzeyer wrote:
         | JAX and TensorFlow functions both would convert some Python
         | code to equivalent XLA code or a TF graph.
         | jjoonathan wrote:
         | > Julia has first-class support for GPU programming
         | "First-class" is a steep claim. Does it support the nvidia perf
         | tools? Those are very important for taking a kernel from (in my
         | experience) ~20% theoretical perf to ~90% theoretical perf.
           | maleadt wrote:
           | Yeah, see this section of the documentation:
           | https://juliagpu.gitlab.io/CUDA.jl/development/profiling/.
           | CUDA.jl also supports NVTX, wraps CUPTI, etc. The full extent
           | of the APIs and tools is available.
           | Source line association when using PC sampling is currently
           | broken due to a bug in the NVIDIA drivers though (segfaulting
           | when parsing the PTX debug info emitted by LLVM), but I'm
           | told that may be fixed in the next driver.
             | jjoonathan wrote:
             | Nice! I set a reminder to check back in a month.
           | klmadfejno wrote:
           | https://developer.nvidia.com/blog/gpu-computing-julia-
           | progra...
             | jjoonathan wrote:
             | > CUDAnative.jl also [...] generates the necessary line
             | number information for the NVIDIA Visual Profiler to work
             | as expected
             | That sounds very promising, but these tools are usually
             | magnificent screenshot fodder yet they are conspicuously
             | absent from the screenshots so I still have suspicions.
             | Maybe I'll give it a try tonight and report back.
               | maleadt wrote:
               | Here's a screenshot:
               | https://julialang.org/assets/blog/nvvp.png. Or a recent
               | PR when you can see NVTX ranges from Julia:
               | https://github.com/JuliaGPU/CUDA.jl/pull/760
               | jjoonathan wrote:
               | Thanks! Now I believe! :)
       | SloopJon wrote:
       | I thought for sure that someone would have posted a link to that
       | xkcd comic by now. I only dabble with higher-level APIs, so I
       | can't judge this on the merits. If NVIDIA really continues to
       | back this, and follows through on wrapping other libraries like
       | cuDNN, it could be a whole new level of vendor lock-in as people
       | start writing code that targets CUDA Python. I think the real
       | test will be whether one of the big projects like PyTorch or
       | TensorFlow gets on board.
         | Ivoah wrote:
         | https://xkcd.com/927/
         | https://xkcd.com/1987/
       | michelpp wrote:
       | About 8 years ago an NVIDIA developer released a tool called
       | Copperhead that let you write CUDA kernels in straight Python
       | that were then compiled to C, no "C-in-a-string" like is shown
       | here. I always thought it was so elegant and had great potential,
       | and I introduced a lot of people in my circle to it, but then it
       | seems NVIDIA buried it.
       | This blog post is great, and we need these kinds of tools for
       | sure, but we also need high level expressibility that doesn't
       | require writing kernels in C. I know there are other projects
       | that have taken up that cause, but it would be great to see
       | NVIDIA double down on something like Copperhead.
         | ZeroCool2u wrote:
         | Totally agree, Copperhead looks much easier to use. Perhaps one
         | of the reasons they went and rebuilt from scratch is because
         | Copperhead relies on Thrust and a couple other dependencies?
         | anon_tor_12345 wrote:
         | that project might be abandoned but this strategy is used in
         | nvidia and nvidia adjacent projects (through llvm):
         | https://github.com/rapidsai/cudf/blob/branch-0.20/python/cud...
         | https://github.com/gmarkall/numba/blob/master/numba/cuda/com...
         | >but we also need high level expressibility that doesn't
         | require writing kernels in C
         | the above are possible because C is actually just a frontend to
         | PTX
         | https://docs.nvidia.com/cuda/parallel-thread-execution/index...
         | fundamentally you are never going to be able to write cuda
         | kernels without thinking about cuda architecture, any more than
         | you'll ever be able to write async code without thinking about
         | concurrency.
         | albertzeyer wrote:
         | Oh that sounds interesting. Do you know what happened to it?
         | I think I found it here:
         | https://github.com/bryancatanzaro/copperhead
         | But I'm not sure what the state is. Looks dead (last commit 8
         | years ago). Probably just a proof of concept. But why hasn't
         | this been continued?
         | Blog post and example:
         | https://developer.nvidia.com/blog/copperhead-data-parallel-p...
         | https://github.com/bryancatanzaro/copperhead/blob/master/sam...
         | Btw, for compiling on-the-fly from a string, I made something
         | similar for our RETURNN project. Example for LSTM:
         | https://github.com/rwth-i6/returnn/blob/a5eaa4ab1bfd5f157628...
         | This is made in a way that it compiles automatically into an op
         | for Theano or TensorFlow (PyTorch could easily be added as
         | well) and for both CPU and CUDA/GPU.
           | dwrodri wrote:
           | I don't know specifics about Copperhead in particular, but
           | Bryan Catanzaro (creator of Copperhead) is now the VP of
           | Applied Deep Learning Research at Nvidia. He gave a talk at
           | GTC this year, which is how I heard about all of this in the
           | first place.
           | Source: https://www.linkedin.com/in/bryancatanzaro/
         | BiteCode_dev wrote:
         | In the IP world, there are some hidden gems that disappear
         | without a trace one day.
         | I worked for a client that had this wonderful Python DSL that
         | compiled to Verilog and VHDL. It was much easier to use than
         | writing the stuff the old way. Much more composable too, not to
         | mention tooling.
         | They created that by forking an open source project dating back
         | to Python 2.5 that I could never find again.
         | Imagine if that stuff would still be alive today. You could
         | have a market for paid pypi.org instances providing you with
         | pip installable IP components you can compose and customize
         | easily.
         | But in this market, sharing is not really a virtue.
         | eslaught wrote:
         | As it turns out, NVIDIA just open sourced a product called
         | Legate which does not just GPUs but distributed as well. Right
         | now it supports NumPy and Pandas but perhaps they'll add others
         | in the future. Just thought this might be up your alley since
         | it works at a higher level than the glorified CUDA in the
         | article.
         | https://github.com/nv-legate/legate.numpy
         | Disclaimer: I work on the project they used to do the
         | distributed execution, but otherwise have no connection with
         | Legate.
         | Edit: And this library was developed by a team managed by one
         | of the original Copperhead developers, in case you're
         | wondering.
       | nuisance-bear wrote:
       | Tools to make GPU development easier are sorely needed.
       | I foolishly built an options pricing engine on top of PyTorch,
       | thinking "oooh, it's a fast array library that supports CUDA
       | transparently". Only to find out that array indexing is 100x
       | slower than numpy.
         | eslaught wrote:
         | You might be interested in Legate [1]. It supports the NumPy
         | interface as a drop-in replacement, supports GPUs and also
         | distributed machines. And you can see for yourself their
         | performance results; they're not far off from hand-tuned MPI.
         | [1]: https://github.com/nv-legate/legate.numpy
         | Disclaimer: I work on the library Legate uses for distributed
         | computing, but otherwise have no connection.
           | [deleted]
         | TuringNYC wrote:
         | >>> built an options pricing engine on top of PyTorch
         | I'd love to hear more about this! Do you have any posts or
         | write-ups on this?
         | sideshowb wrote:
         | Interesting find about the indexing. I just had the opposite
         | experience, swapped from numpy to torch in a project and got
         | 2000x speedup on some indexing and basic maths wrapped in
         | autodiff. And I haven't moved it onto cuda yet.
           | nuisance-bear wrote:
           | Here's an example that illustrates the phenomenon. If memory
           | serves me right, index latency is superlinear in dimension
             | count.
             |
             |     import time, torch
             |     from itertools import product
             |
             |     N = 100
             |
             |     ten = torch.randn(N,N,N)
             |     arr = ten.numpy()
             |
             |     def indexTimer(val):
             |         start = time.time()
             |         for i,j,k in product(range(N), range(N), range(N)):
             |             x = val[i, j, k]
             |         end = time.time()
             |         print('{:.2f}'.format(end-start))
             |
             |     indexTimer(ten)
             |     indexTimer(arr)
       | rubatuga wrote:
       | Somewhat related, I've tried running compute shaders using wgpu-
       | py:
       | https://github.com/pygfx/wgpu-py
       | You can define any compute shader you like in Python, and
       | annotate it with the data types, and it compiles to SPIR-V and
       | runs under macOS, Linux, and Windows.
       | The_rationalist wrote:
       | Note that you can write CUDA in many languages such as Java,
       | Kotlin, Python, Ruby, JS, R with https://github.com/NVIDIA/grcuda
       | zcw100 wrote:
       | There are a lot of mays, shoulds, and coulds in there.
       | nevi-me wrote:
       | I have a RTX 2070 that's under-utilised, partly because I'm
       | surprisingly finding it hard to understand C, C++ and CUDA by
       | extension.
       | I'm self-taught, and have been using web languages and some
       | python, before learning Rust. I hope that NVIDIA can dedicate
       | some resources to creating high-quality bindings to the C API for
       | Rust, even if in the next 1-2 years.
       | Perhaps being able to use a systems language that's been easy for
       | me coming from TypeScript and Kotlin, could inspire me to take
       | baby steps with CUDA, without worrying about understanding C.
       | I like the CUDA.jl package, and once I make time to learn Julia,
       | I would love to try that out. From this article about the Python
       | library, I'm still left knowing very little about "how can I
       | parallelise this function".
         | jkelleyrtp wrote:
         | +1 Would love to see official support for CUDA for Rust.
         | sdajk3n423 wrote:
         | If you are looking to maximize use of that card, you can make
         | about $5 a day mining crypto with the 2070.
           | nevi-me wrote:
           | No, the high electricity cost in my country + the noise
           | pollution in the house + how much I generally earn from the
           | machine + my views on burning the world speculatively,
           | discourage me from mining crypto.
           | Perhaps my position might change in future, but for now, I'd
           | probably rather make the GPU accessible to those open-source
           | distributed grids that train chess engines or compute deep-
           | space related thingies :)
             | sdajk3n423 wrote:
             | I am not convinced that training AI to win at chess is any
             | more moral than mining crypto. And the block chain is about
             | as open-source as you can get.
         | pjmlp wrote:
         | A nice thing about the proper ALGOL-lineage systems programming
         | languages (from which C took only basic influence) is that you
         | can write nice high-level code and only deal with pointers and
         | raw pointer stuff when actually needed; think Ada, Modula-2,
         | Object Pascal kind of languages.
         | So something like CUDA Rust would be nice to have.
         | By the way, D already supports CUDA,
         | https://dlang.org/blog/2017/07/17/dcompute-gpgpu-with-native...
           | touisteur wrote:
           | CUDA Ada would be so, so nice. Especially with non-aliasing
           | guarantees from SPARK...
         | Tomte wrote:
         | > I have a RTX 2070 that's under-utilised
         | I've found that there are really good and beginner-friendly
         | Blender tutorials. Both free and paid ones.
       | andi999 wrote:
       | Actually I like pyCuda.
       | https://documen.tician.de/pycuda/tutorial.html
       | You can write all the boilerplate in Python and just the kernel
       | in C (which you can pass as a string and compile automatically
       | in your Python script). So far the workflow is much smoother than
       | with nvcc (and creating some DLL bindings for the C program).
       | kolbe wrote:
       | As someone who has dabbled in CUDA with some success, I'm going
       | to be a little contrarian here. To me, the difficulty with GPU
       | programming isn't the fact that CUDA uses C-syntax versus
       | something more readable like Python. GPU programming is
       | fundamentally difficult, and the minor gains from using a
       | familiar language syntax are dwarfed by the need to understand
       | blocks, memory alignment, thread hierarchy, etc. And I don't just
       | say this. I live it. Even though I primarily program in C#, I
       | don't use Hybridizer when I need GPU acceleration. I go straight
       | to CUDA and marshal everything to/from C#.
       | That's not to say that CUDA Python isn't kinda cool, but it's not
       | a magic bullet to finally understanding GPU programming if you've
       | been struggling.
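       | To make the block and thread-hierarchy point concrete: in a 1D
       | CUDA launch each thread derives a global index from its block
       | index, the block size, and its thread index within the block, and
       | the kernel needs a bounds check because the grid usually
       | overshoots the data. A pure-Python simulation of that indexing, as
       | a sketch only (it runs sequentially on the CPU; simulate_launch
       | and vector_add_kernel are hypothetical names, not part of any CUDA
       | API):

```python
def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    # Same index arithmetic a 1D CUDA kernel would use:
    # blockIdx.x * blockDim.x + threadIdx.x
    i = block_idx * block_dim + thread_idx
    if i < len(out):  # guard: the last block may extend past the data
        out[i] = a[i] + b[i]


def simulate_launch(kernel, n, threads_per_block, *args):
    # Round up so that every element is covered by some thread.
    blocks = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, threads_per_block, thread_idx, *args)
    return blocks


n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n
blocks = simulate_launch(vector_add_kernel, n, 4, a, b, out)
print(blocks, out)
```

       | With 10 elements and 4 threads per block, the launch needs 3
       | blocks, and the last two threads of the final block do nothing
       | because of the bounds check. None of this changes whether the
       | kernel body is spelled in C or in Python.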
       (page generated 2021-04-16 22:00 UTC)