[HN Gopher] Unifying the CUDA Python Ecosystem
___________________________________________________________________
  Unifying the CUDA Python Ecosystem
  Author : pjmlp
  Score  : 108 points
  Date   : 2021-04-16 14:44 UTC (7 hours ago)

(HTM) web link (developer.nvidia.com)
(TXT) w3m dump (developer.nvidia.com)

  | andrew_v4 wrote:
  | Just for contrast it's interesting to look at an example of writing a similar kernel in Julia:
  |
  | https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/
  |
  | I don't think it's possible to achieve something like this in python because of how it's interpreted (but it sounds a bit like what another comment mentioned, where the python was compiled to C).

  | rrss wrote:
  | I think the contrast is probably less about the language, and more about the scope and objective of the projects. The blog is describing low-level interfaces in python - probably more comparable is the old CUDAdrv.jl package (now merged into CUDA.jl):
  | https://github.com/JuliaGPU/CUDAdrv.jl/blob/master/examples/...
  |
  | Here is writing a similar kernel in python with numba:
  | https://github.com/ContinuumIO/gtc2017-numba/blob/master/4%2...
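  [Illustration, not from the thread: a minimal sketch of the kind of numba CUDA kernel the linked notebook walks through. The array names and sizes are made up; the @cuda.jit decorator, cuda.grid() and the kernel[blocks, threads](...) launch syntax are numba's documented CUDA API.]

      import numpy as np
      from numba import cuda

      @cuda.jit
      def add_kernel(x, y, out):
          i = cuda.grid(1)        # global thread index
          if i < x.size:          # guard threads past the end of the array
              out[i] = x[i] + y[i]

      n = 1_000_000
      x = np.ones(n, dtype=np.float32)
      y = 2 * np.ones(n, dtype=np.float32)
      out = np.empty_like(x)

      threads_per_block = 256
      blocks = (n + threads_per_block - 1) // threads_per_block
      add_kernel[blocks, threads_per_block](x, y, out)  # numba copies the host arrays for you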
  | jjoonathan wrote:
  | I gave numba CUDA a spin in late 2018 and was severely disappointed. It didn't work out of the box; I had to tweak the source to remove a reference to an API that had been removed from CUDA more than a year prior (and deprecated long ago). Then I ran into a bug when converting a float array to a double array -- I had to declare the types three different times and it still did a naive byte-copy rather than a conversion. Thanks to a background in numerics, the symptoms were obvious, but yikes. The problem that finally did us in was an inability to get buffers to correctly pass between kernels without a CPU copy, which was absolutely critical for our perf. I think this was supported in theory but just didn't work.
  |
  | In any case, we did a complete rewrite in CUDA proper in less time than we spent banging our heads against that last numba-CUDA issue.
  |
  | Under every language bridge there are trolls, and numba-CUDA had some mean ones. Hopefully things have gotten better, but I'm definitely still inside the "once bitten twice shy" period.

  | machineko wrote:
  | Every time there is a topic about python, there is this one Julia guy who spams the Julia "alternative" to the python solution in every topic. Can you guys just stop? It kinda feels like watching a cult.

  | anon_tor_12345 wrote:
  | I mentioned this in the response to the other comment, but straight compilation is exactly what numba does for CUDA support because, just like Julia, numba uses llvm as a middle end (and llvm has a ptx backend).

  | albertzeyer wrote:
  | JAX and TensorFlow functions both would convert some Python code to equivalent XLA code or a TF graph.

  | jjoonathan wrote:
  | > Julia has first-class support for GPU programming
  |
  | "First-class" is a steep claim. Does it support the nvidia perf tools? Those are very important for taking a kernel from (in my experience) ~20% theoretical perf to ~90% theoretical perf.

  | maleadt wrote:
  | Yeah, see this section of the documentation: https://juliagpu.gitlab.io/CUDA.jl/development/profiling/. CUDA.jl also supports NVTX, wraps CUPTI, etc. The full extent of the APIs and tools is available.
  |
  | Source line association when using PC sampling is currently broken due to a bug in the NVIDIA drivers though (segfaulting when parsing the PTX debug info emitted by LLVM), but I'm told that may be fixed in the next driver.

  | jjoonathan wrote:
  | Nice! I set a reminder to check back in a month.

  | klmadfejno wrote:
  | https://developer.nvidia.com/blog/gpu-computing-julia-progra...

  | jjoonathan wrote:
  | > CUDAnative.jl also [...] generates the necessary line number information for the NVIDIA Visual Profiler to work as expected
  |
  | That sounds very promising, but these tools are usually magnificent screenshot fodder, yet they are conspicuously absent from the screenshots, so I still have suspicions. Maybe I'll give it a try tonight and report back.

  | maleadt wrote:
  | Here's a screenshot: https://julialang.org/assets/blog/nvvp.png. Or a recent PR where you can see NVTX ranges from Julia: https://github.com/JuliaGPU/CUDA.jl/pull/760

  | jjoonathan wrote:
  | Thanks! Now I believe! :)

  | SloopJon wrote:
  | I thought for sure that someone would have posted a link to that xkcd comic by now. I only dabble with higher-level APIs, so I can't judge this on the merits. If NVIDIA really continues to back this, and follows through on wrapping other libraries like cuDNN, it could be a whole new level of vendor lock-in as people start writing code that targets CUDA Python. I think the real test will be whether one of the big projects like PyTorch or TensorFlow gets on board.

  | Ivoah wrote:
  | https://xkcd.com/927/
  |
  | https://xkcd.com/1987/

  | michelpp wrote:
  | About 8 years ago an NVIDIA developer released a tool called Copperhead that let you write CUDA kernels in straight Python that were then compiled to C, no "C-in-a-string" like what is shown here. I always thought it was so elegant and had great potential, and I introduced a lot of people in my circle to it, but then it seems NVIDIA buried it.
  |
  | This blog post is great, and we need these kinds of tools for sure, but we also need high-level expressibility that doesn't require writing kernels in C. I know there are other projects that have taken up that cause, but it would be great to see NVIDIA double down on something like Copperhead.

  | ZeroCool2u wrote:
  | Totally agree, Copperhead looks much easier to use. Perhaps one of the reasons they went and rebuilt from scratch is that Copperhead relies on Thrust and a couple of other dependencies?

  | anon_tor_12345 wrote:
  | That project might be abandoned, but this strategy is used in nvidia and nvidia-adjacent projects (through llvm):
  |
  | https://github.com/rapidsai/cudf/blob/branch-0.20/python/cud...
  |
  | https://github.com/gmarkall/numba/blob/master/numba/cuda/com...
  |
  | > but we also need high level expressibility that doesn't require writing kernels in C
  |
  | The above are possible because C is actually just a frontend to PTX:
  |
  | https://docs.nvidia.com/cuda/parallel-thread-execution/index...
  |
  | Fundamentally, you are never going to be able to write cuda kernels without thinking about cuda architecture, any more than you'll ever be able to write async code without thinking about concurrency.
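  [Illustration, not from the thread: one existing way to get a CUDA kernel out of straight Python, with no C-in-a-string, along the lines the Copperhead comment above asks for, is a numba CUDA ufunc. Names and sizes are made up; @vectorize with target='cuda' is numba's documented API.]

      import numpy as np
      from numba import vectorize

      # An elementwise function written in plain Python; numba compiles it
      # to a CUDA kernel and generates the launch/transfer boilerplate.
      @vectorize(['float32(float32, float32)'], target='cuda')
      def scaled_add(a, b):
          return 2.0 * a + b

      x = np.random.rand(1_000_000).astype(np.float32)
      y = np.random.rand(1_000_000).astype(np.float32)
      z = scaled_add(x, y)  # runs on the GPU, comes back as a NumPy array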
  | albertzeyer wrote:
  | Oh that sounds interesting. Do you know what happened to it?
  |
  | I think I found it here: https://github.com/bryancatanzaro/copperhead
  |
  | But I'm not sure what the state is. Looks dead (last commit 8 years ago). Probably just a proof of concept. But why hasn't this been continued?
  |
  | Blog post and example:
  | https://developer.nvidia.com/blog/copperhead-data-parallel-p...
  | https://github.com/bryancatanzaro/copperhead/blob/master/sam...
  |
  | Btw, for compiling on-the-fly from a string, I made something similar for our RETURNN project. Example for LSTM:
  | https://github.com/rwth-i6/returnn/blob/a5eaa4ab1bfd5f157628...
  |
  | This is made in a way that it compiles automatically into an op for Theano or TensorFlow (PyTorch could easily be added as well) and for both CPU and CUDA/GPU.

  | dwrodri wrote:
  | I don't know specifics about Copperhead in particular, but Bryan Catanzaro (creator of Copperhead) is now the VP of Applied Deep Learning Research at Nvidia. He gave a talk at GTC this year, which is how I heard about all of this in the first place.
  |
  | Source: https://www.linkedin.com/in/bryancatanzaro/

  | BiteCode_dev wrote:
  | In the IP world, there are some hidden gems that disappear without a trace one day.
  |
  | I worked for a client that had this wonderful Python dsl that compiled to verilog and vhdl. It was much easier to use than writing the stuff the old way. Much more composable too, not to mention tooling.
  |
  | They created that by forking an open source project dating back to Python 2.5 that I could never find again.
  |
  | Imagine if that stuff were still alive today. You could have a market for paid pypi.org instances providing you with pip-installable IP components you can compose and customize easily.
  |
  | But in this market, sharing is not really a virtue.

  | eslaught wrote:
  | As it turns out, NVIDIA just open sourced a product called Legate, which handles not just GPUs but distributed execution as well. Right now it supports NumPy and Pandas, but perhaps they'll add others in the future. Just thought this might be up your alley since it works at a higher level than the glorified CUDA in the article.
  |
  | https://github.com/nv-legate/legate.numpy
  |
  | Disclaimer: I work on the project they used to do the distributed execution, but otherwise have no connection with Legate.
  |
  | Edit: And this library was developed by a team managed by one of the original Copperhead developers, in case you're wondering.

  | nuisance-bear wrote:
  | Tools to make GPU development easier are sorely needed.
  |
  | I foolishly built an options pricing engine on top of PyTorch, thinking "oooh, it's a fast array library that supports CUDA transparently". Only to find out that array indexing is 100x slower than numpy.

  | eslaught wrote:
  | You might be interested in Legate [1]. It supports the NumPy interface as a drop-in replacement, supports GPUs and also distributed machines. And you can see for yourself their performance results; they're not far off from hand-tuned MPI.
  |
  | [1]: https://github.com/nv-legate/legate.numpy
  |
  | Disclaimer: I work on the library Legate uses for distributed computing, but otherwise have no connection.
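  [Illustration, not from the thread, of what "drop-in replacement" means here. The import path is assumed from the linked nv-legate/legate.numpy repo; treat this as a sketch rather than verified API.]

      import legate.numpy as np   # instead of `import numpy as np`

      # Unchanged NumPy-style code; Legate decides how and where to run it.
      x = np.ones((1000, 1000))
      y = x + 2 * x
      print(y.sum())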
  | [deleted]

  | TuringNYC wrote:
  | >>> built an options pricing engine on top of PyTorch
  |
  | I'd love to hear more about this! Do you have any posts or write-ups on this?

  | sideshowb wrote:
  | Interesting find about the indexing. I just had the opposite experience, swapped from numpy to torch in a project and got a 2000x speedup on some indexing and basic maths wrapped in autodiff. And I haven't moved it onto cuda yet.

  | nuisance-bear wrote:
  | Here's an example that illustrates the phenomenon. If memory serves me right, index latency is superlinear in dimension count.
  |
  |     import time, torch
  |     from itertools import product
  |
  |     N = 100
  |     ten = torch.randn(N, N, N)
  |     arr = ten.numpy()
  |
  |     def indexTimer(val):
  |         start = time.time()
  |         for i, j, k in product(range(N), range(N), range(N)):
  |             x = val[i, j, k]
  |         end = time.time()
  |         print('{:.2f}'.format(end - start))
  |
  |     indexTimer(ten)
  |     indexTimer(arr)

  | rubatuga wrote:
  | Somewhat related, I've tried running compute shaders using wgpu-py:
  |
  | https://github.com/pygfx/wgpu-py
  |
  | You can define any compute shader you like in Python, annotate it with the data types, and it compiles to SPIR-V and runs under macOS, Linux and Windows.

  | The_rationalist wrote:
  | Note that you can write CUDA in many languages such as Java, Kotlin, Python, Ruby, JS, R with https://github.com/NVIDIA/grcuda

  | zcw100 wrote:
  | There's a lot of may's, should's, and could's in there.

  | nevi-me wrote:
  | I have an RTX 2070 that's under-utilised, partly because I'm surprisingly finding it hard to understand C, C++ and, by extension, CUDA.
  |
  | I'm self-taught, and have been using web languages and some python, before learning Rust. I hope that NVIDIA can dedicate some resources to creating high-quality bindings to the C API for Rust, even if in the next 1-2 years.
  |
  | Perhaps being able to use a systems language that's been easy for me coming from TypeScript and Kotlin could inspire me to take baby steps with CUDA, without worrying about understanding C.
  |
  | I like the CUDA.jl package, and once I make time to learn Julia, I would love to try that out. From this article about the Python library, I'm still left knowing very little about "how can I parallelise this function".

  | jkelleyrtp wrote:
  | +1 Would love to see official support for CUDA for Rust.

  | sdajk3n423 wrote:
  | If you are looking to maximize use of that card, you can make about $5 a day mining crypto with the 2070.

  | nevi-me wrote:
  | No, the high electricity cost in my country + the noise pollution in the house + how much I generally earn from the machine + my views on burning the world speculatively discourage me from mining crypto.
  |
  | Perhaps my position might change in future, but for now, I'd probably rather make the GPU accessible to those open-source distributed grids that train chess engines or compute deep-space related thingies :)

  | sdajk3n423 wrote:
  | I am not convinced that training AI to win at chess is any more moral than mining crypto. And the blockchain is about as open-source as you can get.

  | pjmlp wrote:
  | A nice thing about the proper ALGOL-lineage systems programming languages (from which C takes only basic influence) is that you can write nice high-level code and only deal with pointers and raw pointer stuff when actually needed; think Ada, Modula-2, Object Pascal kind of languages.
  |
  | So something like CUDA Rust would be nice to have.
  |
  | By the way, D already supports CUDA:
  |
  | https://dlang.org/blog/2017/07/17/dcompute-gpgpu-with-native...

  | touisteur wrote:
  | CUDA Ada would be so, so nice. Especially with non-aliasing guarantees from SPARK...

  | Tomte wrote:
  | > I have an RTX 2070 that's under-utilised
  |
  | I've found that there are really good and beginner-friendly Blender tutorials. Both free and paid ones.

  | andi999 wrote:
  | Actually I like PyCUDA. https://documen.tician.de/pycuda/tutorial.html
  |
  | You can write all the boilerplate in python and just the kernel in C (which you can pass as a string and compile automatically in your python script). So far the workflow is much smoother than with nvcc (and creating some DLL bindings for the C program).
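  [Illustration, not from the thread: the workflow described above, condensed along the lines of the SourceModule example in the linked PyCUDA tutorial; names and sizes here are for illustration.]

      import numpy as np
      import pycuda.autoinit             # creates a CUDA context on the default device
      import pycuda.driver as drv
      from pycuda.compiler import SourceModule

      # The kernel is ordinary CUDA C in a string; PyCUDA compiles it with nvcc at runtime.
      mod = SourceModule("""
      __global__ void multiply_them(float *dest, float *a, float *b)
      {
          const int i = threadIdx.x;
          dest[i] = a[i] * b[i];
      }
      """)
      multiply_them = mod.get_function("multiply_them")

      a = np.random.randn(400).astype(np.float32)
      b = np.random.randn(400).astype(np.float32)
      dest = np.zeros_like(a)

      # drv.In/drv.Out take care of the host<->device copies around the launch.
      multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400, 1, 1), grid=(1, 1))
      print(np.allclose(dest, a * b))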
  | kolbe wrote:
  | As someone who has dabbled in CUDA with some success, I'm going to be a little contrarian here. To me, the difficulty with GPU programming isn't the fact that CUDA uses C syntax versus something more readable like Python. GPU programming is fundamentally difficult, and the minor gains from using a familiar language syntax are dwarfed by the need to understand blocks, memory alignment, thread hierarchy, etc. And I don't just say this. I live it. Even though I primarily program in C#, I don't use Hybridizer when I need GPU acceleration. I go straight to CUDA and marshal everything to/from C#.
  |
  | That's not to say that CUDA Python isn't kinda cool, but it's not a magic bullet to finally understanding GPU programming if you've been struggling.
___________________________________________________________________
(page generated 2021-04-16 22:00 UTC)