[HN Gopher] What happens when you vectorize wide PyTorch express...
___________________________________________________________________
What happens when you vectorize wide PyTorch expressions?
Author : mrcslws
Score  : 107 points
Date   : 2023-10-26 13:41 UTC (9 hours ago)
(HTM) web link (probablymarcus.com)
(TXT) w3m dump (probablymarcus.com)

| intalentive wrote:
| Did you leave Numenta? Enjoyed the paper discussions you all posted to YT.
| mrcslws wrote:
| Glad to hear :)
|
| Yes, I'm off doing my own thing now. Deep Learning went so much further than I ever expected, and now I'm drawn to all the things that can be built today. Who knows, maybe I'll swing back into neuroscience in a few years. (Still friends with my old coworkers / bosses.)
| gregjm wrote:
| > My so-called CPU "active" time is actually an inferred value; CUDA spins the CPU 100% constantly, even when the CPU is just waiting for the GPU
|
| The CUDA Runtime and Driver APIs allow you to use "blocking synchronization", where the CPU will go to sleep while waiting for synchronization with the device. However, it seems that PyTorch doesn't expose this functionality in any of its Python APIs:
|
| https://github.com/pytorch/pytorch/issues/28224
|
| What happens when you try using ctypes to call into libcudart.so to set the device flags as described in the above issue? You'll have to call torch.cuda.init() for it to work, and unfortunately it won't work if PyTorch is launching kernels from other threads.
| mrcslws wrote:
| Aha, I was hoping to learn about something like this, thanks for sharing. I'll try this some time. PyTorch does use different threads for the forward and backward pass, so as you suggest, setting that flag might only improve the forward pass.
| gregjm wrote:
| The CUDA Runtime and Driver APIs have per-thread state, so using threads would unfortunately bypass our trick here to set the flag. Assuming you're on Linux, I might suggest creating a shared library to intercept calls to the Driver API, as all Runtime functions are implemented as wrappers around Driver functions. You'd have to intercept all calls to context creation and flag setting:
|
|   * `cuCtxCreate`
|   * `cuCtxCreate_v3`
|   * `cuCtxSetFlags`
|   * `cuDevicePrimaryCtxRetain`
|   * `cuDevicePrimaryCtxSetFlags`
|
| ... and make sure that the three least significant bits of any `flags` variable are set to `CU_CTX_SCHED_BLOCKING_SYNC`.
|
| cuDevicePrimaryCtxSetFlags: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PR...
|
| dlsym(3): https://man.archlinux.org/man/dlsym.3.en
|
| ld.so(8): https://man.archlinux.org/man/ld.so.8.en
| bee_rider wrote:
| I'm somewhat confused as to what _is_ exposed, as the description in the quote sounds like a blocking call, but with a busy wait, which seems like it couldn't be the only or main thing that PyTorch exposes.
| Filligree wrote:
| That is indeed the only API that it exposes.
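(A rough sketch of the ctypes experiment gregjm describes above. It is untested here: the exact libcudart library name varies by install, and whether the flag still takes effect once PyTorch's context exists is an open question; the constant is the CUDA Runtime API's cudaDeviceScheduleBlockingSync value.)

    import ctypes

    import torch

    # cudaDeviceScheduleBlockingSync == 0x04 in the CUDA Runtime headers.
    CUDA_DEVICE_SCHEDULE_BLOCKING_SYNC = 0x04

    # Per the suggestion above: make sure PyTorch has initialized CUDA first.
    torch.cuda.init()

    # Library name is an assumption; PyTorch wheels often bundle their own
    # libcudart (e.g. libcudart.so.12), so the path may need adjusting.
    libcudart = ctypes.CDLL("libcudart.so")

    ret = libcudart.cudaSetDeviceFlags(CUDA_DEVICE_SCHEDULE_BLOCKING_SYNC)
    print("cudaSetDeviceFlags returned", ret)  # 0 == cudaSuccess
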
| pixelpoet wrote:
| I really hope those pow(x, 2) calls are getting turned into x * x, else it's a performance catastrophe / extreme beginner mistake even with vectorisation.
|
| Also, this kind of ultra wide buffering consumes a ton of memory bandwidth for each operation, instead of keeping a small portion in cache/registers. FLOPs keep scaling more or less without limit, whereas memory speed is comparatively flat, so this is increasingly a losing game; just because it's faster than glacial Python doesn't mean it's fast compared to a language which actually concerns itself with performance or a more cache-aware approach.
|
| For an extreme example of how you can even sometimes beat ultra optimised GPU ML libraries in this way, check out https://github.com/NVlabs/tiny-cuda-nn
| mrcslws wrote:
| I wondered about this same thing. Your logic about cache/registers is certainly true on CPUs, but what about GPUs? Hence this blurb:
|
| > I studied the CUDA traces closely and found that vectorization does indeed reduce many aspects of the GPU workload, greatly reducing the number of operations and decreasing the total amount of time spent on the fundamental computations of the algorithm. However it also introduces overhead (mentioned above) by interspersing operations that permute and reorder the tensors, or splitting them into groups then concatenating results. Sometimes the reduced "fundamental" time outweighs the additional overhead, while other times the overhead outweighs the reduction in fundamental time.
|
| Here are some examples not included in the blog post:
|
|   - Total time spent in aten::cdist kernel
|     - Baseline:   2.834s (4900 calls)
|     - Vectorized: 2.686s (500 calls)
|
|   - Total time spent in aten::mul kernel
|     - Baseline:   5.745s (80700 calls)
|     - Vectorized: 5.555s (8100 calls)
|
| This nice little win applies to tons of other kernels, almost across the board. As you point out, CPU intuition suggests this should have been _slower_, so this was an interesting outcome.
|
| On the other hand, some specific increases occur:
|
|   - Total time spent in aten::cat kernel
|     - Baseline:   0.680s
|     - Vectorized: 1.849s
|
| So working in fewer, larger batches doesn't _only_ enable outrunning the GPU. It decreases the total GPU workload... then adds some overhead. But some of this overhead could be removed with custom CUDA kernels, so I think this is an interesting direction even if you solve the CPU problem some other way.
|
| (The pow(x, 2) is only there in the toy code, not my actual kernel, so I didn't performance-tune it.)
| nixpulvis wrote:
| What's the state-of-the-art in terms of compiler optimization here? Seems like auto-vectorization could be a somewhat simple transform, no?
| voz_ wrote:
| Pretty cool to see people using compile in the wild :)
| mrcslws wrote:
| Yeah, one unspoken theme of this blog post is "look how nice torch.compile is" :)
|
| Fun fact, I had to put in extra work to get torch.compile working with my code, for understandable reasons. My library, Vexpr, literally runs an interpreter inside of Python, reading a big tree-like namedtuple-of-namedtuples "expression" data structure and evaluating it recursively. That data structure was way too fancy for torch.compile's guards, so I actually wrote code [1] that converts a Vexpr expression into a big Python code string and evals it, factoring the interpreter out of the code, then I pass _that_ eval'd string into torch.compile.
|
| One torch.compile capability I would be excited to see is compatibility with torch.vmap. One selling point of Vexpr is that you can use vmap with it, so I was sad when I found I couldn't use vmap and still support torch.compile. This made me convert a bunch of my GP kernels [2] to be batch-aware. (This missing capability is also understandable -- both vmap and compile are new.)
|
| Anyway, I'm a fan of what y'all are doing!
|
| [1] https://github.com/outergroup/vexpr/blob/e732e034768443386f9...
| [2] https://github.com/outergroup/outer-loop-cookbook/blob/5d94c...
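(For readers unfamiliar with the trick described above, here is a minimal toy sketch of "turn an expression tree into a code string, eval it, then torch.compile the result". It is not Vexpr's actual code, see link [1] for that; the Expr type and to_source helper are invented purely for illustration.)

    import torch
    from collections import namedtuple

    # A toy stand-in for a Vexpr-style expression node; not the real Vexpr types.
    Expr = namedtuple("Expr", ["op", "args"])

    def to_source(expr):
        # Recursively turn the expression tree into Python source text.
        if isinstance(expr, Expr):
            args = ", ".join(to_source(a) for a in expr.args)
            return f"torch.{expr.op}({args})"
        if isinstance(expr, str):
            return expr            # a leaf naming an input tensor
        return repr(expr)          # a numeric constant

    # (x ** 2 + y).sum() written as a tree, with "x" and "y" as inputs.
    tree = Expr("sum", [Expr("add", [Expr("pow", ["x", 2]), "y"])])

    src = f"def f(x, y):\n    return {to_source(tree)}\n"
    namespace = {"torch": torch}
    exec(src, namespace)                  # materialize the generated function
    f = torch.compile(namespace["f"])     # a flat Python function is easy to trace

    print(src)
    print(f(torch.randn(5), torch.randn(5)))
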
| bcoates wrote:
| "For example, what if the parallel sums are of different lengths? On GPUs, fast parallel reductions only work when inputs all have the same length. [...] Vexpr's vectorizer groups the inputs by length and performs a reduced number of operations--one for each unique length."
|
| I'm surprised this is necessary, I thought modern vectorization on both CPU and GPU handled heterogeneous vectorization cases like this handily with conditional execution (on SIMT GPUs) or mask registers (on SIMD CPUs)
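(A minimal sketch of the "group inputs by length, one reduction per unique length" idea quoted above. This is illustrative only, not Vexpr's implementation; the grouped_sums helper is invented for the example and keeps its grouping bookkeeping in plain Python.)

    import torch
    from collections import defaultdict

    def grouped_sums(tensors):
        # Sum each 1-D tensor, launching one batched reduction per unique
        # length instead of one reduction per tensor.
        groups = defaultdict(list)          # length -> [(original index, tensor), ...]
        for i, t in enumerate(tensors):
            groups[t.shape[0]].append((i, t))

        out = [None] * len(tensors)
        for length, members in groups.items():
            indices, ts = zip(*members)
            sums = torch.stack(ts).sum(dim=1)   # one reduction for this length group
            for i, s in zip(indices, sums):
                out[i] = s
        return torch.stack(out)

    # Six ragged inputs, but only three distinct lengths -> three reductions.
    xs = [torch.randn(n) for n in (3, 5, 3, 7, 5, 5)]
    print(grouped_sums(xs))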