[HN Gopher] What I Wish Someone Had Told Me About Tensor Computa...
___________________________________________________________________

What I Wish Someone Had Told Me About Tensor Computation Libraries

Author : _eigenfoo
Score  : 75 points
Date   : 2020-12-15 20:46 UTC (2 hours ago)

(HTM) web link (eigenfoo.xyz)
(TXT) w3m dump (eigenfoo.xyz)

| 37ef_ced3 wrote:
| NN-512 (https://NN-512.com)
|
| Generate fully vectorized, stand-alone, human-readable C99 code
| for neural net inference, and understand exactly what's
| happening. For example, watch the code run with Linux's perf top
| and see the relative costs of each layer of the computation.
| Total transparency, no dependencies outside the C POSIX library.
|
| DSingularity wrote:
| Yummy. Thanks. Gonna bookmark that one.
|
| joshuamorton wrote:
| In what sense is this "better"?
|
| The generated code is like
|
|   __m512i wfs16 = _mm512_castsi256_si512(_mm512_cvtps_ph(wf25,
|       _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC));
|   wfs16 = _mm512_inserti64x4(wfs16, _mm512_cvtps_ph(wf26,
|       _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC), 1);
|   _mm512_mask_storeu_epi32(wfPtr1+230400+38400*i5+768*c2+128*k1+64*m2+16*f3, 3855, wfs16);
|   _mm512_mask_storeu_epi32(wfPtr1+345584+38400*i5+768*c2+128*k1+64*m2+16*f3, 61680, wfs16);
|
| (which is a set of 4 lines that appear in the middle of an ~800
| line function).
|
| That's not "human readable".
|
| Sure, you can use asan or gdb, but if gdb profiles slowly, what
| can you _do_? You're still at the mercy of the code generator to
| be able to optimize things.
|
| 37ef_ced3 wrote:
| The inference function (at the end of the C file) is a series of
| blocks, each block corresponding to a convolution or other
| complex operation. It's straightforward to see which, by looking
| at where the weights come from (a field in a struct that has the
| same name as the layer in your graph).
|
| If you use perf top (for example) you can see which convolution
| was most expensive, and why. Does the shape of the tensor produce
| many small partial blocks around the edge, so the packing is
| inefficient (a lot of tile overhang), for example? You can see
| that by glancing at the code and seeing that there are many
| optimized blocks around the edges.
|
| As a rule, if NN-512 generates small code for a tensor (few edge
| cases), you have chosen an efficient tensor shape with respect to
| the tile.
|
| Or you might find that batch normalization is being done at
| inference time (as in DenseNet), instead of being integrated into
| the convolution weights (as in ResNet), because there's fanout
| from the source and a ReLU in between. You can see that in the
| code.
|
| Is the matrix multiplication slow because there are too few
| channels per group (as in ResNeXt)? Easy to see in perf: make
| your groups bigger. Are you using an inefficient filter shape, so
| we have to fall back to a slower general-purpose convolution? You
| can easily see whether Winograd or Fourier was used.
|
| I agree, if you don't know anything about how convolution is
| implemented (filter packing, data packing, matrix multiplication,
| sum unpacking), you could be lost. But it's very shallow compared
| to a JIT or CUDA library scheme, and a knowledgeable ML
| performance engineer would have no difficulty.
|
| akhilcacharya wrote:
| I'm truly baffled as to why such a sophisticated and useful
| package is being distributed and advertised by an anonymous
| individual.
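To make the batch-norm folding mentioned above concrete, here is a
minimal NumPy sketch of absorbing per-channel BN statistics into
convolution weights and bias at inference time. The shapes and
variable names are illustrative assumptions, not NN-512's.

    import numpy as np

    # Illustrative conv layer: 64 output channels, 32 input channels,
    # 3x3 filters, plus per-output-channel batch-norm statistics.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 32, 3, 3)).astype(np.float32)  # conv weights
    b = np.zeros(64, dtype=np.float32)                          # conv bias
    gamma = rng.standard_normal(64).astype(np.float32)          # BN scale
    beta = rng.standard_normal(64).astype(np.float32)           # BN shift
    mean = rng.standard_normal(64).astype(np.float32)           # BN running mean
    var = rng.random(64).astype(np.float32)                     # BN running variance
    eps = 1e-5

    # BN(conv(x) + b) = scale * (conv(x) + b - mean) + beta, where
    # scale = gamma / sqrt(var + eps). Folding the scale into the
    # weights and the rest into the bias removes BN from inference.
    scale = gamma / np.sqrt(var + eps)
    w_folded = w * scale[:, None, None, None]
    b_folded = (b - mean) * scale + beta

This only works when nothing (such as the fanout or ReLU mentioned
above) sits between the convolution and the batch normalization.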
| Const-me wrote:
| I wonder: does any of them have proper Windows support, i.e.
| DirectCompute?
|
| CUDA is NVidia-only, and vendor lock-in is bad for end users.
| CUDA, OpenCL and VK all require large runtimes which are not
| included in the OS; software vendors like me need to redistribute
| and support them, and I tend to avoid deploying libraries when I
| can.
|
| yongjik wrote:
| > with dynamically generated graphs, the computational graph is
| never actually defined anywhere: the computation is traced out on
| the fly and behind the scenes. You can no longer do anything
| interesting with the computational graph: for example, if the
| computation is slow, you can't reason about what parts of the
| graph are slow.
|
| Hmm, my experience is the opposite. When I used Tensorflow, there
| was no way I could figure out why something was slow, or why it
| required huge amounts of memory. All I had was a gigantic black
| box.
|
| Meanwhile, in PyTorch, all I have to do is run it with
| CUDA_LAUNCH_BLOCKING=1, and it will give me an accurate picture
| of exactly how many milliseconds each line is taking! (Just print
| the current time before/after the line.) With nvprof it will even
| tell you which CUDA kernels are executing.
|
| * Disclaimer: Haven't dabbled in ML for ~a year, so my view might
| be outdated now.
|
| whimsicalism wrote:
| Eh. I love PyTorch, but it can definitely be difficult to reason
| about at times. For instance, due to async dispatch on the GPU,
| you could get assertion errors where a line fails, but the real
| error was actually several lines above.
|
| That was difficult to reason about.
|
| atorodius wrote:
| Wouldn't this be fixed by CUDA_LAUNCH_BLOCKING=1? Or by putting a
| bunch of torch.cuda.synchronize() calls in the suspected lines.
|
| whimsicalism wrote:
| lol whoops, yeah, that would definitely solve the problem. I
| wasn't familiar with `CUDA_LAUNCH_BLOCKING`, but
| `torch.cuda.synchronize()` does work.
|
| cygaril wrote:
| Seems to have missed the existence of jax.jit, which basically
| constructs an XLA program (call it a graph if you like) from your
| Python function, which can then be optimized.
|
| easde wrote:
| TorchScript JIT (torch.jit.script) is similar for PyTorch.
|
| komuher wrote:
| Not even close. jax.jit allows you to compute almost anything
| using lax.fori_loop, lax.cond, and other lax and jax constructs;
| PyTorch JIT does not allow that, it's just extra optimization for
| static PyTorch functions.
|
| hyperbovine wrote:
| No autodiff for most of these though.
|
| JHonaker wrote:
| In the section titled JAX:
|
| > But JAX even lets you just-in-time compile your own Python
| functions into XLA-optimized kernels...
|
| dragandj wrote:
| Let me chip in with some self-promotion.
|
| This book explains and executes _every single line_ of code
| interactively, from low-level operations to high-level networks
| that do everything automatically. The code is built on the
| state-of-the-art performance operations of oneDNN (Intel, CPU)
| and cuDNN (CUDA, GPU). Very concise, _readable and
| understandable_ by humans.
|
| https://aiprobook.com/deep-learning-for-programmers/
|
| Here's the open source library built throughout the book:
|
| https://github.com/uncomplicate/deep-diamond
|
| Some chapters from the beginning of the book are available on my
| blog, as a tutorial series:
|
| https://dragan.rocks
|
| cmarschner wrote:
| Tensorflow 1.0 has its roots in how Theano was built. Same thing:
| a statically built graph that is run through a compilation step,
| with a numpy-like API. So what makes Theano such an ingenious
| concept while TF is regarded as "programming through a keyhole"?
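To make the per-line timing trick discussed above (yongjik,
atorodius) concrete, here is a minimal sketch; it assumes a
CUDA-capable PyTorch install, and the tensor shapes are arbitrary.
Running the whole script with CUDA_LAUNCH_BLOCKING=1 makes every
kernel launch synchronous; the explicit torch.cuda.synchronize()
calls below do the same for just the region being measured.

    import time
    import torch

    x = torch.randn(4096, 4096, device="cuda")
    w = torch.randn(4096, 4096, device="cuda")

    # Without synchronization, the matmul launches asynchronously and
    # the timing would mostly measure the launch, not the kernel.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    y = x @ w                      # the line under suspicion
    torch.cuda.synchronize()
    print(f"matmul took {(time.perf_counter() - t0) * 1e3:.2f} ms")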
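Similarly, a minimal sketch of the jax.jit point made by cygaril and
komuher: the Python function below, including its lax.fori_loop and
lax.cond control flow, is traced into a single XLA program that can
be inspected and optimized. The function itself is made up purely
for illustration.

    import jax
    import jax.numpy as jnp
    from jax import lax

    @jax.jit
    def f(x):
        # The loop and the branch are staged into the XLA program
        # rather than run eagerly in Python.
        def body(i, acc):
            return acc + jnp.sin(x) * i

        acc = lax.fori_loop(0, 10, body, jnp.zeros_like(x))
        return lax.cond(jnp.sum(acc) > 0,
                        lambda a: a * 2.0,
                        lambda a: -a,
                        acc)

    print(f(jnp.arange(4.0)))
    # jax.make_jaxpr(f)(jnp.arange(4.0)) prints the traced program,
    # i.e. "the graph" that XLA then compiles.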
___________________________________________________________________
(page generated 2020-12-15 23:00 UTC)