[HN Gopher] What I Wish Someone Had Told Me About Tensor Computation Libraries
       ___________________________________________________________________
        
       What I Wish Someone Had Told Me About Tensor Computation Libraries
        
       Author : _eigenfoo
       Score  : 75 points
       Date   : 2020-12-15 20:46 UTC (2 hours ago)
        
 (HTM) web link (eigenfoo.xyz)
 (TXT) w3m dump (eigenfoo.xyz)
        
       | 37ef_ced3 wrote:
       | NN-512 (https://NN-512.com)
       | 
       | Generate fully vectorized, stand-alone, human-readable C99 code
       | for neural net inference, and understand exactly what's
       | happening. For example, watch the code run with Linux's perf top
       | and see the relative costs of each layer of the computation.
       | Total transparency, no dependencies outside the C POSIX library
        
         | DSingularity wrote:
         | Yummy. Thanks. Gonna bookmark that one.
        
         | joshuamorton wrote:
         | In what sense is this "better"?
         | 
          | The generated code is like
          | 
          |     __m512i wfs16 = _mm512_castsi256_si512(_mm512_cvtps_ph(wf25,
          |         _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC));
          |     wfs16 = _mm512_inserti64x4(wfs16, _mm512_cvtps_ph(wf26,
          |         _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC), 1);
          |     _mm512_mask_storeu_epi32(wfPtr1+230400+38400*i5+768*c2
          |         +128*k1+64*m2+16*f3, 3855, wfs16);
          |     _mm512_mask_storeu_epi32(wfPtr1+345584+38400*i5+768*c2
          |         +128*k1+64*m2+16*f3, 61680, wfs16);
          | 
          | (which is a set of 4 statements that appear in the middle of an
          | ~800-line function).
         | 
         | That's not "human readable".
         | 
          | Sure, you can use asan or gdb, but if profiling shows it's
          | slow, what can you _do_? You're still at the mercy of the code
          | generator to be able to optimize things.
        
           | 37ef_ced3 wrote:
           | The inference function (at the end of the C file) is a series
           | of blocks, each block corresponding to a convolution or other
           | complex operation. It's straightforward to see which, by
           | looking at where the weights come from (a field in a struct
           | that has the same name as the layer in your graph)
           | 
            | If you use perf top (for example) you can see which
            | convolution was most expensive, and why. Does the shape of
           | the tensor produce many small partial blocks around the edge,
           | so the packing is inefficient (a lot of tile overhang), for
           | example? You can see that by glancing at the code and seeing
           | that there are many optimized blocks around the edges
           | 
           | As a rule, if NN-512 generates small code for a tensor (few
           | edge cases) you have chosen an efficient tensor shape, with
           | respect to the tile
           | 
           | Or you might find that batch normalization is being done at
           | inference time (as in DenseNet), instead of being integrated
           | into the convolution weights (as in ResNet), because there's
           | fanout from the source and a ReLU in between. You can see
           | that in the code
           | 
           | Is the matrix multiplication slow because there are too few
           | channels per group (as in ResNeXt)? Easy to see in perf, make
           | your groups bigger. Are you using an inefficient filter
           | shape, so we have to fall back to a slower general purpose
           | convolution? You can easily see whether Winograd or Fourier
           | was used
           | 
           | I agree, if you don't know anything about how convolution is
           | implemented (filter packing, data packing, matrix
           | multiplication, sum unpacking), you could be lost. But it's
           | very shallow compared to a JIT or CUDA library scheme, and a
           | knowledgeable ML performance engineer would have no
           | difficulty
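            | 
            | For anyone unfamiliar with that decomposition, here is a toy
            | illustration (plain NumPy, nothing like NN-512's actual
            | output; conv1d_as_matmul is a made-up name) of a 1D
            | convolution done as data packing followed by one matrix
            | multiplication:
            | 
            |     import numpy as np
            | 
            |     def conv1d_as_matmul(x, w):
            |         # "data packing": gather sliding windows into a matrix
            |         k = w.shape[0]
            |         n_out = x.shape[0] - k + 1
            |         cols = np.stack([x[i:i + k] for i in range(n_out)])
            |         # "matrix multiplication": one GEMM does all the
            |         # sliding dot products at once
            |         return cols @ w
            | 
            |     x = np.arange(8, dtype=np.float64)
            |     w = np.array([1.0, 0.0, -1.0])
            |     print(conv1d_as_matmul(x, w))
            |     print(np.convolve(x, w[::-1], mode="valid"))  # same result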
        
             | akhilcacharya wrote:
             | I'm truly baffled as to why such a sophisticated and useful
             | package is being distributed and advertised by an anonymous
             | individual.
        
       | Const-me wrote:
        | I wonder whether any of them has proper Windows support, i.e.
        | DirectCompute?
        | 
        | CUDA is NVidia-only, and vendor lock-in is bad for end users.
        | CUDA, OpenCL and VK all require large runtimes which are not
        | included in the OS; software vendors like me need to redistribute
        | and support them, so I tend to avoid deploying libraries when I
        | can.
        
       | yongjik wrote:
       | > with dynamically generated graphs, the computational graph is
       | never actually defined anywhere: the computation is traced out on
       | the fly and behind the scene. You can no longer do anything
       | interesting with the computational graph: for example, if the
       | computation is slow, you can't reason about what parts of the
       | graph are slow.
       | 
        | Hmm, my experience is the opposite. When I used Tensorflow, there
        | was no way I could figure out why something was slow or required
        | huge memory. All I had was a gigantic black box.
       | 
        | Meanwhile, in PyTorch, all I have to do is run it with
        | CUDA_LAUNCH_BLOCKING=1, and it will give me an accurate picture
        | of exactly how many milliseconds each line is taking! (Just print
        | the current time before/after the line.) With nvprof it will even
        | tell you which CUDA kernels are executing.
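        | 
        | A minimal sketch of that per-line timing (illustrative only; the
        | script name and tensor shapes are made up):
        | 
        |     # run as: CUDA_LAUNCH_BLOCKING=1 python profile_lines.py
        |     import time
        |     import torch
        | 
        |     x = torch.randn(4096, 4096, device="cuda")
        | 
        |     t0 = time.time()
        |     y = x @ x  # with blocking launches, this waits for the kernel
        |     print(f"matmul: {(time.time() - t0) * 1e3:.1f} ms")
        | 
        |     t0 = time.time()
        |     z = torch.relu(y)
        |     print(f"relu:   {(time.time() - t0) * 1e3:.1f} ms")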
       | 
       | * Disclaimer: Haven't dabbled in ML for ~a year, so my view might
       | be outdated now.
        
         | whimsicalism wrote:
         | Eh. I love pytorch, but it can definitely be difficult to
         | reason about at times. For instance, due to async dispatch on
         | GPU, you could get assertion errors where a line fails, but the
         | real error was actually several lines above.
         | 
         | That was difficult to reason about.
        
           | atorodius wrote:
            | Wouldn't this be fixed by CUDA_LAUNCH_BLOCKING=1? Or putting
            | a bunch of torch.cuda.synchronize() calls around the
            | suspected lines.
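            | 
            | Roughly like this (hypothetical helper, not a PyTorch API):
            | 
            |     import time
            |     import torch
            | 
            |     def timed(label, fn, *args):
            |         # synchronize before and after so the wall-clock time
            |         # covers exactly the GPU work launched by fn
            |         torch.cuda.synchronize()
            |         t0 = time.time()
            |         out = fn(*args)
            |         torch.cuda.synchronize()
            |         print(f"{label}: {(time.time() - t0) * 1e3:.2f} ms")
            |         return out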
        
             | whimsicalism wrote:
              | lol whoops yeah that would definitely solve the problem. I
              | wasn't familiar with `CUDA_LAUNCH_BLOCKING`, but
              | `torch.cuda.synchronize` does work.
        
       | cygaril wrote:
       | Seems to have missed the existence of jax.jit, which basically
       | constructs an XLA program (call it a graph if you like) from your
       | Python function which can then be optimized.
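        | 
        | A small illustration (toy function, not from the article):
        | jax.make_jaxpr shows the traced program, and jax.jit compiles it
        | to XLA:
        | 
        |     import jax
        |     import jax.numpy as jnp
        | 
        |     def f(x):
        |         return jnp.sin(x) ** 2 + jnp.cos(x) ** 2
        | 
        |     # the traced program ("graph") is a first-class object
        |     print(jax.make_jaxpr(f)(jnp.ones(3)))
        | 
        |     fast_f = jax.jit(f)  # compiled to an XLA program
        |     print(fast_f(jnp.ones(3)))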
        
         | easde wrote:
         | TorchScript JIT (torch.jit.script) is similar for PyTorch.
        
           | komuher wrote:
            | Not even close. jax.jit allows you to compute almost anything
            | using lax.fori_loop, lax.cond and other lax and jax
            | constructs. PyTorch's JIT does not allow that; it's just
            | extra optimization for static PyTorch functions.
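            | 
            | For example (toy function, just to show lax.fori_loop and
            | lax.cond being traced under jit):
            | 
            |     import jax
            |     import jax.numpy as jnp
            |     from jax import lax
            | 
            |     @jax.jit
            |     def loop_and_branch(x):
            |         # a traced loop: sum the elements one at a time
            |         total = lax.fori_loop(
            |             0, x.shape[0],
            |             lambda i, acc: acc + x[i],
            |             jnp.float32(0.0))
            |         # a traced branch on a runtime value
            |         return lax.cond(total > 10.0,
            |                         lambda t: t - 10.0,
            |                         lambda t: t + 10.0,
            |                         total)
            | 
            |     print(loop_and_branch(jnp.arange(6.0)))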
        
             | hyperbovine wrote:
             | No autodiff for most of these though.
        
         | JHonaker wrote:
          | In the section titled "JAX":
         | 
         | > But JAX even lets you just-in-time compile your own Python
         | functions into XLA-optimized kernels...
        
       | dragandj wrote:
       | Let me chip in with some self-promotion.
       | 
        | This book explains and executes _every single line_ of code
        | interactively, from low-level operations to high-level networks
        | that do everything automatically. The code is built on the
        | state-of-the-art performance operations of oneDNN (Intel, CPU)
        | and cuDNN (CUDA, GPU). It is very concise, _readable and
        | understandable_ by humans.
       | 
       | https://aiprobook.com/deep-learning-for-programmers/
       | 
       | Here's the open source library built throughout the book:
       | 
       | https://github.com/uncomplicate/deep-diamond
       | 
       | Some chapters from the beginning of the book are available on my
       | blog, as a tutorial series:
       | 
       | https://dragan.rocks
        
       | cmarschner wrote:
        | Tensorflow 1.0 has its roots in how Theano was built. Same thing:
        | a statically built graph that is run through a compilation step,
        | with a numpy-like API. So what makes Theano such an ingenious
        | concept while TF is regarded as "programming through a keyhole"?
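        | 
        | For reference, the Theano-style workflow was roughly this (toy
        | example):
        | 
        |     import theano
        |     import theano.tensor as T
        | 
        |     x = T.dvector('x')           # symbolic input: build the graph
        |     y = (x ** 2).sum()           # numpy-like ops, nothing runs yet
        |     f = theano.function([x], y)  # compile the static graph
        |     print(f([1.0, 2.0, 3.0]))    # now it runs: 14.0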
        
       ___________________________________________________________________
       (page generated 2020-12-15 23:00 UTC)