[HN Gopher] Bolt: Faster matrix and vector operations that run o...
       ___________________________________________________________________
        
       Bolt: Faster matrix and vector operations that run on compressed
       data
        
       Author : febin
       Score  : 81 points
       Date   : 2022-06-18 18:01 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | raxxorraxor wrote:
        | This looks good. Why do the vectors have to be dense? Is it
        | just that the overhead eats the speed gain otherwise? I'm
        | asking whether it could be used universally for all operations
        | when I don't know the density in advance.
        
         | anderskaseorg wrote:
         | If your data is represented as sparse vectors, that sparse
         | representation is already compressed. You wouldn't want to
         | decompress it to a dense representation just to apply a
         | different, less effective, lossy compression algorithm. That
         | would be like taking a screenshot of a paragraph of text so you
         | can post a JPEG of it to Twitter.
         | 
         | Oh, wait.
        
       | jansan wrote:
        | This sounds and looks impressive, but this part struck me:
        | 
        | "If you ... and can tolerate lossy compression"
        | 
        | What does this mean? I wouldn't have thought that matrix
        | operations could be lossy. Does anybody know to what extent
        | they are lossy and where this would be acceptable?
        
         | epistasis wrote:
          | The matrix itself is the operation: it's a function from R^n
          | to R^m, and that function is what gets approximated by matrix
          | compression.
         | 
          | If you are familiar with PCA or SVD, you are already close
          | to understanding a basic form of compression. SVD breaks an
          | m x n matrix down into an m x m rotation matrix, an m x n
          | diagonal scaling matrix, and an n x n rotation (loadings)
          | matrix. The singular values are ordered so that the
          | directions describing the most variability come first. So if
          | you keep only the first, say, 5 of the n dimensions, you can
          | reconstruct an approximation of the original matrix from
          | roughly 5*(m+n) numbers instead of m*n, a small fraction of
          | the original storage.
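          | 
          | A rough sketch of that low-rank idea (plain numpy; the
          | shapes and the rank k = 5 are just for illustration):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     A = rng.standard_normal((1000, 200))  # an m x n matrix
          | 
          |     # thin SVD: U is m x n, s holds the n singular values,
          |     # Vt is n x n
          |     U, s, Vt = np.linalg.svd(A, full_matrices=False)
          | 
          |     k = 5  # keep only the top-k directions
          |     A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
          | 
          |     # relative error of the rank-k reconstruction; random
          |     # data compresses badly, structured data much better
          |     print(np.linalg.norm(A - A_k) / np.linalg.norm(A))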
         | 
          | PCA is often used in machine learning too, and there is a
          | deep connection between compression and learning that is
          | hard to make explicit or formalize.
         | 
         | The implementation linked here uses Vector Quantization
         | instead:
         | 
         | https://en.wikipedia.org/wiki/Vector_quantization
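          | 
          | A toy version of vector quantization (scipy's k-means is
          | used here just for illustration; it is not Bolt's actual
          | encoder): learn a small codebook, then store each vector as
          | the index of its nearest centroid.
          | 
          |     import numpy as np
          |     from scipy.cluster.vq import kmeans2, vq
          | 
          |     rng = np.random.default_rng(0)
          |     X = rng.standard_normal((10000, 32)).astype(np.float32)
          | 
          |     # 256 centroids, so each vector is stored as a single id
          |     codebook, _ = kmeans2(X, 256, minit="points", seed=0)
          |     codes, _ = vq(X, codebook)
          | 
          |     X_approx = codebook[codes]  # lossy reconstruction
          |     print(codes.shape, X_approx.shape)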
        
           | jansan wrote:
            | The furthest I have come was applying an SVD to a 2D
            | matrix to find the new axes of a skewed ellipse, but I
            | think I could follow most of it. Thanks.
        
         | Iv wrote:
          | In almost all practical uses of matrix multiplication, we
          | have rounding errors. For example, in 3D it is hard to
          | reverse a rotation exactly and get the exact initial
          | position back.
          | 
          | I don't know what amount of loss we are talking about here,
          | but in deep learning many operations don't require a crazy
          | level of precision, which has led to some lightweight float
          | formats (bfloat16 being the most common, but there are also
          | 8-bit floats for extreme cases).
          | 
          | If that's really a 10-100x speed increase at the cost of a
          | bit of accuracy, I am sure machine learning will love it.
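          | 
          | A quick illustration of that kind of loss (plain numpy,
          | with float16 standing in for a low-precision format, not
          | bfloat16 itself): rotate a vector and rotate it back, in
          | full and in reduced precision.
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     x = rng.standard_normal(512)
          | 
          |     # a random rotation matrix via QR decomposition
          |     Q, _ = np.linalg.qr(rng.standard_normal((512, 512)))
          | 
          |     err64 = np.max(np.abs(Q.T @ (Q @ x) - x))
          |     Qh, xh = Q.astype(np.float16), x.astype(np.float16)
          |     err16 = np.max(np.abs(Qh.T @ (Qh @ xh) - x))
          | 
          |     # same math; the low-precision round trip drifts much
          |     # further from the original vector
          |     print(err64, err16)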
        
           | [deleted]
        
       | nynx wrote:
       | Wow, this is fascinating. I wonder if hardware could be designed
       | to do this really efficiently.
        
         | skohan wrote:
          | It already is, right? A GPU is basically a purpose-built
          | linear algebra machine.
        
           | mandarax8 wrote:
           | From the abstract:
           | 
            | > (In the common case that one matrix is known ahead of
            | time,) our method also has the interesting property that
            | it requires zero multiply-adds. These results suggest that
            | a mixture of hashing, averaging, and byte shuffling---the
            | core operations of our method---could be a more promising
            | building block for machine learning than the sparsified,
            | factorized, and/or scalar quantized matrix products that
            | have recently been the focus of substantial research and
            | hardware investment.
           | 
           | This is not at all what modern gpus are optimized for.
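            | 
            | The rough shape of the trick (a product-quantization-style
            | sketch in numpy/scipy, not the actual MADDNESS hashing):
            | split each query vector into subspaces, snap each piece to
            | one of a few prototypes, and precompute the dot products
            | of each prototype with the known matrix's columns. The
            | "matmul" then becomes table lookups and sums.
            | 
            |     import numpy as np
            |     from scipy.cluster.vq import kmeans2, vq
            | 
            |     rng = np.random.default_rng(0)
            |     A = rng.standard_normal((5000, 64))  # queries
            |     # the matrix known ahead of time
            |     B = rng.standard_normal((64, 10))
            | 
            |     C, K = 8, 16  # 8 subspaces, 16 prototypes each
            |     d = A.shape[1] // C
            |     codes, luts = [], []
            |     for c in range(C):
            |         sub = A[:, c*d:(c+1)*d]
            |         # subspace prototypes + a code per row
            |         P, _ = kmeans2(sub, K, minit="points", seed=0)
            |         code, _ = vq(sub, P)
            |         codes.append(code)
            |         luts.append(P @ B[c*d:(c+1)*d])  # K x 10
            | 
            |     # approximate A @ B using only lookups and additions
            |     approx = sum(luts[c][codes[c]] for c in range(C))
            |     print(np.abs(approx - A @ B).mean())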
        
       | bee_rider wrote:
        | I guess the naive approach, if we wanted to do a quick lossy
        | matrix multiply, would be to take the truncated SVD and use
        | that. How does this library compare to the boring strategy, I
        | wonder?
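        | 
        | For concreteness, the boring strategy looks roughly like this
        | (a numpy sketch; the shapes and rank k = 64 are placeholders).
        | The one-time SVD is expensive, but every later product only
        | touches the k kept directions.
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     B = rng.standard_normal((2048, 1024))
        |     x = rng.standard_normal(1024)
        | 
        |     k = 64  # rank kept; pick it to match your error budget
        |     U, s, Vt = np.linalg.svd(B, full_matrices=False)
        |     Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
        | 
        |     # ~2*k*(m+n) flops per product instead of ~2*m*n
        |     y_approx = Uk @ (sk * (Vtk @ x))
        |     y_exact = B @ x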
        
       | Iv wrote:
       | > If you have a large collection of mostly-dense vectors and can
       | tolerate lossy compression, Bolt can probably save you 10-200x
       | space and compute time.
       | 
       | Space. It can save space.
       | 
        | The main limitation of fast ML models nowadays is how many
        | parameters you can fit in your GPU memory, and those are
        | usually matrices.
       | 
       | 200x would allow me to run GPT-3 on my old GTX 1050.
       | 
       | Frameworks, please implement this NOW!
        
       | cgreerrun wrote:
       | Maddness is their more recent work and yields 100x speedups:
       | https://arxiv.org/pdf/2106.10860.pdf
       | 
       | The code for Maddness is in the same github repo if you search
       | for "Mithral".
       | 
       | SIMD instructions can work wonders in the right context.
        
         | skohan wrote:
         | It's incredible that there's actually this much room to
         | improve. How does this compare to GPU implementations?
         | 
          | Also, it looks like the optimization comes from running
          | operations on a compressed representation. For the 10x vs
          | 100x speedup, is there a tradeoff between speed and
          | accuracy, or is that extra order of magnitude just from
          | bringing SIMD into the picture?
        
           | jacobolus wrote:
           | From what I can tell, this is a machine learning based
           | approximation to matrix multiplication by a particular matrix
           | (which it was trained on). It trades accuracy for speed. If
           | you need to multiply many (many!) vectors by a static matrix
           | and you have loose enough error tolerance, this can provide
           | up to 100x speedup.
        
       | Iv wrote:
       | This is actually from a paper published last year:
       | 
       | https://www.reddit.com/r/MachineLearning/comments/pffoo8/r_m...
       | 
       | A few questions:
       | 
        | - Do some ML frameworks implement it already?
        | 
        | - It promises up to 200x compression; is it reasonable to
        | expect it to allow us to run GPT-3 on smaller mainstream GPUs?
        
       ___________________________________________________________________
       (page generated 2022-06-18 23:00 UTC)