[HN Gopher] Bolt: Faster matrix and vector operations that run o...
___________________________________________________________________

Bolt: Faster matrix and vector operations that run on compressed
data

Author : febin
Score  : 81 points
Date   : 2022-06-18 18:01 UTC (4 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| raxxorraxor wrote:
| This looks good. Why do the vectors have to be dense? Is it just
| that the overhead outweighs the speed gain otherwise? I'm asking
| whether you could use it universally, for all operations, when
| the density isn't known in advance.
|
| anderskaseorg wrote:
| If your data is represented as sparse vectors, that sparse
| representation is already compressed. You wouldn't want to
| decompress it to a dense representation just to apply a
| different, less effective, lossy compression algorithm. That
| would be like taking a screenshot of a paragraph of text so you
| can post a JPEG of it to Twitter.
|
| Oh, wait.
|
| jansan wrote:
| This sounds and looks impressive, but this part struck me:
|
| "If you ... and can tolerate lossy compression"
|
| What does this mean? I wouldn't have thought that matrix
| operations can be lossy. Does anybody know to what extent they
| are lossy and where this would be acceptable?
|
| epistasis wrote:
| The matrix itself is the operation: it's a function from R^n to
| R^m, and that function is what matrix compression approximates.
|
| If you are familiar with PCA or SVD, you are already close to
| understanding a basic form of compression. SVD breaks an m x n
| matrix into an m x n orthonormal "loadings" matrix, an n x n
| diagonal scaling matrix, and an n x n rotation matrix. The
| factors are usually ordered so that the directions explaining
| the most variability come first. So if you keep only the first,
| say, 5 of the n dimensions, you can reconstruct an
| approximation of the original matrix that uses approximately
| 5/n of the original storage.
|
| PCA is often used in machine learning too, and there is a deep
| connection between learning and compression that is hard to
| make explicit or formalize.
|
| The implementation linked here uses Vector Quantization
| instead:
|
| https://en.wikipedia.org/wiki/Vector_quantization
|
| jansan wrote:
| The furthest I have come is applying an SVD to a 2D matrix to
| find the new axes of a skewed ellipse, but I think I could
| follow most of it. Thanks.
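
A minimal NumPy sketch of the truncated-SVD idea described in the
comments above. The matrix sizes and the rank k are arbitrary
stand-ins; this illustrates the generic lossy-compression tradeoff,
not Bolt's actual algorithm.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 200))  # the m x n matrix to compress
    k = 5                                 # keep only the top-k directions

    # Thin SVD: A = U @ diag(s) @ Vt, singular values sorted
    # largest-first, so truncating keeps the most "variability".
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    # Approximate multiply: A @ x ~ U_k @ (s_k * (Vt_k @ x)).
    # Cost per vector drops from O(m*n) to O(k*(m+n)).
    x = rng.standard_normal(200)
    approx = U_k @ (s_k * (Vt_k @ x))
    exact = A @ x
    print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))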
| Iv wrote:
| In almost all practical uses of matrix multiplication, we have
| rounding errors. For example, in 3D it is hard to exactly
| reverse a rotation and get the initial position back.
|
| I don't know what level of loss we are talking about, but in
| deep learning many operations don't require a crazy level of
| precision, which has led to some lightweight float formats
| (bfloat16 being the most common, but there are also 8-bit
| floats for extreme cases).
|
| If that's really a 10-100x speed increase at the cost of a bit
| of loss, I am sure machine learning will love it.
|
| [deleted]
|
| nynx wrote:
| Wow, this is fascinating. I wonder if hardware could be designed
| to do this really efficiently.
|
| skohan wrote:
| It already is, right? A GPU is basically a purpose-built linear
| algebra machine.
|
| mandarax8 wrote:
| From the abstract:
|
| > (In the common case that one matrix is known ahead of time,)
| our method also has the interesting property that it requires
| zero multiply-adds. These results suggest that a mixture of
| hashing, averaging, and byte shuffling---the core operations of
| our method---could be a more promising building block for
| machine learning than the sparsified, factorized, and/or scalar
| quantized matrix products that have recently been the focus of
| substantial research and hardware investment.
|
| This is not at all what modern GPUs are optimized for. (A toy
| sketch of the lookup-based approach appears at the end of this
| page.)
|
| bee_rider wrote:
| I guess the naive approach, if we wanted to do a quick lossy
| matrix multiply, would be to take the truncated SVD and use
| that. How does this library compare to the boring strategy, I
| wonder?
|
| Iv wrote:
| > If you have a large collection of mostly-dense vectors and can
| tolerate lossy compression, Bolt can probably save you 10-200x
| space and compute time.
|
| Space. It can save space.
|
| The main limitation of fast ML models nowadays is how many
| parameters you can fit in your GPU's memory, and those are
| usually matrices.
|
| 200x would allow me to run GPT-3 on my old GTX 1050.
|
| Frameworks, please implement this NOW!
|
| cgreerrun wrote:
| Maddness is their more recent work and yields 100x speedups:
| https://arxiv.org/pdf/2106.10860.pdf
|
| The code for Maddness is in the same GitHub repo if you search
| for "Mithral".
|
| SIMD instructions can work wonders in the right context.
|
| skohan wrote:
| It's incredible that there's actually this much room for
| improvement. How does this compare to GPU implementations?
|
| Also, it looks like the optimization comes from running
| operations on a compressed representation. For the 10x vs 100x
| speedup, is there a tradeoff between speed and accuracy, or is
| that extra order of magnitude just from bringing SIMD into the
| picture?
|
| jacobolus wrote:
| From what I can tell, this is a machine-learning-based
| approximation to multiplication by a particular matrix (the one
| it was trained on). It trades accuracy for speed. If you need
| to multiply many (many!) vectors by a static matrix and you
| have a loose enough error tolerance, this can provide up to a
| 100x speedup.
|
| Iv wrote:
| This is actually from a paper published last year:
|
| https://www.reddit.com/r/MachineLearning/comments/pffoo8/r_m...
|
| A few questions:
|
| - Do some ML frameworks implement it already?
| - It promises up to 200x compression; is it reasonable to expect
|   it to allow us to run GPT-3 on smaller mainstream GPUs?
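
To make the "hashing, averaging, and byte shuffling" quote above
concrete, here is a toy product-quantization sketch in NumPy. All
sizes and names are invented for the example, and prototypes are
sampled rather than learned with k-means; this shows the general
encode-then-look-up idea, not Bolt's actual algorithm. After the
one-time table construction, each approximate dot product is just
M table lookups and additions, with no multiplies.

    import numpy as np

    rng = np.random.default_rng(1)
    D, M, K = 64, 8, 16   # vector dim, subspaces, prototypes per subspace
    d = D // M            # dimension of each subspace

    X = rng.standard_normal((10_000, D))  # "database" known ahead of time

    # Codebook per subspace: K prototypes sampled from the data.
    # (Real implementations learn these, e.g. with k-means.)
    codebooks = np.stack(
        [X[rng.choice(len(X), K), i * d:(i + 1) * d] for i in range(M)])

    # Encode: each vector becomes M small integer codes (the lossy part).
    def encode(X):
        codes = np.empty((len(X), M), dtype=np.uint8)
        for i in range(M):
            sub = X[:, i * d:(i + 1) * d]
            dists = ((sub[:, None, :] - codebooks[i]) ** 2).sum(-1)
            codes[:, i] = dists.argmin(1)  # "hash" to nearest prototype
        return codes

    codes = encode(X)  # 8 bytes per vector instead of 64 floats

    # Query: precompute M lookup tables of <query, prototype> products;
    # then every approximate dot product is M lookups plus a sum.
    q = rng.standard_normal(D)
    tables = np.stack(
        [codebooks[i] @ q[i * d:(i + 1) * d] for i in range(M)])  # (M, K)
    approx = tables[np.arange(M), codes].sum(1)  # ~ X @ q
    print(np.corrcoef(approx, X @ q)[0, 1])      # correlation with exact

___________________________________________________________________
(page generated 2022-06-18 23:00 UTC)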