[HN Gopher] Will Floating Point 8 Solve AI/ML Overhead?
       ___________________________________________________________________
        
       Will Floating Point 8 Solve AI/ML Overhead?
        
       Author : rbanffy
       Score  : 40 points
       Date   : 2023-01-15 21:51 UTC (1 days ago)
        
 (HTM) web link (semiengineering.com)
 (TXT) w3m dump (semiengineering.com)
        
       | jasonjmcghee wrote:
        | I am curious if folks have tried hybrid methods using ensembles.
       | 
       | Train the main model using FP8 (or other quantized approaches)
       | and have a small calibrating model at FP32 that is trained
       | afterward.
        
       | thriftwy wrote:
        | Why go so extreme when you can have FP12? Perhaps a 4-bit
        | exponent in the high bits and a signed int8 mantissa in the low
        | bits.
        | 
        | Or vice versa: a 7-bit exponent, a sign bit, and a 4-bit
        | mantissa.
        
         | jasonjmcghee wrote:
          | I think the general idea is to make use of SIMD, and register
          | widths are generally powers of two. So if you're trying to pack
          | as many numbers as possible into, say, 64 bits, FP8 gets you 8
          | lanes and FP16 gets you 4. FP12 would get you 5 lanes with 4
          | bits left unused, which would be a huge amount of extra work to
          | implement for roughly a 25% gain in throughput over FP16.
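          | 
          | For illustration, the lane arithmetic as a quick Python check:
          | 
          |     WORD_BITS = 64
          |     for lane_bits in (8, 12, 16):
          |         lanes = WORD_BITS // lane_bits          # full lanes that fit
          |         wasted = WORD_BITS - lanes * lane_bits  # leftover bits
          |         print(f"FP{lane_bits}: {lanes} lanes, {wasted} bits unused")
          |     # FP8:  8 lanes, 0 bits unused
          |     # FP12: 5 lanes, 4 bits unused
          |     # FP16: 4 lanes, 0 bits unused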
        
       | sj4nz wrote:
        | Could be that 8-bit posits are enough. Has that been done at
        | scale? I do not know.
        
         | bigbillheck wrote:
         | Posits aren't the answer to any question worth asking.
        
           | meltyness wrote:
           | Why not?
        
             | SideQuark wrote:
             | Original posits are variable width, making them nearly
             | useless for high performance parallel computations. Later
             | versions don't add anything of use for low precision neural
              | networks, and the lack of hardware support anywhere makes
              | them too slow for anything other than toying around.
             | 
             | See also
             | http://people.eecs.berkeley.edu/~wkahan/UnumSORN.pdf and
             | https://www.youtube.com/watch?v=LZAeZBVAzVw
        
               | moloch-hai wrote:
               | > lack of hardware support
               | 
                | That seems fixable. Don't people make chips packed with
                | whatever unit they need a lot of? A chip with an array of
                | 8-bit posit PUs could process a hell of a lot in
                | parallel, subject only to getting the arguments and
                | results to useful places.
        
       | ansk wrote:
       | Will Doubling Disk Size Solve Storage?
        
         | wolfram74 wrote:
         | I feel like the fastest tier of ram can only get so big before
         | speed of light delays become relevant.
        
         | visarga wrote:
          | The practical question of interest is: will this make it
          | possible to run GPT-3-size models on a normal desktop with a
          | GPU, like Stable Diffusion?
        
           | SoftTalker wrote:
           | At some point the practical question would be how do you get
           | all the data onto the desktop.
        
             | Dylan16807 wrote:
             | People are already downloading 100GB games. And data rates
             | are growing much faster than RAM capacities. The logistics
             | of downloading a model smaller than GPU RAM are unlikely to
             | ever get complicated.
        
       | eklitzke wrote:
        | Kind of a weird article, as 8-bit quantization has been widely
        | used in production for neural networks for a number of years
        | now. The title is a bit misleading since it's widely known that
        | 8-bit quantization works and is extremely effective at improving
        | inference throughput and latency. I'm not 100% sure I'm reading
        | the article correctly since it's a bit oblique, but it seems
        | like the news here is that work is being done to formally
        | specify a cross-vendor FP8 standard, as what exists right now is
        | a set of de facto standards from different vendors.
        
         | FL33TW00D wrote:
         | INT8 quantisation has been used in production for years. FP8
         | has not.
        
           | Dylan16807 wrote:
            | FP8 provides some nice accuracy benefits over INT8, but
            | swapping one for the other doesn't change your overhead.
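            | 
            | For illustration, a rough decoder assuming the common E4M3
            | layout (1 sign, 4 exponent, 3 mantissa bits, bias 7; the
            | reserved NaN encoding is ignored):
            | 
            |     def e4m3_to_float(byte):
            |         s = (byte >> 7) & 1
            |         e = (byte >> 3) & 0xF
            |         m = byte & 0x7
            |         if e == 0:                      # subnormal, no implicit 1
            |             val = (m / 8) * 2.0 ** -6
            |         else:                           # normal, implicit leading 1
            |             val = (1 + m / 8) * 2.0 ** (e - 7)
            |         return -val if s else val
            | 
            | Steps between representable values are about 2^-9 near zero
            | but 32 in the top binade, whereas INT8 steps are uniform (1
            | times whatever scale you pick).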
        
         | ftufek wrote:
          | The article mentions 8-bit quantization, but I believe this is
          | about training with FP8 as the native format. The latest GPUs
          | provide huge FLOPS for it; Tim Dettmers updated his GPU article
          | and talks about this, claiming 0.66 PFLOPS for an RTX 4090.
        
       | make3 wrote:
        | The title is nonsensical. The faster compute or inference gets
        | (through, e.g., precision), the larger the models people will
        | train, because accuracy / output quality increases indefinitely
        | with model size, and everyone knows this. So a different
        | precision will not "solve the AI/ML overhead"; that's nonsense.
        | People will just use as large a model as they can for their
        | latency budget at inference and their $ budget at training,
        | whatever the precision is.
        
       | gumby wrote:
        | Really, for me just the mantissa would be fine; no need for an
        | exponent because so much of what I worked on is between 0 and 1.
        | 
        | There was an interesting paper from the Allen Institute a few
        | years ago describing a system with 1-bit weights that worked
        | pretty well! Since I read it I've been musing about trying that,
        | though it seems unlikely I'll be able to any time soon.
        
         | [deleted]
        
         | thfuran wrote:
         | If you just have a mantissa, aren't you doing fixed point math?
        
           | gumby wrote:
           | Yes, just looking for a weight in the range 0 <= x < 1. But I
           | want to do large numbers of calculations using the GPU, else
           | I'd use the SIMD int instructions (AVX)
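            | 
            | For illustration, one way to do that: store a uint8 and treat
            | it as a fraction of 256, so the weight lands in [0, 1). A
            | quick numpy sketch:
            | 
            |     import numpy as np
            |     w_q = np.array([0, 64, 128, 255], dtype=np.uint8)  # stored weights
            |     w = w_q.astype(np.float32) / 256.0                 # decoded values
            |     # -> [0.0, 0.25, 0.5, 0.99609375]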
        
         | snickerbockers wrote:
         | Just do fixed point bruh.
        
           | gumby wrote:
            | It is, but it doesn't give me the hardware affordance I want:
            | https://news.ycombinator.com/item?id=34405604
        
       | voz_ wrote:
       | " High on the ML punch list is how to run models more efficiently
       | using less power, especially in critical applications like self-
       | driving vehicles where latency becomes a matter of life or
       | death."
       | 
       | Never ever heard of inference latency being a bottleneck here...
        
       | amelius wrote:
       | > People who follow a strict neuromorphic interpretation have
       | even discussed binary neural networks, in which the input
       | functions like an axon spike, just 0 or 1.
       | 
       | How do you perform differentiation with this datatype?
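        | 
        | One common answer in the binary-network literature is a
        | straight-through estimator: binarize on the forward pass, but let
        | gradients flow as if the op were the identity. A minimal
        | PyTorch-style sketch (illustrative, not from the article):
        | 
        |     import torch
        | 
        |     def spike(x):
        |         b = (x > 0).float()           # forward: hard 0/1 "axon spike"
        |         return x + (b - x).detach()   # backward: identity gradient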
        
       | _0ffh wrote:
       | The article, comparing single and double precision:
       | 
       | >the mantissa jumps from 32 bits to 52 bits
       | 
        | Rather from 23 (+1 for the implicit MSB) to 52 (+1), I suppose.
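        | 
        | For reference, the standard IEEE 754 layouts:
        | 
        |     FP32: 1 sign + 8 exponent + 23 stored mantissa bits (+ implicit 1)
        |     FP64: 1 sign + 11 exponent + 52 stored mantissa bits (+ implicit 1)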
        
       | amelius wrote:
       | In the old days of CS, people were talking about optimizations in
       | the big-O sense.
       | 
       | Nowadays the talk is mostly about optimization of constant
       | factors, so it seems.
        
       | kortex wrote:
       | Related:
       | 
       | https://ai.facebook.com/blog/making-floating-point-math-high...
       | 
        | Which is Meta's 8-bit data type, originally called (8, 1, alpha,
        | beta, gamma). I think they realized that's a terrible name, so
        | now they're calling it Deepfloat or something.
        
       | [deleted]
        
       | fswd wrote:
        | For LLMs, INT8 is old news but still exciting. FP8 would
        | definitely be an improvement. However, the new coolness is INT4.
       | 
       | > Excitingly, we manage to reach the INT4 weight quantization for
       | GLM-130B while existing successes have thus far only come to the
       | INT8 level. Memory-wise, by comparing to INT8, the INT4 version
       | helps additionally save half of the required GPU memory to 70GB,
       | thus allowing GLM130B inference on 4 x RTX 3090 Ti (24G) or 8 x
       | RTX 2080 Ti (11G). Performance-wise, Table 2 left indicates that
       | without post-training at all, the INT4-version GLM-130B
       | experiences almost no performance degradation, thus maintaining
       | the advantages over GPT-3 on common benchmarks.
       | 
       | Page 7 https://arxiv.org/pdf/2210.02414.pdf
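        | 
        | Rough arithmetic (not from the paper) for why INT4 lands near
        | the quoted 70GB:
        | 
        |     params = 130e9
        |     print(params * 1.0 / 1e9)  # INT8: ~130 GB of weights
        |     print(params * 0.5 / 1e9)  # INT4: ~65 GB, close to the quoted 70 GB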
        
         | cypress66 wrote:
         | Hopper seems to drop int4 support so maybe it's old news now?
         | 
         | https://en.m.wikipedia.org/wiki/Hopper_(microarchitecture)
        
         | dragontamer wrote:
         | At this rate, we're going to end up with FP1 (1-bit floating
         | point) numbers...
         | 
          | I guess that's nonsensical. One bit in any FP format is the
          | sign bit, so I guess the minimum size is 2-bit FP (1-bit sign
          | + 1-bit exponent + 0-bit mantissa with an implicit leading 1).
        
           | kortex wrote:
           | At FP2, you are probably better off with {-1, 0, 1, NaN}
           | (sign+mantissa) rather than sign/exponent. You basically bit
           | pack.
           | 
           | FP3 gives you sign, 1x "exponent", 1x mantissa, so still
           | kinda bit packing.
           | 
           | I could see FP4 with sign, 1x exponent, 2x mantissa. Exponent
           | would really just be a 4x multiplier, giving
           | +/-0,1,2,3,4,8,12
           | 
           | Or invert all those, so you are expressing common fractions
           | on 0..1
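            | 
            | A quick enumeration of that FP4 idea (illustrative, not a
            | standard format):
            | 
            |     # 1 sign bit, 1 "exponent" bit acting as a x4 multiplier,
            |     # 2 mantissa bits
            |     vals = sorted({s * (4 ** e) * m
            |                    for s in (1, -1)
            |                    for e in (0, 1)
            |                    for m in (0, 1, 2, 3)})
            |     print(vals)  # [-12, -8, -4, -3, -2, -1, 0, 1, 2, 3, 4, 8, 12]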
        
             | dragontamer wrote:
              | Real life has the E3 series: 1, 2.2, 4.7, and then 10, 22,
              | 47, 100, 220, 470, 1000, and so on.
              | 
              | EEs would recognize these as the preferred resistor values
              | (though the E6 series is more commonly used in projects,
              | the E3 and E1 values are the preferred ones).
              | 
              | That's 3 values per decade, which is slightly more
              | dispersed than an FP4 consisting of 1 sign bit + 3 exponent
              | bits + 0 mantissa bits (implicit leading 1).
             | 
             | Or the values -128, -64, -32, -16, -8, -4, -2, -1, 1, 2,
             | ... 128.
             | 
              | Maybe we can take -128 and call that zero instead, because
              | zero is useful.
             | 
             | --------
             | 
             | Given how even E3 is still useful in real world electrical
             | engineering problems, I'm more inclined to allocate more
             | bits to the exponent than the mantissa.
        
               | ben_w wrote:
               | > Real life has the E3 series: 1, 2.2, 4.7, and then 10,
               | 22, 47, 100, 220, 470, 1000, etc. Etc.
               | 
               | Took me until today to realise that sequence is a rounded
               | version of 10^(n/3) for integer n.
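                | 
                | Quick check:
                | 
                |     >>> [round(10 ** (n / 3), 2) for n in range(4)]
                |     [1.0, 2.15, 4.64, 10.0]
                | 
                | which the E3 series rounds (a bit loosely in the 4.64 ->
                | 4.7 case) to 1, 2.2, 4.7, 10.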
        
             | Dylan16807 wrote:
             | If you're going to bother doing floats you should probably
             | make them balanced around 1.
             | 
             | And exponent seems to be much more important for these
             | small sizes. The first paper that shows up for FP4 almost
             | has negative mantissa bits. Their encoding has 0, 1/64,
             | 1/16, 1/4, 1, 4, 16, 64.
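              | 
              | Those values are 4^n for n = -3..3, plus zero; a quick
              | check:
              | 
              |     print([4.0 ** n for n in range(-3, 4)])
              |     # [0.015625, 0.0625, 0.25, 1.0, 4.0, 16.0, 64.0]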
        
           | varispeed wrote:
            | That's when we get into asymmetrical number coding, so you
            | could use numbers that take a fraction of a bit.
        
           | dimatura wrote:
           | Binary neural networks, where weights and/or activations are
           | just 0/1s, are an active research area. In theory they could
           | be implemented very efficiently in hardware. But in contrast
            | to FP16 (or, to some extent, int8), just quantizing FP32
            | down to 1 bit doesn't work very well, although there have
            | been successful methods in practice. There was a company
            | called Xnor.ai that was built partially around this
            | technology, but it was sold to Apple a couple of years ago.
            | I don't know what the current SOTA in this area is, though.
        
           | SideQuark wrote:
           | 1 bit would work fine - make the values represent +-1 or so
        
           | visarga wrote:
           | I think I read somewhere it only goes as low as int4. Can't
           | find the reference.
        
       ___________________________________________________________________
       (page generated 2023-01-16 23:00 UTC)