[HN Gopher] Will Floating Point 8 Solve AI/ML Overhead?
___________________________________________________________________
 
Will Floating Point 8 Solve AI/ML Overhead?
 
Author : rbanffy
Score  : 40 points
Date   : 2023-01-15 21:51 UTC (1 days ago)
 
(HTM) web link (semiengineering.com)
(TXT) w3m dump (semiengineering.com)
 
| jasonjmcghee wrote:
| I'm curious whether folks have tried hybrid methods using
| ensembles.
| 
| Train the main model using FP8 (or other quantized approaches)
| and have a small calibrating model at FP32 that is trained
| afterward.
| thriftwy wrote:
| Why go so extreme when you can have FP12? Perhaps have a 4-bit
| exponent in the high bits and a signed int8 mantissa in the low
| bits.
| 
| Or vice versa: a 7-bit exponent, a sign, and a 4-bit mantissa.
| jasonjmcghee wrote:
| I think the general idea is to make use of SIMD, and register
| widths are generally powers of two. So if you're trying to
| multiply as many numbers as possible in, say, 64 bits, FP8 gets
| you eight 8-bit values and FP16 gets you four 16-bit values.
| FP12 would get you five, with some unused space, which would be
| a huge amount of extra work to implement for a 20% gain in
| efficiency.
| sj4nz wrote:
| It could be that 8-bit posits are enough. Has that been done at
| scale? I do not know.
| bigbillheck wrote:
| Posits aren't the answer to any question worth asking.
| meltyness wrote:
| Why not?
| SideQuark wrote:
| Original posits are variable width, making them nearly useless
| for high-performance parallel computation. Later versions don't
| add anything of use for low-precision neural networks, and the
| lack of hardware support anywhere makes them too slow for
| anything other than toying around.
| 
| See also
| http://people.eecs.berkeley.edu/~wkahan/UnumSORN.pdf and
| https://www.youtube.com/watch?v=LZAeZBVAzVw
| moloch-hai wrote:
| > lack of hardware support
| 
| That seems fixable. Don't people make chips that do a lot of
| what you want? A chip with an array of 8-bit posit PUs could
| process a hell of a lot in parallel, subject only to getting
| the arguments and results to useful places.
| ansk wrote:
| Will Doubling Disk Size Solve Storage?
| wolfram74 wrote:
| I feel like the fastest tier of RAM can only get so big before
| speed-of-light delays become relevant.
| visarga wrote:
| The practical question of interest is: will this make it
| possible to run GPT-3-size models on normal desktops with a
| GPU, like Stable Diffusion?
| SoftTalker wrote:
| At some point the practical question would be how you get all
| the data onto the desktop.
| Dylan16807 wrote:
| People are already downloading 100GB games, and data rates are
| growing much faster than RAM capacities. The logistics of
| downloading a model smaller than GPU RAM are unlikely to ever
| get complicated.
| eklitzke wrote:
| Kind of a weird article, as 8-bit quantization has been widely
| used in production for neural networks for a number of years
| now. The title of the article is a bit misleading, since it's
| widely known that 8-bit quantization does work and is extremely
| effective at improving inference throughput and latency. I'm
| not 100% sure I'm reading the article correctly since it's a
| bit oblique, but it seems the news here is that work is being
| done to formally specify a cross-vendor FP8 standard, as what
| exists right now is a set of de facto standards from different
| vendors.
| FL33TW00D wrote:
| INT8 quantisation has been used in production for years. FP8
| has not.
| Dylan16807 wrote:
| FP8 provides some nice accuracy benefits over INT8, but if you
| swap it out, that doesn't affect your overhead.
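 
(To make the INT8-vs-FP8 distinction in this subthread concrete,
here is a minimal Python sketch, not taken from the article or the
comments, that rounds the same weights onto a symmetric INT8 grid
and onto an E4M3-style FP8 grid. The format parameters, 4 exponent
bits, 3 mantissa bits, bias 7, no infinities, maximum normal 448,
follow the common E4M3 convention and are assumptions here, as are
the helper names.)
 
    import numpy as np
 
    def e4m3_grid():
        # Non-negative values representable in an E4M3-style FP8 format:
        # 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits,
        # no infinities, and the all-ones code reserved for NaN
        # (so the maximum normal value is 448).
        vals = [0.0]
        for m in range(1, 8):                  # subnormals (exponent field 0)
            vals.append((m / 8) * 2.0 ** -6)
        for e in range(1, 16):                 # normals
            for m in range(8):
                if e == 15 and m == 7:
                    continue                   # NaN encoding
                vals.append((1 + m / 8) * 2.0 ** (e - 7))
        return np.array(vals)
 
    def round_to_fp8(x, grid=e4m3_grid()):
        # Round each element to the nearest representable magnitude,
        # keeping the sign; values above the max normal saturate.
        mag = np.minimum(np.abs(x), grid[-1])
        idx = np.abs(mag[:, None] - grid[None, :]).argmin(axis=1)
        return np.sign(x) * grid[idx]
 
    def round_to_int8(x):
        # Symmetric per-tensor INT8 quantization, dequantized back to
        # float so the two rounding errors can be compared directly.
        scale = np.abs(x).max() / 127.0
        return np.clip(np.round(x / scale), -127, 127) * scale
 
    w = np.random.default_rng(0).normal(0.0, 0.05, 4096)
    print("E4M3 mean abs error:", np.abs(round_to_fp8(w) - w).mean())
    print("INT8 mean abs error:", np.abs(round_to_int8(w) - w).mean())
 
(Because the FP8 grid spaces its values logarithmically, it tends to
give lower relative error on small weights, while the uniform INT8
grid spends most of its codes near the tensor's maximum, which is
roughly the accuracy trade-off alluded to above.)
 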
| ftufek wrote:
| The article mentions 8-bit quantization, but I believe this is
| about training with FP8 as the native format. The latest GPUs
| provide huge FLOPS for that. Tim Dettmers updated his GPU
| article and talks about this; the claim is 0.66 PFLOPS for an
| RTX 4090.
| make3 wrote:
| The title is nonsensical. The faster the compute is, or the
| faster inference is (through, e.g., lower precision), the larger
| the models people will train, because accuracy / output quality
| increases indefinitely with model size, and everyone knows this.
| So a different precision will not "Solve the AI/ML Overhead";
| that's nonsense. People will just use as large a model as they
| can for their latency budget at inference and for their $ budget
| at training, whatever it is.
| gumby wrote:
| Really, for me just the mantissa would be fine; no need for an
| exponent because so much of what I work on is between 0 and 1.
| 
| There was an interesting paper from the Allen Institute a few
| years ago describing a system with 1-bit weights that worked
| pretty well! Since reading it I've been musing about trying
| that, though it seems unlikely I'll be able to any time soon.
| [deleted]
| thfuran wrote:
| If you just have a mantissa, aren't you doing fixed-point math?
| gumby wrote:
| Yes, just looking for a weight in the range 0 <= x < 1. But I
| want to do large numbers of calculations using the GPU;
| otherwise I'd use the SIMD integer instructions (AVX).
| snickerbockers wrote:
| Just do fixed point bruh.
| gumby wrote:
| It is, but it doesn't give me the hardware affordance I want:
| https://news.ycombinator.com/item?id=34405604
| voz_ wrote:
| "High on the ML punch list is how to run models more efficiently
| using less power, especially in critical applications like
| self-driving vehicles where latency becomes a matter of life or
| death."
| 
| Never ever heard of inference latency being a bottleneck here...
| amelius wrote:
| > People who follow a strict neuromorphic interpretation have
| even discussed binary neural networks, in which the input
| functions like an axon spike, just 0 or 1.
| 
| How do you perform differentiation with this data type?
| _0ffh wrote:
| The article, comparing single and double precision:
| 
| > the mantissa jumps from 32 bits to 52 bits
| 
| Rather from 23 (+1 for the implicit MSB) to 52 (+1), I suppose.
| amelius wrote:
| In the old days of CS, people talked about optimization in the
| big-O sense.
| 
| Nowadays the talk is mostly about optimizing constant factors,
| so it seems.
| kortex wrote:
| Related:
| 
| https://ai.facebook.com/blog/making-floating-point-math-high...
| 
| This is Meta's 8-bit data type, originally called (8, 1, alpha,
| beta, gamma). I think they realized that's a terrible name, so I
| think they're calling it Deepfloat or something now.
| [deleted]
| fswd wrote:
| For LLMs, INT8 is old news but still exciting. FP8 would
| definitely be an improvement. However, the new coolness is INT4.
| 
| > Excitingly, we manage to reach the INT4 weight quantization
| for GLM-130B while existing successes have thus far only come to
| the INT8 level. Memory-wise, by comparing to INT8, the INT4
| version helps additionally save half of the required GPU memory
| to 70GB, thus allowing GLM130B inference on 4 x RTX 3090 Ti
| (24G) or 8 x RTX 2080 Ti (11G). Performance-wise, Table 2 left
| indicates that without post-training at all, the INT4-version
| GLM-130B experiences almost no performance degradation, thus
| maintaining the advantages over GPT-3 on common benchmarks.
| 
| Page 7 https://arxiv.org/pdf/2210.02414.pdf
| cypress66 wrote:
| Hopper seems to drop INT4 support, so maybe it's old news now?
| 
| https://en.m.wikipedia.org/wiki/Hopper_(microarchitecture)
| dragontamer wrote:
| At this rate, we're going to end up with FP1 (1-bit floating
| point) numbers...
| 
| I guess that's nonsensical. One bit in any FP format is the sign
| bit, so I guess the minimum size is 2-bit FP (1-bit sign + 1-bit
| exponent + 0-bit mantissa with an implicit 1).
| kortex wrote:
| At FP2, you are probably better off with {-1, 0, 1, NaN}
| (sign + mantissa) rather than sign/exponent. You basically bit
| pack.
| 
| FP3 gives you a sign, one "exponent" bit, and one mantissa bit,
| so it's still kinda bit packing.
| 
| I could see FP4 with a sign, one exponent bit, and two mantissa
| bits. The exponent would really just be a 4x multiplier, giving
| +/- 0, 1, 2, 3, 4, 8, 12.
| Or invert all those, so you are expressing common fractions on
| 0..1.
| dragontamer wrote:
| Real life has the E3 series: 1, 2.2, 4.7, and then 10, 22, 47,
| 100, 220, 470, 1000, etc.
| 
| EEs would recognize these as the preferred resistor values for
| projects (though the E6 series is more commonly used in
| projects, the E3 and E1 values are preferred).
| 
| That's 3 values per decade, which is slightly more dispersed
| than an FP4 consisting of 1 sign bit + 3 exponent bits + 0
| mantissa bits (implicit mantissa 1).
| 
| Or the values -128, -64, -32, -16, -8, -4, -2, -1, 1, 2, ...
| 128.
| 
| Maybe we can take -128 and call that zero instead, because zero
| is useful.
| 
| --------
| 
| Given how even E3 is still useful in real-world electrical
| engineering problems, I'm more inclined to allocate more bits to
| the exponent than the mantissa.
| ben_w wrote:
| > Real life has the E3 series: 1, 2.2, 4.7, and then 10, 22, 47,
| 100, 220, 470, 1000, etc.
| 
| Took me until today to realise that sequence is a rounded
| version of 10^(n/3) for integer n.
| Dylan16807 wrote:
| If you're going to bother doing floats, you should probably make
| them balanced around 1.
| 
| And the exponent seems to be much more important at these small
| sizes. The first paper that shows up for FP4 almost has negative
| mantissa bits. Their encoding has 0, 1/64, 1/16, 1/4, 1, 4, 16,
| 64.
| varispeed wrote:
| That's once we get into asymmetrical number coding, so that you
| could use numbers that take a fraction of a bit.
| dimatura wrote:
| Binary neural networks, where weights and/or activations are
| just 0/1, are an active research area. In theory they could be
| implemented very efficiently in hardware. But in contrast to
| FP16 (or, to some extent, INT8), just quantizing FP32 to 1 bit
| doesn't work very well. There have been successful methods in
| practice. There was a company called Xnor.ai that was built
| partially around this technology, but it was sold to Apple a
| couple of years ago. I don't know what the current SOTA in this
| area is, though.
| SideQuark wrote:
| 1 bit would work fine - make the values represent +/-1 or so.
| visarga wrote:
| I think I read somewhere it only goes as low as INT4. Can't find
| the reference.
___________________________________________________________________
(page generated 2023-01-16 23:00 UTC)
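 
(A footnote to the FP4 subthread above: the +/-1, 2, ..., 128 set
mentioned there falls out of enumerating a toy sign/exponent/
mantissa mini-float. The sketch below is illustrative only; the bit
splits and biases are assumptions chosen for the example, not any
standard.)
 
    from itertools import product
 
    def minifloat_values(exp_bits, man_bits, bias):
        # Positive values of a toy float with an implicit leading 1,
        # ignoring subnormals, infinities and NaN (real formats such
        # as E4M3/E5M2 reserve codes for those).
        vals = set()
        for e, m in product(range(2 ** exp_bits), range(2 ** man_bits)):
            vals.add((1 + m / 2 ** man_bits) * 2.0 ** (e - bias))
        return sorted(vals)
 
    # 1 sign + 3 exponent + 0 mantissa bits, bias 0: magnitudes
    # 1, 2, 4, ..., 128, the set mentioned in the thread.
    print(minifloat_values(exp_bits=3, man_bits=0, bias=0))
 
    # 1 sign + 2 exponent + 1 mantissa bit, bias 1: trades range for
    # a little precision around 1.
    print(minifloat_values(exp_bits=2, man_bits=1, bias=1))
 
(Adding exponent bits extends the dynamic range, while adding
mantissa bits tightens the spacing between neighbouring values;
that is why the thread leans toward exponent-heavy splits at these
tiny sizes.)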