[HN Gopher] Numbers every LLM developer should know
___________________________________________________________________
Numbers every LLM developer should know
Author : richardliaw
Score  : 227 points
Date   : 2023-05-17 17:50 UTC (5 hours ago)

| YetAnotherNick wrote:
| > ~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens
|
| The LLaMA paper mentioned 135,168 A100-hours for training the 13 billion parameter model on 1 trillion tokens, which works out to ~$150k on Lambda Labs on-demand instances.
|
| waleedk wrote:
| [Author] Good luck trying to use clusters of Lambda machines. Lambda Labs is cheap for a reason: their API is not very featureful (we looked at them and saw they didn't even support machine tagging). If you're looking for a box or two, Lambda Labs is fine. If you're looking for 1,000, not so much.
|
| Plus they don't actually have any A100s available at the moment (2023-05-17).
|
| CoreWeave is a nice middle ground. You can at least get the A100 machines into a k8s cluster.
|
| curiousgal wrote:
| > _LLM Developer_
|
| This is the fastest I've rolled my eyes in a long time!
|
| ryanklee wrote:
| The amount of get-off-my-lawn grognardness that LLM activity inspires is really ridiculous.
|
| I really would ask you to take a second look at the spirit of your comment and think carefully about how much you really understand about the work being done on top of LLMs, and whether it justifies this kind of response.
|
| astrea wrote:
| I had the same reaction as the OP. I'm not a data scientist by trade or title, but I would personally be a little offended. If you designed the Porsche 911, would you not be offended by the shade-tree mechanic who simply knows how to change the oil calling himself a Porsche designer/engineer?
|
| RyanCavanaugh wrote:
| Context matters. Is a "web developer" someone who makes web pages, or someone who works on a browser rendering engine?
|
| ryanklee wrote:
| There are people making applications based on LLMs. You may quibble with the term LLM Developer, but to sneer or roll your eyes at it as if it were prima facie inaccurate or laughable is unjustified.
|
| cwkoss wrote:
| Are there any open-source, host-your-own LLMs with licensing that allows for commercial use?
|
| Der_Einzige wrote:
| Dolly from Databricks is one, at least.
|
| waleedk wrote:
| [Author] TL;DR: open-source LLMs are coming.
|
| Dolly's not that great -- I've hit lots of issues using it, to be honest.
|
| MosaicML has a nice commercially usable model here: https://www.mosaicml.com/blog/mpt-7b
|
| I think they're one of the leading ones (bias: they're kinda competitors to my employer Anyscale, but you gotta say something's good when it is).
|
| RedPajama is leading an effort to build a fully open-source model similar to LLaMA. https://www.together.xyz/blog/redpajama
|
| int_19h wrote:
| https://github.com/BlinkDL/RWKV-LM
|
| elorant wrote:
| Vicuna-13B is under the Apache License 2.0.
|
| twbarr wrote:
| Vicuna is a delta model that you have to apply on top of LLaMA.
|
| throwaway888abc wrote:
| Excellent! Thank you so much for making/posting this.
|
| waleedk wrote:
| [Author] You're welcome -- glad it was useful!
|
| MacsHeadroom wrote:
| > Of course there are efforts to reduce this, notably llama.cpp which runs a 13 billion parameter model on a 6GB GPU by quantizing aggressively down to 4 bits (and 8 bits without too much impact), but that's atypical.
|
| No, 4bit quantization is the typical case.
|
| At 4bit you can fit twice the parameters of 8bit in the same space, for far better performance/perplexity/quality.
|
| Running LLMs at higher than 4bit is atypical and almost always sub-optimal (for the same memory it means running a model half the size in 8bit).
|
| Even pretraining and finetuning in 4bit is likely to become the norm soon as fp4 becomes better understood.
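To put rough numbers on the memory being argued about in this subthread, here is a small back-of-the-envelope sketch in Python (an illustration, not from the gist; it assumes the weights dominate memory use and ignores the KV cache, activations, and per-block quantization overhead):

    # Approximate memory needed just to hold the weights.
    # Assumption: weights dominate; KV cache, activations, and
    # quantization block-scale overhead are ignored.
    def weight_gb(n_params: float, bits: int) -> float:
        return n_params * bits / 8 / 1e9

    for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
        for bits in (16, 8, 4):
            print(f"{name} @ {bits}-bit: ~{weight_gb(n_params, bits):.1f} GB")

By this arithmetic, 13B at 4-bit is ~6.5 GB of weights alone (so roughly 10 GB in practice once cache and overhead are added, as noted below), 7B at 4-bit is ~3.5 GB, and 13B at 16-bit is ~26 GB, which is why the unquantized model does not fit on a single consumer GPU.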
| moffkalast wrote:
| > llama.cpp which runs a 13 billion parameter model on a 6GB GPU
|
| I think that's a typo there too: the 13B model needs something like 10 GB of memory at 4 bits; it's the 7B one that fits into 6 GB. Well, unless you do the split thing with some layers on the CPU, I guess.
|
| Der_Einzige wrote:
| No it isn't, quantization is not free. You lose a significant amount of performance that you are not measuring properly in automated benchmarks when you quantize to that level.
|
| You can see it in real time when you take most LLMs and compare them at different quantization levels. I can see the degradation quite badly even in the largest LLaMA, even at 8 bits.
|
| astrange wrote:
| If you take a model and quantize it, it's obviously going to get worse, but what if you train it again after that?
|
| MacsHeadroom wrote:
| Quantization is not free, but VRAM is even less free.
|
| If you have X amount of VRAM, you can fit either a model whose 16bit size is 2X (run in 8bit) or a model whose 16bit size is 4X (run in 4bit); the 4X model in 4bit is ALWAYS superior, with lower perplexity and better performance.
|
| You LOSE performance by using a smaller model in 8bit vs a larger model in 4bit.
|
| waleedk wrote:
| [Author] Completely disagree. Any analysis shows that you see perplexity degradation at 4 bits. Have a look at llama.cpp's results here:
|
| https://github.com/ggerganov/llama.cpp#quantization
|
| 4 bit has a perplexity score 0.13 or so higher.
|
| MacsHeadroom wrote:
| You're just wrong. You're looking at the wrong numbers. The perplexity score of a model with twice the parameters in half the bits (4bit) is FAR LOWER (i.e. better).
|
| If you are limited to X RAM and have two 16bit models of size 4X and 2X, then the 4X model in 4bit will always be far superior to the 2X model in 8bit, with far lower perplexity.
|
| Compare 13B's 4bit perplexity of 5.3607 to 7B's 8bit perplexity of 5.9069. That is over 0.54 lower perplexity for the same RAM amount by using 4bit! That is MASSIVE!
|
| Taek wrote:
| There's also research showing that the perplexity degradation from quantization is smaller at higher parameter counts. E.g. a 65B parameter model barely sees any impact at all when going from 16bit to 4bit.
|
| mmoskal wrote:
| Well, if you have a fixed RAM size, you're better off with the largest model you can fit at 4 bits (13B 4b is way better than 7B 16b despite taking up less than half the memory).
|
| kherud wrote:
| Can somebody please explain how quantization below 8 bit works? Since a byte is the smallest addressable unit, I think, is the dimensionality of the weights somehow reduced?
|
| f_devd wrote:
| I believe it's locally (inner-loop or SIMD op) up-cast to float8/float16/int8, but I haven't looked at the internals of llama.cpp myself.
|
| waleedk wrote:
| [Author] You approximate the weights using fewer bits. You also switch to ints instead of floats and then do some fancy stuff when multiplying to make it all work together.
|
| More detail than you probably wanted: https://huggingface.co/blog/hf-bitsandbytes-integration
|
| MacsHeadroom wrote:
| The latest release of bitsandbytes uses a new fp4 format. 4bit floating point scaling results in much lower perplexity than int4.
|
| Also note that for a fixed memory (RAM) size, 4bit (even int4) is always superior, resulting in lower perplexity than 8bit.
|
| E.g. LLaMA-13B int4 is far better/lower perplexity than LLaMA-7B fp8 while using the same amount of RAM.
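To make the "byte is the smallest addressable unit" question concrete: sub-byte quantization does not reduce the dimensionality of the weights; it packs several low-bit codes into each byte and stores a higher-precision scale per block of weights, dequantizing (or up-casting) at compute time. Below is a minimal NumPy sketch of symmetric blockwise int4, offered as an illustration under assumed details (block size 32, one fp16 scale per block); real kernels such as llama.cpp's or bitsandbytes' differ in the specifics:

    import numpy as np

    BLOCK = 32  # weights per quantization block (assumed block size)

    def quantize_int4(w):
        """One fp16 scale per block; two 4-bit codes packed per byte."""
        w = w.reshape(-1, BLOCK)
        scale = np.abs(w).max(axis=1, keepdims=True) / 7 + 1e-12  # codes span [-7, 7]
        q = np.clip(np.round(w / scale), -7, 7)
        u = (q + 8).astype(np.uint8)                # shift codes into [1, 15]
        packed = (u[:, 0::2] << 4) | u[:, 1::2]     # two 4-bit codes per uint8
        return packed, scale.astype(np.float16)

    def dequantize_int4(packed, scale):
        hi = (packed >> 4).astype(np.int16) - 8
        lo = (packed & 0x0F).astype(np.int16) - 8
        q = np.empty((packed.shape[0], BLOCK), dtype=np.int16)
        q[:, 0::2], q[:, 1::2] = hi, lo
        return (q * scale.astype(np.float32)).reshape(-1)

    w = np.random.randn(4096).astype(np.float32)
    packed, scale = quantize_int4(w)
    print(packed.nbytes + scale.nbytes, "bytes vs", w.nbytes)   # ~4.5 bits/weight vs 32
    print(np.abs(w - dequantize_int4(packed, scale)).max())     # worst-case error

The storage works out to roughly 4.5 bits per weight (4 bits of code plus the amortized per-block scale), which is in the same ballpark as llama.cpp's q4 formats.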
| contravariant wrote:
| How come the token-to-word ratio is smaller than 1 if tokens are either words or parts of words? Shouldn't you expect _more_ tokens than words?
|
| waleedk wrote:
| [Author] Fair point -- I clarified the language and gave a concrete example. Hope that helps!
|
| yonixw wrote:
| That is how I understood it: a token is on average 3/4 of a word. "Token to word". So if you want to buy 1000 tokens you would get effectively 750 words.
|
| [deleted]
|
| renewiltord wrote:
| It's the token-to-word multiplier, yeah. I.e. x tokens = 0.75x words.
|
| furyofantares wrote:
| I think all the ratios given are x:1 and they tell you x.
|
| qeternity wrote:
| It's the other way around.
|
| 1 GPT-4 token costs as much as 50 GPT-3.5 tokens.
|
| 1 token is equivalent to 0.75 words.
|
| contravariant wrote:
| That would make it 0.75 tokens to 1 word, right?
|
| ramesh1994 wrote:
| I think parts of the write-up are great.
|
| There are some unique assumptions being made in parts of the gist:
|
| > 10: Cost Ratio of OpenAI embedding to Self-Hosted embedding
|
| > 1: Cost Ratio of Self-Hosted base vs fine-tuned model queries
|
| I don't know how useful these numbers are if you take away the assumption that self-hosted will work as well as the API.
|
| > 10x: Throughput improvement from batching LLM requests
|
| I see that the write-up mentions memory being a caveat to this, but it also depends on the card specs as well. The memory bandwidth / TFLOPs offered by, say, a 4090 are superior while it has the same amount of VRAM as a 3090. The caveat mentioned with token length in the gist itself makes the 10x claim not a useful rule of thumb.
|
| ramesh1994 wrote:
| > This means it is way cheaper to look something up in a vector store than to ask an LLM to generate it. E.g. "What is the capital of Delaware?" when looked up in a neural information retrieval system costs about 5x less than if you asked GPT-3.5-Turbo. The cost difference compared to GPT-4 is a whopping 250x!
|
| Only in the narrow use-case of a strict look-up. This seems to exaggerate the cost difference while the two approaches have completely different trade-offs.
|
| abetlen wrote:
| I would add the following two numbers if you're generating realtime text or speech for human consumption:
|
| - Human Reading Speed (English): ~250 words per minute
|
| - Human Speaking Speed (English): ~150 words per minute
|
| These should be treated like the Doherty Threshold [1] for generative content.
|
| [1] https://lawsofux.com/doherty-threshold/
|
| armchairhacker wrote:
| But I'd say LLMs produce content faster than I can read or write it, because they can produce content which is really dense.
|
| Ask GPT-4 a question and then answer it yourself. Maybe your answer will be as good as or better than GPT-4's, but GPT-4 writes its answer a lot faster.
|
| Flux159 wrote:
| I think it would be helpful to add fine-tuning costs for an open-source model (think LLaMA to Alpaca).
|
| From the phrasing around fine-tuning right now, it seems like it's using OpenAI's fine-tuning API to determine that cost, but it's not very clear.
|
| Also, this would be helpful for other foundation models if it doesn't already exist - how much VRAM to run Stable Diffusion v2.1 at different resolutions, running Whisper or Bark for audio, etc.
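Connecting two of the numbers raised above -- roughly 1.3 tokens per English word and the suggested reading/speaking speeds -- here is a small sketch of the generation rate a streaming UI needs to stay ahead of the user (an illustration; the speeds are the approximate figures quoted in this thread):

    # Generation speed needed for streamed output to keep ahead of a human,
    # using ~1.3 tokens per English word and the speeds suggested above.
    TOKENS_PER_WORD = 1.3
    READ_WPM = 250    # human reading speed, English (approximate)
    SPEAK_WPM = 150   # human speaking speed, English (approximate)

    def tokens_per_second(words_per_minute):
        return words_per_minute * TOKENS_PER_WORD / 60

    print(f"keep up with reading:  ~{tokens_per_second(READ_WPM):.1f} tokens/s")
    print(f"keep up with speaking: ~{tokens_per_second(SPEAK_WPM):.1f} tokens/s")

That is about 5-6 tokens/s to keep pace with a reader and about 3 tokens/s for speech; generation much slower than that is where the Doherty-style responsiveness concern kicks in.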
| sebzim4500 wrote:
| They mention that they could finetune a 6B model for $7. Obviously the number depends on the amount of data and the model size, but it's probably not going to be a significant expense in practice.
|
| PoignardAzur wrote:
| > _~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens_
|
| MosaicML claims they trained a 7 billion parameter model on 1 trillion tokens with a budget of $200k.
|
| https://www.mosaicml.com/blog/mpt-7b
|
| Does training cost scale linearly with model size and token count? If so, that suggests a lower bound of $600k to train the 13-billion-parameter model. (Still roughly the same order of magnitude.)
|
| waleedk wrote:
| [Author] Mosaic must be getting some kind of sweetheart deals on A100 80GB and A100 40GB. The prices they are quoting are not what, say, the AWS on-demand prices are. They quote $2 per GPU for A100 40GB and $2.50 for A100 80GB. That's literally half the AWS on-demand rate for A100s here: https://aws.amazon.com/ec2/instance-types/p4/
|
| And these are impossible to get. We tried to get some for Anyscale, and we were told there was no on-demand availability, and the lead time for reserved (ouchie on the price! You're talking a quarter of a million dollars a year for one machine at list) was in weeks.
|
| Once you take the model size and hefty sweetheart deals into account, you're within 10%. Mosaic does have some nice whitebox optimizations, but nothing that radically changes the equation.
|
| fpgaminer wrote:
| A100-40GB is like $1.10 on Lambda Labs, on demand. Their availability is horrific on singles, but I've seen 8x instances pop up more often than not. And you can rent A100s for a buck a pop interruptible from other clouds, plenty of availability. $2 doesn't seem like much of a sweetheart deal.
|
| born-jre wrote:
| RANDOM THOUGHT:
|
| i wonder when we are getting Docker for LLMs ... a Modelfile?
|
|     FROM "PAAMA/16b"
|     APPLY "MNO/DATASET"
|
| Each layer could be a LoRA-adapter-like thing, maybe.
|
| Maybe when AI chips are finally here.
|
| jjtheblunt wrote:
| The PyTorch tutorial looks similar (lower on the page):
|
| https://pytorch.org/tutorials/beginner/pytorch_with_examples...
|
| kristjansson wrote:
| SQLFlow[0] looks sort of like that:
|
|     SELECT *
|     FROM iris.train
|     TO TRAIN DNNClassifier
|     WITH model.hidden_units = [10, 10],
|          model.n_classes = 3,
|          train.epoch = 10
|     COLUMN sepal_length, sepal_width, petal_length, petal_width
|     LABEL class
|     INTO sqlflow_models.my_dnn_model;
|
| No idea how well it works.
|
| [0]: https://sql-machine-learning.github.io/
|
| jncraton wrote:
| > There's usually no need to go beyond 16-bit accuracy, and most of the time when you go to 8-bit accuracy there is too much loss of resolution.
|
| I'm not sure this is accurate. From what I have seen, 8-bit quantization is usually fine, and even 4-bit is a viable tradeoff. Here are some benchmarks from TextSynth showing no significant degradation between 16-bit and 8-bit:
|
| https://textsynth.com/technology.html
|
| 8-bit uses half as much memory and doubles the throughput for limited quality loss.
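A note on why halving the bits roughly doubles throughput: single-stream decoding is memory-bandwidth bound, since each generated token has to stream essentially all of the weights from memory once. A rough upper-bound sketch (an illustration; the ~1,000 GB/s figure is an assumed ballpark for a 3090/4090-class card, and compute, batching, and KV-cache traffic are ignored):

    # Upper bound on single-stream decode speed if every generated token
    # must read all weights from memory once. Ignores compute, KV cache,
    # and batching (which is where the gist's ~10x batching gain lives).
    def max_tokens_per_sec(n_params, bits, bandwidth_gb_s):
        weight_bytes = n_params * bits / 8
        return bandwidth_gb_s * 1e9 / weight_bytes

    BANDWIDTH = 1000  # GB/s, assumed ballpark for a 3090/4090-class card
    for bits in (16, 8, 4):
        print(f"13B @ {bits}-bit: <= {max_tokens_per_sec(13e9, bits, BANDWIDTH):.0f} tokens/s")

Fewer bits means fewer bytes read per token, hence the near-linear speedup for single-stream inference; batched serving is closer to compute-bound and behaves differently.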
| superkuh wrote:
| It's true if you're doing training. But for inference, severe quantization is mostly okay. And even when running inference with a quantized model, there are some internal parts of a transformer where you might want to up-cast the low-bit inputs and do the calculations in 16 bits, like the dot-product similarity between vectors.
|
| Jackson__ wrote:
| Even that is being tackled by newer GPU architectures. For example, NovelAI is currently training an LLM in fp8 precision, using H100 GPUs. [1]
|
| [1] https://blog.novelai.net/anlatan-acquires-hgx-h100-cluster-4...
|
| https://blog.novelai.net/text-model-progress-is-going-good-8...
|
| superkuh wrote:
| Cool stuff. I looked at https://en.wikipedia.org/wiki/Hopper_%28microarchitecture%29 and noticed that the fp8 support is only for the tensor cores and not the CUDA side. Does that mean training with an H100 GPU in fp8 mode would use some software ecosystem that's not the vast existing CUDA one? Or am I just misunderstanding CUDA cores vs tensor cores?
|
| PS, as a joke, they should implement GPU fluint8 and get baked-in non-linearity for the activation function without even using a non-linear function: https://www.youtube.com/watch?v=Ae9EKCyI1xU ("GradIEEEnt half decent: The hidden power of imprecise lines" by suckerpinch)
|
| qeternity wrote:
| The problem with 8bit at the moment is massive performance degradation with bitsandbytes. Recent improvements in 4bit inference mean that 8bit is now a massive laggard (although there's no reason not to expect this to resolve).
|
| f_devd wrote:
| The article is right: 8-bit (and especially 4-bit) is atypical for deep learning models, and it highly depends on the number of parameters (a larger model can handle more quantization) and can even depend on specific training hyperparameters (mainly dropout & weight decay, which can induce sparsity).
|
| int_19h wrote:
| Thing is, even when the impact from 4-bit is substantial, the larger parameter count it allows on the same hardware more than makes up for it. E.g. llama-30b is better at 4-bit than _any_ derivative of llama-13b, no matter how fine-tuned or quantized.
|
| waleedk wrote:
| [Author] Fair point. Adjusted the language.
|
| Nonetheless, people do tend to use 16-bit Hugging Face models, and if you do go to 8 bits and it's wrong, you're never quite sure if it's the quant or the model.
|
| fzliu wrote:
| AFAIK for over-parameterized models, performing quantization or any other form of compression won't reduce accuracy by much (don't quote me on this though).
___________________________________________________________________
(page generated 2023-05-17 23:00 UTC)