[HN Gopher] Numbers every LLM developer should know
       ___________________________________________________________________
        
       Numbers every LLM developer should know
        
       Author : richardliaw
       Score  : 227 points
       Date   : 2023-05-17 17:50 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | YetAnotherNick wrote:
       | > ~$1 million: Cost to train a 13 billion parameter model on 1.4
       | trillion tokens
       | 
        | The LLaMA paper mentions 135,168 A100-hours for training the 13
        | billion parameter model on 1 trillion tokens, which works out to
        | ~$150k on Lambda Labs on-demand instances.
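        | 
        | As a rough sanity check (a sketch in Python; the ~$1.10/GPU-hour
        | figure is the Lambda Labs on-demand A100 rate quoted later in
        | this thread):
        | 
        |     gpu_hours = 135_168          # LLaMA paper, 13B on 1T tokens
        |     usd_per_gpu_hour = 1.10      # assumed on-demand A100 rate
        |     print(f"${gpu_hours * usd_per_gpu_hour:,.0f}")  # ~$148,685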
        
         | waleedk wrote:
          | [Author] Good luck trying to use clusters of Lambda machines.
          | Lambda Labs is cheap for a reason: their API is not very
          | featureful (we looked at them and saw they didn't even support
          | machine tagging). If you're looking for a box or two, Lambda
          | Labs is fine. If you're looking for 1,000, not so much.
         | 
          | Plus they don't actually have any A100s available at the moment
          | (2023-05-17).
         | 
         | CoreWeave is a nice middle ground. You can at least get the
         | A100 machines into a k8s cluster.
        
       | curiousgal wrote:
       | > _LLM Developer_
       | 
       | This is the fastest I've rolled my eyes in a long time!
        
         | ryanklee wrote:
         | The amount of get-off-my-lawn grognardness that LLM activity
         | inspires is really ridiculous.
         | 
          | I really would ask you to take a second look at the spirit of
          | your comment and think carefully about how much you really
          | understand about the work being done on top of LLMs, and
          | whether it justifies this kind of response.
        
           | astrea wrote:
           | I had the same reaction as the OP. I'm not a data scientist
           | by trade or title, but I would personally be a little
           | offended. If you designed the Porsche 911, would you not be
           | offended by the shade tree mechanic who simply knows how to
           | change the oil calling himself a Porsche designer/engineer?
        
             | RyanCavanaugh wrote:
             | Context matters. Is a "web developer" someone who makes web
             | pages, or works on a browser rendering engine?
        
             | ryanklee wrote:
             | There are people making applications based on LLMs. You may
             | quibble with the term LLM Developer, but to sneer or roll
             | your eyes at it as if it were prima facie inaccurate or
             | laughable is unjustified.
        
       | cwkoss wrote:
       | Are there any open source host-your-own LLMs that have licensing
       | that allows for commercial use?
        
         | Der_Einzige wrote:
         | Dolly from Databricks is one at least
        
           | waleedk wrote:
            | [Author] TL;DR: open source LLMs are coming.
            | 
            | Dolly's not that great -- I've hit lots of issues using it,
            | to be honest.
           | 
           | MosaicML has a nice commercially usable model here:
           | https://www.mosaicml.com/blog/mpt-7b
           | 
           | I think they're one of the leading ones (bias: they're kinda
           | competitors to my employer Anyscale, but you gotta say
           | something's good when it is).
           | 
            | RedPajama is leading an effort to build a fully open source
            | model similar to LLaMA:
            | https://www.together.xyz/blog/redpajama
        
         | int_19h wrote:
         | https://github.com/BlinkDL/RWKV-LM
        
         | elorant wrote:
         | Vicuna-13b is on Apache License 2.0.
        
           | twbarr wrote:
           | Vicuna is a delta model that you have to apply on top of
           | LLaMA.
        
       | throwaway888abc wrote:
       | Excellent! Thank you so much for making/posting this
        
         | waleedk wrote:
         | [Author] You're welcome -- glad it was useful!
        
       | MacsHeadroom wrote:
       | > Of course there are efforts to reduce this, notably llama.cpp
       | which runs a 13 billion parameter model on a 6GB GPU by
       | quantizing aggressively down to 4 bits (and 8 bits without too
       | much impact), but that's atypical.
       | 
       | No, 4bit quantization is the typical case.
       | 
       | At 4bit you can fit twice the parameters of 8bit in the same
       | space for far better performance/perplexity/quality.
       | 
       | Running LLMs higher than 4bit is atypical and almost always sub-
       | optimal (compared to running a model half the size in 8bit).
       | 
        | Even pretraining and finetuning in 4bit is likely to become the
        | norm soon as fp4 becomes better understood.
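        | 
        | The memory math behind that claim, as a rough sketch (weights
        | only, ignoring the KV cache and runtime overhead):
        | 
        |     def weight_gb(params_billion, bits):
        |         # parameters * bits per weight, converted to gigabytes
        |         return params_billion * 1e9 * bits / 8 / 1e9
        | 
        |     print(weight_gb(13, 16))  # ~26.0 GB: 13B at 16-bit
        |     print(weight_gb(13, 4))   # ~6.5 GB: 13B at 4-bit
        |     print(weight_gb(7, 8))    # ~7.0 GB: 7B at 8-bit
        | 
        | So for roughly the same memory budget, 4-bit fits about twice the
        | parameters of 8-bit.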
        
         | moffkalast wrote:
         | > llama.cpp which runs a 13 billion parameter model on a 6GB
         | GPU
         | 
          | I think that's a typo there too: the 13B model needs around
          | 10GB of memory at 4 bits; it's the 7B one that fits into 6GB.
          | Well, unless you do the split thing with some layers on the
          | CPU, I guess.
        
         | Der_Einzige wrote:
         | No it isn't, quantization is not free. You lose a significant
         | amount of performance that you are not measuring properly in
         | automated benchmarks when you quantize to that level.
         | 
          | You can see it in real time when you take most LLMs and compare
          | them at different quantization levels. I can see the
          | degradation quite badly even in the largest LLaMA, even at 8
          | bits.
        
           | astrange wrote:
           | If you take a model and quantize it it's obviously going to
           | get worse, but what if you train it again after that?
        
           | MacsHeadroom wrote:
           | Quantization is not free, but VRAM is even less free.
           | 
           | If you have X amount of VRAM and can fit a 16bit model of
           | size 2X in 8bit or a model of size 4X in 4bit then the 4X
           | model in 4bit is ALWAYS superior with lower perplexity and
           | better performance.
           | 
           | You LOSE performance by using a smaller model in 8bit vs a
           | larger model in 4bit.
        
         | waleedk wrote:
          | [Author] Completely disagree. Any analysis shows that you see a
          | perplexity degradation at 4 bits. Have a look at llama.cpp's
         | results here:
         | 
         | https://github.com/ggerganov/llama.cpp#quantization
         | 
         | 4 bit has a perplexity score 0.13 or so higher.
        
           | MacsHeadroom wrote:
           | You're just wrong. You're looking at the wrong numbers. The
           | perplexity score of a model with twice the parameters in half
           | the bits (4bit) is FAR LOWER (ie better).
           | 
           | If you are limited to X RAM and have two 16bit models of size
           | 4X and 2X then the 4X model in 4bit will always be far
           | superior to the 2X model in 8bit, with far lower perplexity.
           | 
           | Compare 13B's 4bit perplexity of 5.3607 to 7B's 8bit
           | perplexity of 5.9069. That is over 0.54 lower perplexity for
           | the same RAM amount by using 4bit! That is MASSIVE!
        
           | Taek wrote:
            | There's also research showing that the perplexity degradation
            | is smaller at higher parameter counts. E.g. a 65B parameter
            | model barely sees any impact at all when going from 16-bit to
            | 4-bit.
        
           | mmoskal wrote:
            | Well, if you have a fixed RAM size, you're better off with
            | the largest model you can fit at 4 bits (13B at 4-bit is way
            | better than 7B at 16-bit despite taking about half the
            | memory).
        
         | kherud wrote:
          | Can somebody please explain how quantization below 8 bit works?
          | Since a byte is, I think, the smallest addressable unit, is the
          | dimensionality of the weights somehow reduced?
        
           | f_devd wrote:
           | I believe it's locally (inner-loop or simd op) up-cast to
           | float8/float16/int8, but I haven't looked at the internals of
           | llama.cpp myself
        
           | waleedk wrote:
           | [Author] You approximate the weights using fewer bits. You
           | also switch to ints instead of floats and then do some fancy
           | stuff when multiplying to make it all work together.
           | 
           | More detail than you probably wanted:
           | https://huggingface.co/blog/hf-bitsandbytes-integration
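            | 
            | A minimal sketch of the idea (group-wise symmetric
            | quantization in NumPy; not llama.cpp's exact scheme). For
            | storage, two 4-bit values get packed into one byte; at
            | compute time they are unpacked and scaled back up, often in
            | 16-bit:
            | 
            |     import numpy as np
            | 
            |     def quantize_4bit(w, group_size=32):
            |         # One scale per group of weights; ints in -7..7
            |         w = w.reshape(-1, group_size)
            |         m = np.abs(w).max(axis=1, keepdims=True)
            |         scale = np.maximum(m, 1e-8) / 7.0
            |         q = np.round(w / scale).clip(-7, 7).astype(np.int8)
            |         return q, scale
            | 
            |     def dequantize_4bit(q, scale):
            |         # Up-cast to float and rescale before the matmul
            |         return (q.astype(np.float32) * scale).reshape(-1)
            | 
            |     w = np.random.randn(1024).astype(np.float32)
            |     q, s = quantize_4bit(w)
            |     err = np.abs(w - dequantize_4bit(q, s)).mean()
            |     print(err)  # small, but nonzero, approximation error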
        
             | MacsHeadroom wrote:
              | The latest release of bitsandbytes uses a new fp4 format.
              | 4-bit floating point scaling results in much lower
              | perplexity than int4.
             | 
             | Also note that for a fixed memory (RAM) size, 4bit (even
             | int4) is always superior, resulting in lower perplexity
             | than 8bit.
             | 
             | E.g. LLaMA-13B int4 is far better/lower perplexity than
             | LLaMA-7B fp8 while using the same amount of RAM.
        
       | contravariant wrote:
       | How come the token to word ratio is smaller than 1 if tokens are
       | either words or part of words? Shouldn't you expect _more_ tokens
       | than words?
        
         | waleedk wrote:
         | [Author] Fair point -- I clarified the language and gave a
         | concrete example. Hope that helps!
        
         | yonixw wrote:
          | That is how I understood it: a token is on average 3/4 of a
          | word ("token to word"). So if you buy 1,000 tokens, you
          | effectively get 750 words.
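          | 
          | In code, using the gist's ~0.75 words-per-token rule of thumb
          | (a rough average for English; real tokenizer counts vary):
          | 
          |     WORDS_PER_TOKEN = 0.75
          | 
          |     def tokens_to_words(tokens):
          |         return tokens * WORDS_PER_TOKEN
          | 
          |     def words_to_tokens(words):
          |         return words / WORDS_PER_TOKEN
          | 
          |     print(tokens_to_words(1000))  # ~750 words
          |     print(words_to_tokens(750))   # ~1000 tokens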
        
         | [deleted]
        
         | renewiltord wrote:
         | It's the token to word multiplier, yeah. i.e. x tokens = 0.75x
         | words.
        
         | furyofantares wrote:
         | I think all the ratios given are x:1 and they tell you x.
        
           | qeternity wrote:
           | It's the other way around.
           | 
            | 1 GPT-4 token costs about as much as 50 GPT-3.5 tokens.
            | 
            | 1 token is equivalent to about 0.75 words.
        
           | contravariant wrote:
           | That would make it 0.75 tokens to 1 word right?
        
       | ramesh1994 wrote:
       | I think parts of the write-up are great.
       | 
        | There are some unique assumptions being made in parts of the
        | gist.
       | 
       | > 10: Cost Ratio of OpenAI embedding to Self-Hosted embedding
       | 
       | > 1: Cost Ratio of Self-Hosted base vs fine-tuned model queries
       | 
        | I don't know how useful these numbers are if you take away the
        | assumption that self-hosted will work as well as the API.
       | 
       | > 10x: Throughput improvement from batching LLM requests
       | 
        | I see that the write-up mentions memory being a caveat to this,
        | but it also depends on the card specs. The memory bandwidth and
        | TFLOPS offered by, say, a 4090 are superior to a 3090's while
        | having the same amount of VRAM. The caveat about token length in
        | the gist itself makes the 10x claim not a useful rule of thumb.
        
         | ramesh1994 wrote:
         | > This means it is way cheaper to look something up in a vector
         | store than to ask an LLM to generate it. E.g. "What is the
          | capital of Delaware?" when looked up in a neural information
          | retrieval system costs about 5x less than if you asked
         | GPT-3.5-Turbo. The cost difference compared to GPT-4 is a
         | whopping 250x!
         | 
          | That only holds in the narrow use case of a strict look-up.
          | This seems to exaggerate the cost difference, since the two
          | approaches have completely different trade-offs.
        
       | abetlen wrote:
       | I would add the following two numbers if you're generating
       | realtime text or speech for human consumption:
       | 
       | - Human Reading Speed (English): ~250 words per minute
       | 
       | - Human Speaking Speed (English): ~150 words per minute
       | 
       | Should be treated like the Doherty Threshold [1] for generative
       | content.
       | 
       | [1] https://lawsofux.com/doherty-threshold/
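        | 
        | A quick back-of-the-envelope for what those targets imply,
        | assuming the gist's ~0.75 words-per-token average:
        | 
        |     WORDS_PER_TOKEN = 0.75
        | 
        |     def tokens_per_second(words_per_minute):
        |         return words_per_minute / WORDS_PER_TOKEN / 60
        | 
        |     print(tokens_per_second(250))  # ~5.6 tok/s to match reading
        |     print(tokens_per_second(150))  # ~3.3 tok/s to match speech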
        
         | armchairhacker wrote:
         | But I'd say LLMs produce content faster than I can read or
         | write it, because they can produce content which is really
         | dense.
         | 
          | Ask GPT-4 a question and then answer it yourself. Maybe your
          | answer will be as good as or better than GPT-4's, but GPT-4
          | writes its answer a lot faster.
        
       | Flux159 wrote:
        | I think it would be helpful to add fine-tuning costs for an open
        | source model (think LLaMA to Alpaca).
        | 
        | From the phrasing around fine-tuning right now, it seems like
        | it's using OpenAI's fine-tuning API to determine that cost, but
        | it's not very clear.
       | 
       | Also this would be helpful for other foundation models if that
       | doesn't already exist - how much VRAM to run Stable Diffusion
       | v2.1 at different resolutions, running Whisper or Bark for audio,
       | etc.
        
         | sebzim4500 wrote:
         | They mention that they could finetune a 6B model for $7.
         | Obviously the number depends on the amount of data and the
         | model size but it's probably not going to be a significant
         | expense in practice.
        
       | PoignardAzur wrote:
       | > _~$1 million: Cost to train a 13 billion parameter model on 1.4
       | trillion tokens_
       | 
        | MosaicML claims they trained a 7 billion parameter model on 1
        | trillion tokens with a budget of $200k.
       | 
       | https://www.mosaicml.com/blog/mpt-7b
       | 
        | Does training cost scale linearly with model size and token
        | count? If so, that suggests a lower bound of roughly $600k to
        | train the 13 billion parameter model. (Still roughly the same
        | order of magnitude.)
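        | 
        | To first order, yes: training compute for dense transformers is
        | commonly estimated as ~6 * parameters * tokens FLOPs, so cost
        | scales roughly linearly in both. A sketch (the utilization and
        | $/GPU-hour figures below are assumptions, not from the gist):
        | 
        |     def train_cost(params, tokens, usd_per_gpu_hour,
        |                    peak_flops=312e12, util=0.5):
        |         # ~6 * N * D is the standard rough FLOPs estimate
        |         hrs = 6 * params * tokens / (peak_flops * util) / 3600
        |         return hrs, hrs * usd_per_gpu_hour
        | 
        |     # 13B params, 1.4T tokens, A100 bf16 peak ~312 TFLOPS
        |     print(train_cost(13e9, 1.4e12, 1.10))  # ~194k hrs, ~$214k
        |     print(train_cost(13e9, 1.4e12, 4.10))  # ~194k hrs, ~$797k
        | 
        | So the gap between the ~$150-200k estimates and the gist's ~$1M
        | is mostly about what you pay per GPU-hour and what utilization
        | you actually achieve.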
        
         | waleedk wrote:
         | [Author] Mosaic must be getting some kind of sweetheart deals
          | on A100 80GB and A100 40GB. The prices they are quoting are
          | nowhere near, say, the AWS on-demand prices. They quote $2 per
          | GPU-hour for the A100 40GB and $2.50 for the A100 80GB. That's
          | literally half
         | the AWS on-demand rate for A100s here:
         | https://aws.amazon.com/ec2/instance-types/p4/
         | 
          | And these are impossible to get. We tried to get some for
          | Anyscale, and we were told there was no on-demand capacity
          | available, and the lead time for reserved instances (ouchie on
          | the price! You're talking a quarter of a million dollars a year
          | for one machine at list) was weeks.
         | 
         | Once you take the model size and hefty sweetheart deals into
         | account, you're within 10%. Mosaic does have some nice whitebox
         | optimizations, but nothing that radically changes the equation.
        
           | fpgaminer wrote:
           | A100-40GB is like $1.10 on LambdaLabs, on demand. Their
           | availability is horrific on singles, but I've seen 8x
           | instances pop up more often than not. And you can rent A100s
           | for a buck a pop interruptible from other clouds, plenty of
           | availability. $2 doesn't seem like much of a sweetheart deal.
        
       | born-jre wrote:
       | RANDOM THOUGHT:
       | 
        | I wonder when we are getting Docker for LLMs ... a Modelfile?
       | 
       | FROM "PAAMA/16b"
       | 
       | APPLY "MNO/DATASET"
       | 
        | Each layer could be a LoRA-adapter-like thing, maybe.
        | 
        | Maybe when AI chips are finally here.
        
         | jjtheblunt wrote:
         | PyTorch tutorial looks similar (lower on the page)
         | 
         | https://pytorch.org/tutorials/beginner/pytorch_with_examples...
        
         | kristjansson wrote:
          | SQLFlow[0] looks sort of like that:
          | 
          |     SELECT *
          |     FROM iris.train
          |     TO TRAIN DNNClassifier
          |     WITH model.hidden_units = [10, 10],
          |          model.n_classes = 3,
          |          train.epoch = 10
          |     COLUMN sepal_length, sepal_width, petal_length, petal_width
          |     LABEL class
          |     INTO sqlflow_models.my_dnn_model;
         | 
         | No idea how well it works.
         | 
         | [0]: https://sql-machine-learning.github.io/
        
       | jncraton wrote:
       | > There's usually no need to go beyond 16-bit accuracy, and most
       | of the time when you go to 8-bit accuracy there is too much loss
       | of resolution.
       | 
       | I'm not sure this is accurate. From what I have seen, 8-bit
       | quantization is usually fine, and even 4-bit is a viable
       | tradeoff. Here are some benchmarks from TextSynth showing no
       | significant degradation between 16 and 8 bit:
       | 
       | https://textsynth.com/technology.html
       | 
       | 8-bit uses half as much memory and doubles the throughput for
       | limited quality loss.
        
         | superkuh wrote:
          | It's true if you're doing training. But for inference, severe
          | quantization is mostly okay. And there are some internal parts
          | of a transformer, when running inference with a quantized
          | model, where you might want to up-cast the low-bit inputs and
          | do the calculations in 16 bits, like the dot-product similarity
          | between vectors.
        
           | Jackson__ wrote:
           | Even that is being tackled by newer GPU architectures. For
            | example, NovelAI is currently training an LLM in fp8
           | precision, using H100 GPUs.[1]
           | 
           | [1] https://blog.novelai.net/anlatan-acquires-
           | hgx-h100-cluster-4...
           | 
           | https://blog.novelai.net/text-model-progress-is-going-
           | good-8...
        
             | superkuh wrote:
              | Cool stuff. I looked at
              | https://en.wikipedia.org/wiki/Hopper_%28microarchitecture%29
              | and I noticed that the fp8 support is only for the tensor
              | cores and not the CUDA side. Does that mean training with
              | an H100 GPU in fp8 mode would use some software ecosystem
              | that's not the vast existing CUDA one? Or am I just
              | misunderstanding CUDA cores vs tensor cores?
             | 
             | PS, as a joke, they should implement GPU fluint8 and get
             | baked in non-linearity for the activation function without
             | even using a non-linear function,
             | https://www.youtube.com/watch?v=Ae9EKCyI1xU ("GradIEEEnt
             | half decent: The hidden power of imprecise lines" by
             | suckerpinch)
        
         | qeternity wrote:
         | The problem with 8bit at the moment is massive performance
         | degradation with bitsandbytes. Recent improvements in 4bit
         | inference mean that 8bit is now a massive laggard (although
         | there's no reason not to expect this to resolve).
        
         | f_devd wrote:
          | The article is right: 8-bit (and especially 4-bit) is atypical
          | for deep learning models. How well it works depends heavily on
          | the number of parameters (larger models can handle more
          | quantization) and can even depend on specific training
          | hyperparameters (mainly dropout & weight decay, which can
          | induce sparsity).
        
           | int_19h wrote:
           | Thing is, even when the impact from 4-bit is substantial, the
           | larger parameter count it allows on the same hardware more
           | than makes up for it. E.g. llama-30b is better at 4-bit than
           | _any_ derivative of llama-13b, no matter how fine-tuned or
           | quantized.
        
         | waleedk wrote:
         | [Author] Fair point. Adjusted the language.
         | 
          | Nonetheless, people do tend to use 16-bit Hugging Face models,
          | and if you do go to 8 bits and the output is wrong, you're
          | never quite sure if it's the quantization or the model.
        
         | fzliu wrote:
         | AFAIK for over-parameterized models, performing quantization or
         | any other form of compression won't reduce accuracy by much
         | (don't quote me on this though).
        
       ___________________________________________________________________
       (page generated 2023-05-17 23:00 UTC)