[HN Gopher] Cerebras-GPT: A Family of Open, Compute-Efficient, L...
       ___________________________________________________________________
        
       Cerebras-GPT: A Family of Open, Compute-Efficient, Large Language
       Models
        
       Author : asb
       Score  : 341 points
       Date   : 2023-03-28 16:34 UTC (6 hours ago)
        
 (HTM) web link (www.cerebras.net)
 (TXT) w3m dump (www.cerebras.net)
        
       | ivanvas wrote:
        | Is it currently possible to fine-tune any of the foundation
        | models available on a few GB of unsupervised text?
        
       | simonw wrote:
       | Does the chinchilla recipe still hold today? I got the impression
       | that the LLaMA paper proposed a different result where throwing
       | far more tokens at the problem had a very meaningful impact, or
       | did I misunderstand that?
        
         | evanmays wrote:
          | There's discussion elsewhere in this thread about what
          | chinchilla actually means. I'll only compare it to llama.
         | 
          | Tldr; Chinchilla isn't wrong, it's just aimed at a different
          | goal than the llama paper's.
         | 
          | There are three hyperparameters to tweak here: model size
          | (parameter count), number of tokens pre-trained on, and amount
          | of compute available. End performance is, in theory, a function
          | of these three hyperparameters.
         | 
         | You can think of this as an optimization function.
         | 
         | Chinchilla says, if you have a fixed amount of compute, here's
         | what size and number of tokens to train for maximum
         | performance.
         | 
          | A lot of the time we have a fixed model size though, because
          | size impacts inference cost and latency. Llama operates in this
          | territory: they chose to fix the model size instead of the
          | amount of compute.
         | 
          | This could explain gaps in performance between Cerebras models
          | of size X and llama models of size X. Llama models of size X
          | have way more compute behind them.
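          | 
          | A tiny numerical sketch of the difference (this uses the common
          | C ~ 6*N*D FLOPs approximation and the ~20 tokens-per-parameter
          | Chinchilla ratio; the constants are ballpark, not from either
          | paper):
          | 
          |   def chinchilla_optimal(compute_flops):
          |       # fixed compute: pick params N and tokens D jointly,
          |       # since C = 6*N*(20*N) = 120*N^2
          |       n = (compute_flops / 120) ** 0.5
          |       return n, 20 * n
          | 
          |   def llama_style(params, tokens):
          |       # fixed model size: spend more compute via more tokens
          |       return 6 * params * tokens
          | 
          |   n, d = chinchilla_optimal(2.3e22)
          |   print(f"{n:.2e} params, {d:.2e} tokens")  # ~1.4e10, ~2.8e11
          |   print(f"{llama_style(7e9, 1e12):.2e}")    # ~4.2e22 FLOPs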
        
         | espadrine wrote:
         | I don't think it holds for two reasons.
         | 
         | First, it only holds for a given architecture and
         | implementation. Obviously, a different architecture will have a
         | different training slope. This is clear when comparing LSTM
         | with Transformers, but is also true between transformers that
         | use prenorm/SwiGLU/rotary-positional, and those that follow
         | Vaswani 2017.
         | 
         | In terms of implementation, some algorithms yield the same
         | result with fewer operations (IO, like FlashAttention and other
         | custom CUDA kernels, and parallelism, like PaLM, which both
         | came after Chinchilla), which unambiguously affect the Tflops
          | side of the Chinchilla equation. Also, faster algorithms and
          | better parallelization will reach a given loss sooner, while
          | less power-hungry setups will do so more cheaply.
         | 
          | Second, even in the original Chinchilla paper's figure 2, some
          | lines stop early before reaching the Pareto frontier (likely
          | because they ran out of tokens, though LLaMA makes it seem that
          | >1 epoch of training is fine).
        
       | whalesalad wrote:
       | This "AI spring" is really snowballing with the crazy nouns and
       | terminology. Alpaca, llama and now chinchilla??
        
         | murkt wrote:
         | Chinchilla actually came before alpaca and llama.
         | 
         | Every new variation of model gets some new name, just like
          | every library gets a new name. There were all kinds of BERTs
          | before - DistilBERT, RoBERTa, SciBERT, Schmobert, Schmuber,
          | etc. Many hundreds of them, I think.
        
         | whoisnnamdi wrote:
         | Chinchilla actually came first!
        
         | jhbadger wrote:
         | As mentioned, chinchilla is not part of this trend, and
         | chinchillas are rodents. Alpacas and llamas are South American
         | camelids (animals related to camels). So if additional names
         | are needed, I would expect them to be vicuna and guanaco, as
         | they are also in the group.
        
           | mometsi wrote:
           | I think the relevant category is "Adorable Fuzzy Critters of
           | the Andes". See also
           | https://en.wikipedia.org/wiki/Spectacled_bear
        
         | ramesh1994 wrote:
         | The term "chinchilla" predates llama/alpaca. It doesn't
         | directly map to a specific model, rather a family of compute-
         | optimal models.
        
       | mometsi wrote:
       | Summary: This is a company that makes AI accelerator ICs. They
       | reimplemented Chinchilla and released the model weights under a
       | permissive license.
        
         | bogwog wrote:
         | In other words, they're actually incentivized to help make LLMs
         | as accessible as possible, rather than try to keep them locked
         | up to hide them from competitors.
         | 
         | Which makes me wonder if Nvidia is doing anything with LLMs
         | too?
        
           | vintermann wrote:
            | NVidia has certainly been pushing the envelope on image
            | generation. StyleGAN3 was really cool when it came out.
           | 
           | But it is an issue that their chips are hardly optimized for
           | LLMs.
        
             | meghan_rain wrote:
             | How can a GPU be optimized for StyleGAN but not LLMs?
             | Serious question.
        
               | MacsHeadroom wrote:
               | RAM. GPT-3 is over 600GB, ie just the max RAM of 8xA100s,
               | because that's all the hardware can fit.
               | 
               | StableDiffusion plus a whole chain of imagenets can make
               | any visual imagery imaginable in 2GB of RAM. Meanwhile
               | 2GB of RAM barely runs a basic tiny text completion NN
               | that can't do anything intelligent. Text requires a lot
               | more parameters (and more memory/RAM) than images.
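                | 
                | Back-of-the-envelope, weights only (ignoring
                | activations and KV cache), just to show the gap:
                | 
                |   def weight_gb(params_billions, bytes_per_param):
                |       # 1e9 params * bytes / 1e9 bytes-per-GB
                |       return params_billions * bytes_per_param
                | 
                |   print(weight_gb(175, 4))  # GPT-3, fp32: ~700 GB
                |   print(weight_gb(175, 2))  # GPT-3, fp16: ~350 GB
                |   print(weight_gb(1, 2))    # ~1B-param image model: ~2 GB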
        
               | brucethemoose2 wrote:
               | The Cerebras node's actual "RAM" (the 40GB of SRAM) is
               | pretty modest too, but being an _enormous_ chip with the
               | networked storage pools is certainly a better situation
               | than a bunch of A100s reaching out to every other A100.
               | 
               | Honestly, all the AI ASIC makers drastically
               | underestimated the RAM requirements of future models.
               | Graphcore's 4GB and Tenstorrent's 8GB per IC is kinda
               | laughable, and it takes them longer to adjust than
               | Nvidia. And Cerebras' original pitch was "fit the entire
               | model into SRAM!"
        
           | option wrote:
            | obviously yes, there are larger (20B) models trained on more
            | tokens that have been released by NVIDIA under a commercial-
            | friendly license on HF
           | 
           | https://huggingface.co/nvidia/nemo-megatron-gpt-20B
        
       | antimatter15 wrote:
       | Looking at their charts it seems like their 6.7B model is
       | considerably worse than GPT-J which is an existing open 6B model
       | from several years ago.
       | 
        | I wish that, rather than stopping training early, they had run
        | more data through a small model, so we could have something more
        | competitive with LLaMA 7B.
        
         | cs-fan-101 wrote:
          | Someone already posted this from the Cerebras Discord earlier,
          | but sharing again for visibility -
         | 
         | "We chose to train these models to 20 tokens per param to fit a
         | scaling law to the Pile data set. These models are optimal for
         | a fixed compute budget, not necessarily "best for use". If you
         | had a fixed parameter budget (e.g., because you wanted to fit
         | models on certain hardware) you would train on more tokens. We
         | do that for our customers that seek that performance and want
         | to get LLaMA-like quality with a commercial license"
        
           | HanClinto wrote:
           | Sounds like we should crowd-fund the cost to train and open
           | source one of these models with LLaMa-like quality.
           | 
           | I'd chip in!
        
             | brucethemoose2 wrote:
             | TBH that seems like a good job for Cerebras.
             | 
              | There are plenty of such efforts, but the organizer needs
              | some kind of significance to attract a critical mass, and
              | an AI ASIC chip designer seems like a good candidate.
             | 
             | Then again, maybe they prefer a bunch of privately trained
             | models over an open one since that sells more ASIC time?
        
           | brucethemoose2 wrote:
           | > Cerebras Discord
           | 
           | This is really weird to hear out loud.
           | 
           | I still think of Discord as a niche gaming chatroom, even
           | though I know that (for instance) a _wafer scale IC design
           | company_ is hosting a Discord now.
        
       | simonw wrote:
        | You can try out some of these models on Hugging Face here:
       | https://huggingface.co/cerebras/Cerebras-GPT-1.3B
       | 
       | That was the largest that had inference enabled - I'd really like
       | to try this one: https://huggingface.co/cerebras/Cerebras-GPT-13B
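        | 
        | If you'd rather run one locally, something along these lines
        | should work (a sketch, assuming the transformers and torch
        | packages are installed):
        | 
        |   from transformers import AutoModelForCausalLM, AutoTokenizer
        | 
        |   name = "cerebras/Cerebras-GPT-1.3B"
        |   tok = AutoTokenizer.from_pretrained(name)
        |   model = AutoModelForCausalLM.from_pretrained(name)
        | 
        |   inputs = tok("Generative AI is ", return_tensors="pt")
        |   out = model.generate(**inputs, max_new_tokens=50, do_sample=True)
        |   print(tok.decode(out[0]))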
        
       | amilios wrote:
       | Comparing the 13B model here
       | https://huggingface.co/cerebras/Cerebras-GPT-13B to LLaMA-13B
       | https://github.com/facebookresearch/llama/blob/main/MODEL_CA...
       | you can see that in all of the reasoning tasks Cerebras-GPT lags
       | behind. Any reason to use Cerebras instead of LLaMA? Doesn't seem
       | like it.
        
         | option wrote:
          | it lags behind because, according to their blog post, it was
          | trained on <300B tokens. The LLaMAs, as far as I know, were
          | trained on more than a trillion.
        
           | gpm wrote:
           | The LLaMa paper says 1 trillion for the smaller models (7B,
           | 13B) and 1.4 trillion for the larger models (30B, 65B)
        
           | [deleted]
        
         | mdagostino wrote:
         | LLaMA is non-commercial
        
         | potatoman22 wrote:
         | Can the LLaMA weights be used for commercial products?
        
           | espadrine wrote:
           | There are two aspects to it.
           | 
           | The first one is whether they would actually sue. The optics
           | would be terrible. A similar situation occurred in the 90s
           | when the RC4 cipher's code was leaked. Everyone used the
           | leaked code pretending that it was a new cipher called
           | arc4random, even though they had confirmation from people
           | that licensed the cipher that its output was identical.
           | Nobody was sued, and the RSA company never acknowledged it.
           | 
           | The second one is related to the terms. The LLaMA weights
           | themselves are licensed under terms that exclude commercial
           | use:[0]
           | 
           | > _You will not [...] use [...] the Software Products (or any
           | derivative works thereof, works incorporating the Software
           | Products, or any data produced by the Software), [...] for
           | [...] any commercial or production purposes._
           | 
           | But the definition of derivative works is gray. AFAIK, if
           | LLaMA is distilled, there is an unsettled argument to be had
           | that the end result is not a LLaMA derivative, and cannot be
           | considered copyright or license infringement, similar to how
           | models trained on blog articles and tweets are not infringing
           | on those authors' copyright or licensing. The people that
           | make the new model may be in breach of the license if they
           | agreed to it, but maybe not the people that use that new
           | model. Otherwise, ad absurdum, a model trained on the
           | Internet will have content that was generated by LLaMA in its
           | training set, so all models trained on the Internet after Feb
           | 2023 will break the license.
           | 
           | IANAL, but ultimately, Meta wins more by benefiting from what
           | the community contributes on top of their work (similar to
           | what happened with React), than by suing developers that use
           | derivatives of their open models.
           | 
           | [0]: https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2
           | jP4Z...
        
           | gpm wrote:
           | Unclear, likely jurisdiction dependent, almost certainly not
           | if you need to operate world wide.
        
       | binarymax wrote:
       | Here are the zero-shot accuracy numbers posted in the Huggingface
       | evaluations for Cerebras-GPT 13B vs. the results of LLaMa 13B in
        | their paper:
        | 
        |   Task        LLaMa 13B  Cerebras-GPT 13B
        |   BoolQ       78.1       -
        |   PIQA        80.1       76.6
        |   SIQA        50.4       -
        |   HellaSwag   79.2       51.3
        |   WinoGrande  73.0       64.6
        |   ARC-e       74.8       71.4
        |   ARC-c       52.7       36.7
        |   OBQA        56.4       28.6
        
         | wsgeorge wrote:
         | I guess it's something. It still goes to show how far open
         | models are behind the proprietary SOTA.
        
           | topynate wrote:
           | From their discord:
           | 
           | > It would be interesting to know why you chose those FLOPS
           | targets, unfortunately it looks like the models are quite
           | under pre-trained (260B tokens for 13B model)
           | 
           | > We chose to train these models to 20 tokens per param to
           | fit a scaling law to the Pile data set. These models are
           | optimal for a fixed compute budget, not necessarily "best for
           | use". If you had a fixed parameter budget (e.g., because you
           | wanted to fit models on certain hardware) you would train on
           | more tokens. We do that for our customers that seek that
           | performance and want to get LLaMA-like quality with a
           | commercial license
           | 
           | Which is the point made elsewhere in these comments, e.g.
           | https://news.ycombinator.com/item?id=35344192, and also
           | usefully shows how open Cerebras are. They're pretty open,
           | but not as much as they would be if they were optimising for
           | filling in other companies' moats.
        
           | binarymax wrote:
           | Indeed but this is zero-shot performance. Fine-tuning for a
           | task should get you pretty good results. I'm interested in
           | seeing the results of an Alpaca method against this Cerebras
           | 13B model.
        
             | MacsHeadroom wrote:
             | >I'm interested in seeing the results of an Alpaca method
             | 
             | You're talking apples to oranges. The "Alpaca method" is a
             | dataset generation method. Nothing about Alpaca's training
             | method is novel, interesting, or efficient. Alpaca used the
             | same standard training method everyone else uses, A100
             | clusters.
             | 
             | If you mean LoRA/PEFT training which people used to
             | replicate Alpaca then that is also apples to oranges
             | because LoRA/PEFT is a finetuning method not a pre-training
             | method.
        
               | [deleted]
        
             | Vetch wrote:
             | Base model performance is what's most important and also
             | impacts fine-tuning quality. Practically, a model that's
             | good out of the box with minimal fine-tuning is also useful
             | to more people. Since they focused on being training
             | compute optimal for some budget, expect their models to lag
             | behind Llama overall. Their 6.7B version should lag behind
             | GPT-J, assuming 20 tokens per parameter.
             | 
             | The Pythia models are also worth checking out, they might
             | be better than or matched to CerebrasGPTs at each size
             | (although they warn it is not intended for deployment).
             | 
             | Conclusion: the landscape of top open models remains
             | unchanged.
        
       | rnosov wrote:
       | I might be missing something but it looks to me that actually
       | running this "open" model requires special hardware only
       | accessible with a cloud subscription with 60 000 USD / week
       | minimum spend[1]. Can anyone confirm if you can run it on your
       | own hardware? If software is open but hardware is locked I don't
       | see the point.
       | 
       | [1] https://www.hpcwire.com/2021/09/16/cerebras-wafer-scale-
       | engi....
       | 
       | EDIT: Ok, looks like I've missed the hugging face repo. The
       | language they use is a bit confusing.
        
         | bubblethink wrote:
         | You can run inference on GPUs. These are just models and
         | weights.
        
         | simonw wrote:
         | The PyTorch model files are already available to download from
         | Hugging Face - the largest one looks to be 52GB. They should
         | run on any hardware that can run regular PyTorch models.
        
       | JamesCoyne wrote:
       | Slightly off-topic:
       | 
       | I remember seeing news about the enormous chip Cerebras was/is
       | selling (pdf
       | https://f.hubspotusercontent30.net/hubfs/8968533/WSE-2%20Dat...).
       | 
       | Has there been any indication that the LLMs released in the last
       | few months use exotic hardware like this, or is it all "standard"
       | hardware?
        
         | wmf wrote:
         | OpenAI uses Nvidia GPUs and Google uses their TPUs.
        
         | ipsum2 wrote:
          | Everyone except Google uses Nvidia for training. Cerebras,
          | Gaudi, and other custom AI accelerators have been unable to
          | surpass Nvidia in performance/$ and performance/watt yet.
        
         | brucethemoose2 wrote:
         | You might see more training on Intel XPUs when they come out,
         | since they have such enormous RAM pools. Maybe AMD MI300s and
         | Intel Ponte Vecchio (both 128GB) in the shorter term, though I
         | think they will mostly be in HPC supercomputers instead of
         | cloud instances.
        
       | eternalban wrote:
       | > It takes substantial technical expertise to train very large
       | models on GPUs. In the recently released GPT-4 Technical Report,
       | OpenAI credits over thirty contributors just for compute
       | infrastructure and scaling.
       | 
       | This is called a silver lining for some (in case you were worried
       | about gpt taking your job). Privacy requirements alone will in
       | the near term force major companies to run their own inference
        | (if not training). The expertise required is nearly identical to
        | that of running large-scale distributed computational graphs.
       | 
        | This is an interesting divergence from what happened with the
        | web. There, the backends started out simple, before map-reduce
        | and before deconstructing databases and processing distributed
        | logs. With ML, we'll jump right into the complex backends, in
        | tandem with easy-pickings early-stage edge applications (which
        | we see daily on HN).
        
       | 2bitencryption wrote:
       | This type of article (or press release, or whatever you want to
       | call it) is exactly what makes the future so interesting.
       | 
       | The cat is out of the bag, the genie is out of the bottle, the
       | confetti has left the cannon[0].
       | 
       | It's tempting to see a world dominated by Google Bard, ChatGPT,
       | Bing Search, etc. And no doubt, they will be huge players, with
       | services that are far more powerful than anything that can be run
       | on the edge.
       | 
       | But. BUT. The things that we can do on the edge are _incredible_
       | now. Just imagine a year from now, or two. These earth-shattering
       | models, which seem to be upending a whole industry, will soon
       | have equivalents that run on the edge. Without services spying on
       | your data. Without censorship on what the model can /cannot say.
       | Because it's all local.
       | 
       | When was the last time this happened? There _will_ be players who
       | publish weights for models that are free to use. The moment that
        | torrent magnet link is published, it's out in the wild. And
        | smart people _will_ package them as "one click installers" for
       | people who aren't tech-savvy. This is already happening.
       | 
       | So every time you're amazed by something chat-gpt4 says, remember
       | that soon this will be in your pocket.
       | 
       | [0] the "confetti" idiom brought to you by chat-gpt4.
        
         | jazzkingrt wrote:
         | Serious question: is it typical to describe client-side
         | computing as "on the edge"?
         | 
         | I thought running something on the edge referred to running it
         | in close network proximity to the user, rather than users
         | having control and running things themselves.
        
           | dragonwriter wrote:
           | > Serious question: is it typical to describe client-side
           | computing as "on the edge"?
           | 
            | Somewhat; it's consistent with, e.g., Google's "Edge TPU"
           | designation for its client-side neural processors.
           | 
           | > I thought running something on the edge referred to running
           | it in close network proximity to the user
           | 
           | Typically, but on the client device _is_ the limit-case of
           | "close network proximity to the user", so the use is
           | consistent.
        
           | capableweb wrote:
           | Yes, "edge computing" can refer to both computing done as
           | close to the user as possible geographically, or even on the
           | device itself. If someone says "I wanna do edge computing"
           | it's not clear enough to know if they just want to have
           | servers they control as close to the user as possible, or do
           | the computing on the device itself. I think Apple would say
           | "edge computing" is on the actual device while CloudFlare
           | would say "edge computing" is on their infrastructure, but
           | distributed to be physically closer to the end user.
        
             | iamerroragent wrote:
             | I guess I've been out of the loop for a bit and didn't
             | realize that "edge computing" became a term since cloud
             | computing took off.
             | 
              | It is kind of cyclical then, is it not?
             | 
              | By that I mean computers used to be shared and logged into
              | through a terminal.
             | 
             | Then the PC came around.
             | 
              | Then about 15 years ago cloud computing became the rage
              | (really an extension of, or a more sophisticated system
              | than, the first time-shared computers).
             | 
             | Now we're back to local computing. I even see more self
             | hosting and moving away from cloud due to costs.
             | 
              | All that rant is to say: it's interesting.
             | 
              | Side note: I imagine getting this AI to run as locally as
              | possible will be really useful in the medical industry,
              | because it helps alleviate HIPAA requirements.
        
               | nordsieck wrote:
               | > It is kind of cyclical then is not?
               | 
               | > By that I mean computers used to be shared and to log
               | into it through a terminal.
               | 
               | > Then the PC came around.
               | 
               | > Then about 15 years ago Cloud computing became the rage
               | (really an extension or more sophisticated system than
               | the first time shared computers)
               | 
                | There's a really neat article called "The Eternal
                | Mainframe"[1] that you might be interested in. It
                | explores this idea in greater depth.
               | 
               | ---
               | 
               | 1. http://www.winestockwebdesign.com/Essays/Eternal_Mainf
               | rame.h...
        
               | iamerroragent wrote:
               | Thanks, that was an interesting read!
               | 
               | I wonder if the author's perspective has changed with
               | regards to freedom to compute.
               | 
                | Social media is often used as an example of privacy
                | invasion, though I've failed to see why concerns over
                | Facebook handling your private data are so worrying when
                | they don't have a product you need to have.
               | 
               | Email on the other hand, is pretty much a necessity today
               | so privacy concerns are vital there imo. Of course you
               | can host your own server whereas you can't host your own
               | Facebook.
        
           | wsgeorge wrote:
           | I believe this has been extended to mean "on device", which
           | is interesting. See Gerganov's article on Github [0]. I wrote
           | about this here [1] where I made a contrast between the core
           | and the edge. I think the term maps well to this meaning.
           | 
           | What I find more interesting is that in the classic "close
           | network proximity", some parts of the world may not have
           | benefited as much from that trend since the closest nodes of
           | a global delivery network could be several countries away.
           | 
           | [0] https://github.com/ggerganov/llama.cpp/discussions/205
           | 
           | [1] https://medium.com/sort-of-like-a-tech-diary/consumer-ai-
           | is-...
        
             | TeMPOraL wrote:
             | > _I believe this has been extended to mean "on device",
             | which is interesting._
             | 
             | I don't like the connotations this carries. This is almost
             | openly talking about reaching all the way into peoples'
             | hardware to run your software, for your benefit, on them,
             | without their knowledge, consent or control...
        
               | wsgeorge wrote:
               | I see. Hadn't considered this. Yes, I see how that might
               | be a concern.
               | 
               | What I think is important in this AI Spring is that we
               | make it possible for people to run their own models on
               | their own hardware too, without having to submit anything
               | to a large, centralised model for inference.
        
         | matchagaucho wrote:
         | An LLM running locally providing type-ahead completions seems
         | inevitable.
        
         | lioeters wrote:
         | Yes, yes, and yes. I'm waiting for an actually open AI that can
         | run on the edge, purely on commodity hardware like our laptops
         | and phones - it's inevitable.
         | 
         | I imagine this "cat out of the bag" situation, the
         | democratization and commodification of powerful technology
         | accessible and affordable to the public, is similar to what's
         | happening with single-board computers and microcontrollers like
         | Raspberry Pi, Arduino, ESP32.
         | 
         | It might be similar to what happened with mobile phones, but
         | there the power was quite restricted. The (mostly) duopoly of
         | iOS and Android, with devices and apps locked down in various
         | ways. Sure we can "jail break" and "root" our phone, but that's
         | not for the general public.
         | 
         | Maybe solar energy production is going through a similar
         | process, with panels and batteries becoming more efficient and
         | affordable every year.
         | 
         | Certainly, it reminds one of the history of personal computers,
         | the way such a powerful general-purpose tool became ubiquitous
         | and local.
        
           | akiselev wrote:
           | After using ChatGPT 4 extensively for a few days, I think
           | we're probably only a few years away from the first
           | generation of truly conversational assistants ala Jarvis in
           | Iron Man. Between LangChain and existing voice recognition
            | software, we're already 95% of the way there; it just needs
           | to be packaged up into a UI/UX that makes sense.
           | 
           | These local models are absolutely critical for that to happen
           | though. I'm hitting daily situations where I have to
           | reconsider my use of ChatGPT because what I'm asking would
           | leak very private personal information or somehow trip its
           | morality filter.
           | 
           | Just swapped in a 2TB nvme for a separate /home mount and
           | reinstalled Arch just so I could have enough disk space to
           | test a dozen models locally. I'm so ready!
        
             | barking_biscuit wrote:
             | Only a few years? Nobody can predict accurately in years
             | anymore. Feels more like "only a few months" away.
        
             | meghan_rain wrote:
             | I pray to the AI gods that OpenAI will fail at calibrating
             | the censorship layer and will continue to overcensor, which
             | in turn will hopefully lead to many usecases requiring
             | local LLMs, which in turn would increase the incentive to
             | build them.
        
           | visarga wrote:
           | > the democratization and commodification of powerful
           | technology accessible and affordable to the public
           | 
           | Empowering everyone equally, very low barrier of entry. The
           | spread of AI will make some skills become less important and
           | other skills more important. There will be painful social
            | changes. Memorising stuff, being exact, being careful,
            | being diligent - all these things can be done by an LLM. But
            | so can creative recombination of ideas. For now we still hold
            | supremacy in long-form content, complex software projects and
           | detecting inconsistencies, but they could be emergent
           | qualities in GPT-5.
        
           | sbierwagen wrote:
           | If you thought crypto miners drove up the prices of GPUs,
           | just wait until a gaming card can host a person who can get a
           | job.
           | 
            | Integer-quantized LLaMA can fit on a 4090, and isn't quite
            | human-level. If a fully aligned, reliably non-hallucinating
            | LLM can fit on the next-generation 5-series GPUs, then expect
            | their market price to quadruple.
        
         | hiAndrewQuinn wrote:
         | I for one dream of a future without maps. I want to walk
         | through a distant forest to find an ancient, unconnected ESP-32
         | in the bark of a tree containing a tiny specialized AI that can
         | only tell me about things relevant to the area, how far to walk
         | upstream to the nearest town. And only if I can find it and
         | scan an RFID tag to wake it up.
        
           | vinc wrote:
           | A beautiful dream!
           | 
           | > I like to think (right now please!) of a cybernetic forest
           | filled with pines and electronics where deer stroll
           | peacefully past computers as if they were flowers with
           | spinning blossoms.
        
         | t_minus_2 wrote:
          | The cat is out of the bag, the genie is out of the bottle, the
          | confetti has left the cannon, the ship has sailed, the horse
          | has bolted, the toothpaste is out of the tube, the beans have
          | been spilled, the train has left the station, the die is cast,
          | the bell has been rung.
        
         | cjf101 wrote:
         | Yes, this is true. But, I worry about how long it will take for
         | the utility of "GPT-4" on my phone to be close enough to
         | whatever is only possible through models running on large cloud
         | platforms to make that choice relatively drawback free.
         | 
         | Is the curve of what this class of algorithms can provide
          | sigmoid? If so, then yeah, eventually researchers should be able
          | to democratize it sufficiently that the choice to use versions
          | that can run on private hardware is rational. But if the
         | utility increases linearly or better over time/scale, the
         | future will belong to whoever owns the biggest datacenters.
        
         | hintymad wrote:
         | I'd go one step further if it is not happening yet: smaller
         | companies should really pool their resources to train open
         | LLMs. Say, form a consortium and work with the open source
         | community to build ChatGPT-equivalent. Companies will be crazy
         | to assume that they can hand their future to the APIs offered
         | by a handful of companies during this monumental technological
         | paradigm shift in history.
         | 
          | That is, a real OpenAI with an open governing body.
        
         | yieldcrv wrote:
         | > And smart people will package them as "one click installers"
         | for people who aren't tech-savvy. This is already happening.
         | 
         | Any projects I can follow? Because I haven't seen any one click
         | installers yet that didn't begin with "first install a package
         | manager on the command line"
        
           | slickdork wrote:
           | Not an llm but this 1 click installer for stable diffusion is
           | literally a 1 click installer. Impressively works.
           | 
           | https://github.com/cmdr2/stable-diffusion-ui
        
         | simon83 wrote:
         | Google: "confetti has left the cannon"
         | 
         | > No results found for "confetti has left the cannon".
         | 
         | I'm amazed that a "stochastic parrot" can come up with such a
         | beautiful idiom.
        
           | visarga wrote:
           | That name aged like milk. First of all, what you said. And
           | second - a parrot can make more parrots without human help,
           | language models can't make GPU chips. Insulting for both LLMs
           | and parrots.
        
             | educaysean wrote:
              | Language models exist not in the physical world but the
              | digital one. A more apt comparison would be whether LLMs
              | can compile new (or slightly modified) LLMs, and I believe
              | the answer will soon be (if not already) a resounding yes.
        
               | visarga wrote:
               | I agree, LLMs can generate the source code for their
               | network, can generate the training data, supervise a
               | model being trained and explain step by step what they
               | do. That's why I specifically said they can't make GPUs,
               | if they could, they would be self replicators. We'd have
               | to consider them alive.
        
             | groestl wrote:
             | > language models can't make GPU chips
             | 
             | Not yet. I'm sure somebody is taking a hard look already.
        
           | flir wrote:
           | Could be lifted from another language. Is nice though.
        
           | barking_biscuit wrote:
           | Out of distribution generations are a thing.
        
             | simon83 wrote:
             | I understand that all of this is based on some fundamental
             | mathematics, a couple of algorithms here, and some
             | statistical analysis there. And I find it absolutely
             | amazing that we can use all of that knowledge and encode it
              | into something that resembles intelligence. It makes me
              | wonder whether our brains and the concept of intelligence
              | are really as special and mysterious as we assume.
        
           | athom wrote:
           | Try looking up "pinkie pie party cannon"
        
           | tjr wrote:
           | Not exactly the same, but from:
           | 
           | https://www.instagram.com/p/CQdBiVyh5C2/?hl=en
           | 
           |  _Now that the cat is out of the bag, or, should I say the
           | confetti is out of the... can?_
        
           | 2bitencryption wrote:
           | For the curious, here was the conversation I had:
           | 
           | ME: What's another saying similar to: "The cat is out of the
           | bag" or "The genie is out of the bottle"?
           | 
           | chatgpt: Another similar saying is "the toothpaste is out of
           | the tube." Like the other two expressions, this phrase
           | conveys the idea that something has been revealed or
           | unleashed, and it cannot be undone or put back into its
           | original state.
           | 
           | ME: Can you invent a new phrase that has the same meaning,
           | which is completely new?
           | 
           | chatgpt: Certainly! Here's a new phrase with a similar
           | meaning: "The confetti has left the cannon." This phrase
           | suggests that once something is out in the open or has
           | happened, it's impossible to reverse or contain it, much like
           | confetti that has been shot out of a cannon during a
           | celebration.
           | 
           | ME: What's the name for this kind of expression? Aphorism?
           | Fable?
           | 
           | chatgpt: This kind of expression is known as an idiom. Idioms
           | are phrases or expressions with a figurative meaning that is
           | different from their literal meaning. They are commonly used
           | in everyday language to convey a specific message or
           | sentiment in a more colorful or memorable way.
        
             | Tommstein wrote:
             | That is impressive, especially given that neither Google
             | nor DuckDuckGo find any previous trace of that idiom.
        
             | johtso wrote:
             | Not exactly sure why it would be surprising that it can
             | come up with a convincing idiom when it can produce
             | remarkably good _poetry_
        
             | bilsbie wrote:
             | I don't understand why people aren't more impressed with it
             | clearly understanding and then even inventing idioms. That
             | shows some real intelligence.
        
               | fnordpiglet wrote:
               | It's because they're confused in thinking human
               | intelligence isn't learned stochastic expectation.
        
               | gary_0 wrote:
               | That's how the brain evolved, after all: take incoming
               | input, process according to previous patterns, output a
               | reaction to peripheral nervous system that will enhance
               | survival. The evolutionary purpose of a cortex is to
               | store a statistical model of an animal's environment
               | based on past stimulus. Human intelligence is just a
               | complicated permutation of these initial systemic
               | parameters.
        
         | [deleted]
        
       | johnchristopher wrote:
       | OT: I don't know about their scaling strategy for LLM but their
       | scaling strategy for displaying pictures is disappointing.
       | 
       | (it's all blurry)
        
         | lanshanbob wrote:
         | I think it's fixed now
        
         | ricopags wrote:
         | Came here to point this out, though not as pithily :D
         | 
         | Really, really bad mark on whoever is in charge of their web
         | marketing. Images should never look that bad, not even in
         | support, but definitely not in marketing.
         | 
         | edit: so this post is more useful, 4k res using Edge browser
        
         | Kelamir wrote:
         | Last time I viewed it, I believe it wasn't blurry. Perhaps to
         | scale the traffic the images are now displayed in lower
         | quality?
         | 
         | But I'm not sure anymore that it wasn't initially blurry...
         | Perhaps I'm hallucinating, like large language models.
         | 
         | Current image displayed is https://www.cerebras.net/wp-
         | content/uploads/2023/03/Scaling-... , will see if it changes.
        
           | Kelamir wrote:
           | I can confirm, it does change. As of now, it displays one of
           | higher quality: https://www.cerebras.net/wp-
           | content/uploads/2023/03/Scaling-...
        
         | thewataccount wrote:
         | They're dynamically scaled and something must be broken. If you
         | inspect source you can find the raw images, here's a few:
         | 
         | https://www.cerebras.net/wp-content/uploads/2023/03/Downstre...
         | 
         | https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-...
         | 
         | https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-...
         | 
         | EDIT: Looks like it scores better with less training - up until
         | it matches GPT-J/Pythia/OPT and doesn't appear to have much
          | benefit. It maybe scores slightly better than GPT-J, which is
          | pretty "eh" - I'm not sure if GPT-J-level performance is really
         | useful for anything? NeoX 20B outperforms it in everything if
         | you don't care about the amount of training needed.
         | 
         | Does the better performance for less training matter if that
          | benefit only applies when it's performing a lot worse than
          | GPT-J? It appears to lose its scaling benefits before the
         | performance is interesting enough to matter?
        
           | [deleted]
        
           | johnchristopher wrote:
           | Disabling ublock on Firefox fixes the issue for me.
           | 
            | edit: scratch that, it seems the AJAX endpoint returns 504
            | more often than not.
        
             | thewataccount wrote:
             | Yeah it seems hit/miss depending on the window size and
             | stuff. My guess is whatever is supposed to decide what size
             | to request is messed up.
        
       | ftxbro wrote:
       | > Our paper, which will be available soon, will detail our
       | training methods and performance results.
       | 
       | Yay there will be a paper let's gooooooo!
        
       | chessgecko wrote:
       | I wonder what led to such a gap between llama 7b and Cerebras
       | 13b. I hope they discuss it in the paper.
        
         | gpm wrote:
         | Is there a benchmark comparing the two that I missed?
         | 
         | Edit: The huggingface page has 0-shot benchmarks which you can
         | compare against the llama paper
         | 
         | https://huggingface.co/cerebras/Cerebras-GPT-13B
         | 
         | https://arxiv.org/pdf/2302.13971.pdf
        
           | freeqaz wrote:
           | I'm on mobile and struggled to compare these two tables
           | properly. Would you mind posting a summary of your findings?
           | 
            | Here are some values but I don't know what they mean. LLaMA
            | 65B on the left, Cerebras 13B on the right.
            | 
            | PiQA: 82.8 / 76.6
            | WinoGrande: 77.0 / 64.6
            | ARC-e: 78.9 / 71.4
        
             | gpm wrote:
             | Really short summary: LLaMa is better, even smaller LLaMa
             | models.
             | 
              | Benchmark    Cerebras 13B  LLaMA 7B  LLaMA 13B  LLaMA 65B
              | HellaSwag    51.3          76.1      79.2       84.2
              | PIQA         76.6          79.8      80.1       82.8
              | WinoGrande   64.6          70.1      73.0       77.0
              | ARC-e        71.4          72.8      74.8       78.9
              | ARC-c        36.7          47.6      52.7       56.0
              | OpenBookQA   28.6          57.2      56.4       60.2
        
         | ftxbro wrote:
         | This gap makes sense to me. The academic point of the Cerebras
         | paper is to show their nice empirical scaling law for compute-
         | optimal training, whereas the academic point of the LLaMA paper
         | was to show that you can make small models punch above their
         | weight by training them in a way that is deliberately not
         | compute-optimal. Of course both of those publications had other
         | academic and marketing purposes.
         | 
         | From the Cerebras blog post: "Trained using the Chinchilla
         | formula, these models provide the highest accuracy for a given
         | compute budget."
         | 
         | From the LLaMA paper: "The focus of this work is to train a
         | series of language models that achieve the best possible
         | performance at various inference budgets, by training on more
         | tokens than what is typically used."
        
       | visarga wrote:
       | Of course this is great news, I hope these models can be fine-
       | tuned to be like lighter versions of chatGPT. But I remember
       | reading in the LLaMA paper that a small model can still improve
       | when trained more than the Chinchilla budget.
       | 
       | > For instance, although Hoffmann et al. (2022) recommends
       | training a 10B model on 200B tokens, we find that the performance
       | of a 7B model continues to improve even after 1T tokens.
       | 
       | Cerebras says:
       | 
       | > For instance, training a small model with too much data results
       | in diminishing returns and less accuracy gains per FLOP
       | 
        | But this is only a concern when you care about the training
        | cost, such as when you are a budget-limited researcher or a
        | company that doesn't deploy models at scale. When you care about
        | the total cost of deployment, making a small model even better
        | with lots of data is a smart move. In the end it matters more to
        | have the most efficient model at inference, not the most
        | efficient model to train.
        
       | skybrian wrote:
       | What's in the Pile training data they used? How much source code
       | does it include?
        
       | Garcia98 wrote:
       | I've been following open source LLMs for a while and at first
       | glance this doesn't seem too powerful compared to other open
       | models, Flan-Alpaca[0] is licensed under Apache 2.0, and it seems
        | to perform much better. Although I'm not sure about the
        | legalities of that licensing, since it's basically Flan-T5
       | fine-tuned using the Alpaca dataset (which is under a Non-
       | Commercial license).
       | 
       | Nonetheless, it's exciting to see all these open models popping
        | up, and I hope that an LLM equivalent of Stable Diffusion comes
        | sooner rather than later.
       | 
       | [0]: https://github.com/declare-lab/flan-alpaca
        
         | ftxbro wrote:
         | Their goal isn't to make a powerful model. It's to show how
         | well compute-optimal models do on test-loss as a function of
         | increasing model size. This function can be used with some
         | caveats to forecast the test-loss of larger models for which
         | compute-optimality becomes more important.
        
         | alchemist1e9 wrote:
         | Sounds like you might be the right person to ask the "big"
         | question.
         | 
         | For a small organization or individual who is technically
         | competent and wants to try and do self-hosted inference.
         | 
          | What open model is showing the most promise, and how do its
          | results compare to the various OpenAI GPTs?
         | 
         | A simple example problem would be asking for a summary of code.
          | I've found OpenAI's GPT 3.5 and 4 give pretty impressive
          | English descriptions of code. Running that locally in batch
         | would retain privacy and even if slow could just be kept
         | running.
        
           | Garcia98 wrote:
           | Google's Flan-T5, Flan-UL2 and derivatives, are so far the
           | most promising open (including commercial use) models that I
           | have tried, however they are very "general purpose" and don't
           | perform well in specific tasks like code understanding or
           | generation. You could fine-tune Flan-T5 with a dataset that
           | suits your specific task and get much better results, as
           | shown by Flan-Alpaca.
           | 
           | Sadly, there's no open model yet that acts like a Swiss knife
           | and gets good-enough results for multiple use cases.
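            | 
            | If anyone wants to poke at the baseline before fine-tuning,
            | here's a minimal sketch (assumes transformers + torch, and
            | the public google/flan-t5-large checkpoint):
            | 
            |   from transformers import AutoTokenizer
            |   from transformers import AutoModelForSeq2SeqLM
            | 
            |   name = "google/flan-t5-large"
            |   tok = AutoTokenizer.from_pretrained(name)
            |   model = AutoModelForSeq2SeqLM.from_pretrained(name)
            | 
            |   prompt = "Summarize: def add(a, b): return a + b"
            |   ids = tok(prompt, return_tensors="pt").input_ids
            |   out = model.generate(ids, max_new_tokens=64)
            |   print(tok.decode(out[0], skip_special_tokens=True))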
        
       | brucethemoose2 wrote:
       | FYI: Cerebras's nodes are _very_ different than your typical
       | Nvidia training nodes:
       | 
       | https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...
       | 
       | Each individual "chip" has _40GB of SRAM_ vs ~76MB for the Nvidia
        | H100, plus networked pools of external RAM, SSDs and such. That's
        | why the training architecture is so different.
        
         | arbuge wrote:
         | https://www.cerebras.net/product-chip/
         | 
         | There's a comparison picture there of one of their chips
         | alongside a regular GPU chip. Effectively they use up the
         | entire wafer.
        
           | brucethemoose2 wrote:
           | Yeah, and that doesn't even do the nutty IO on these things
           | justice.
           | 
           | A 16x CS2 cluster like they describe is like a huge Nvidia
           | cluster in terms of throughput, but more like a single Nvidia
           | node structurally.
        
             | ipsum2 wrote:
             | Cerebras makes impressive hardware, but Nvidia still
             | performs better in every regard. The most telling factor is
             | that Cerebras claims they're too busy to run common
             | benchmarking (e.g. MLPerf) to compare against Nvidia.
        
             | alchemist1e9 wrote:
             | It's unbelievable stuff. Does anyone know how much a single
             | box costs? They are selling them it looks like.
        
               | freeqaz wrote:
               | If you have to ask, you can't afford it!
               | 
               | Mostly teasing but my guess would be $500k+ since they'll
               | likely price it so that it is the same $ as the
               | equivalent NVIDIA cluster (or very close to it).
        
               | alchemist1e9 wrote:
               | Actually if they are around $2M looks like my company can
               | afford one. Given this is just getting started it looks
               | promising as I'm sure future generations will be more
               | affordable.
        
               | ericd wrote:
               | On the order of several million USD for the second gen
               | system. Last I heard, they're still at lowish volumes,
               | selling some to national labs and the like.
        
               | sbierwagen wrote:
               | CS-1 costs "$2-3 million", CS-2 costs "several" million.
               | 
               | A single Nvidia H100 costs somewhere around $30,000 each,
               | so a GPU server with every slot populated costs about
               | $300,000.
        
               | brucethemoose2 wrote:
               | ServeTheHome claims "HGX A100 platforms, when they are
               | sold as single servers are generally in the $130K-$180K
               | even leaving a very healthy margin for OEMs/ resellers"
               | 
               | https://www.servethehome.com/graphcore-celebrates-a-
               | stunning...
               | 
               | Not sure about the H100, but it seems to be more supply
               | constrained (hence pricier) atm.
               | 
               | Now, the real question is how many HGX nodes "equals" a
               | single CS2 node. The math here is extremely fuzzy, as the
               | benefit to such extreme node consolidation depends on the
               | workload, and the CS-2 takes up less space, but the HGX
               | cluster will have more directly accessible RAM and better
               | turnkey support for stuff since its Nvidia.
        
               | bubblethink wrote:
               | There is cloud pricing on the website.
               | https://www.cerebras.net/product-cloud/
        
               | alchemist1e9 wrote:
               | This is actually really important from my perspective. It
                | looks like an end user can work backwards from available
                | inference hardware or inference budget, plus required
                | speed, then figure out a viable model size. Bring their
                | own data and then fine-tune or train from scratch.
               | 
               | This is getting so real so fast.
        
       | tombert wrote:
       | Has anyone tried this? I have 96GB of GPU memory; will that be
       | enough to run the biggest model?
        
         | cuuupid wrote:
         | 13B fits nicely even in a 3090 (24gb vram)!
        
           | Filligree wrote:
           | Only if you can use int8, which may or may not be convenient.
        
           | lxe wrote:
           | with load_in_8bits ?
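            | 
            | i.e. something like this (untested sketch; needs the
            | bitsandbytes and accelerate packages alongside transformers):
            | 
            |   from transformers import AutoTokenizer
            |   from transformers import AutoModelForCausalLM
            | 
            |   name = "cerebras/Cerebras-GPT-13B"
            |   tok = AutoTokenizer.from_pretrained(name)
            |   model = AutoModelForCausalLM.from_pretrained(
            |       name,
            |       device_map="auto",   # auto-place layers
            |       load_in_8bit=True,   # int8 weights via bitsandbytes
            |   )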
        
         | spi wrote:
         | I have not tried, but 96GB of GPU memory is plenty, for
         | inference there should certainly be no issue. Their biggest
         | model has 13B parameters, you should be able to run inference
         | (float16) already with 32GB of memory.
         | 
         | With 96GB of memory you should also be able to fine-tune it
         | (possibly some tricks like gradient accumulation and/or
         | checkpointing might be needed), but you have to be ready for
         | many days of computation...
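          | 
          | For the fine-tuning case, those tricks are just flags if you
          | use the HF Trainer - a sketch (dataset/model wiring omitted,
          | argument names from 2023-era transformers):
          | 
          |   from transformers import TrainingArguments
          | 
          |   args = TrainingArguments(
          |       output_dir="cerebras-13b-ft",
          |       per_device_train_batch_size=1,
          |       gradient_accumulation_steps=16,  # emulate a bigger batch
          |       gradient_checkpointing=True,     # trade compute for RAM
          |       fp16=True,
          |       num_train_epochs=1,
          |   )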
        
           | alchemist1e9 wrote:
           | > but you have to be ready for many days of computation...
           | 
            | I was thinking: we have API prices per token, and now it
            | looks like self-hosted inference on high-end GPUs is possible
            | for similar models. Based on electricity prices there will be
            | a self-hosted price per token. How close are these already?
            | What is the markup today over roughly the raw electricity
            | cost that OpenAI pays?
        
       | wg0 wrote:
        | Noob to ML in practice. These models contain weights - do they
        | all have a standard file/binary format?
        
         | examplary_cable wrote:
         | [I'm not an expert] but I believe .ckpt and .safetensors.
         | 
          | The problem with .ckpt is that loading it can execute arbitrary
          | code on your machine (very unsafe), while .safetensors was made
          | by Hugging Face in order to have a safe format for storing the
          | weights. I've also seen people load up LLaMA 7B via a .bin
          | file.
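          | 
          | Rough illustration of the difference (assumes torch and
          | safetensors are installed; file names are just placeholders):
          | 
          |   import torch
          |   from safetensors.torch import load_file
          | 
          |   # pickle-based checkpoint: unpickling can run arbitrary code
          |   state = torch.load("model.ckpt", map_location="cpu")
          | 
          |   # safetensors: plain tensor data plus a header, no code runs
          |   state = load_file("model.safetensors")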
        
       | eldenring wrote:
       | > Trained using the Chinchilla formula, these models provide the
       | highest accuracy for a given compute budget.
       | 
       | I'm confused as to why 111 million parameter models are trained
       | with the Chinchilla formula. Why not scale up the training data?
       | If you're training smaller models, surely optimizing performance
       | is better than optimizing total compute.
       | 
       | Seems like a silly misunderstanding of the Chinchilla paper, but
       | I'm sure I'm missing something
        
         | gamegoblin wrote:
         | True. There was a good blog post published about this a few
         | weeks ago: https://finbarr.ca/llms-not-trained-enough/
         | 
         | Money quote for those who don't want to read the whole thing:
         | 
         | '''
         | 
         | When people talk about training a Chinchilla-optimal model,
         | this is what they mean: training a model that matches their
         | estimates for optimality. They estimated the optimal model size
         | for a given compute budget, and the optimal number of training
         | tokens for a given compute budget.
         | 
         | However, when we talk about "optimal" here, what is meant is
         | "what is the cheapest way to obtain a given loss level, in
         | FLOPS." In practice though, we don't care about the answer!
         | This is exactly the answer you care about if you're a
         | researcher at DeepMind/FAIR/AWS who is training a model with
         | the goal of reaching the new SOTA so you can publish a paper
         | and get promoted. If you're training a model with the goal of
         | actually deploying it, the training cost is going to be
         | dominated by the inference cost. This has two implications:
         | 
         | 1) there is a strong incentive to train smaller models which
         | fit on single GPUs
         | 
         | 2) we're fine trading off training time efficiency for
         | inference time efficiency (probably to a ridiculous extent).
         | 
         | Chinchilla implicitly assumes that the majority of the total
         | cost of ownership (TCO) for a LLM is the training cost. In
         | practice, this is only the case if you're a researcher at a
         | research lab who doesn't support products (e.g. FAIR/Google
         | Brain/DeepMind/MSR). For almost everyone else, the amount of
         | resources spent on inference will dwarf the amount of resources
         | spent during training.
         | 
         | '''
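          | 
          | A quick sanity check on that tradeoff, using the standard
          | ~6*N*D training-FLOPs and ~2*N per-token inference-FLOPs
          | approximations (rough sketch, not from the post):
          | 
          |   N = 13e9      # parameters
          |   D = 20 * N    # Chinchilla-style budget, ~20 tokens/param
          | 
          |   train_flops = 6 * N * D
          |   per_token = 2 * N
          | 
          |   # tokens served before inference compute matches training
          |   print(train_flops / per_token)  # 3*D, i.e. ~7.8e11 tokens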
        
           | haldujai wrote:
           | While true I think this also misses that "for almost everyone
           | else" you're probably not (or at least should not) be trying
           | to optimize zero-shot performance if you have an intended
           | high inference use case so I don't think Chinchilla would be
           | all that relevant.
        
             | vintermann wrote:
             | I have a suspicion that good zero-shot performance is a
             | good starting point for fine-tuning. If you have more than
             | one intended high inference use case, or can imagine a
             | couple of new ones on the horizon, it might still be best
             | to not target the first use case directly.
        
         | haldujai wrote:
         | You're not wrong, the Chinchilla rationale is that it may be
         | more compute efficient to obtain a given loss using larger
          | model sizes if the budget allows. As another commenter states,
          | this ignores the inference part of the equation.
         | 
         | As an example the BERT/RoBERTa family were trained for much
         | longer than Chinchilla, you do get diminishing returns though.
         | 
         | There is a point of overtraining where downstream performance
         | is impacted but that's pretty high.
         | 
         | I think part of the answer to this is also that xxx million
         | parameter decoder-only models don't seem to be that useful so
         | it may not be worthwhile to optimize them for performance?
        
         | ftxbro wrote:
         | The point of those smaller models is for the "Cerebras Scaling
         | Law for Compute-Optimal Training" which is the straight line
         | plot in the image at the top of their webpage when you click
         | the link.
         | 
         | They want you to think it's reasonable that because the line is
         | so straight (on a flops log scale) for so long, it could be
         | tempting to extrapolate the pile-loss consequences of
         | continuing compute-optimal training for larger models beyond
         | their largest 13B one, with the obvious caveat that the
         | extrapolation can't continue linearly much further if for no
         | other reason than the test loss isn't going to go below zero
         | (it will flatten out sooner than that).
         | 
         | If you trained beyond compute-optimality on smaller models, it
         | would mess up their straight line and make it look like we are
         | sooner hitting diminishing returns on test loss.
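          | 
          | The extrapolation itself is just a straight-line fit in log-log
          | space, something like this (the numbers below are made up,
          | purely to show the mechanics, not Cerebras' actual data):
          | 
          |   import numpy as np
          | 
          |   flops = np.array([2e18, 2e19, 2e20, 2e21])   # hypothetical
          |   loss = np.array([2.6, 2.3, 2.05, 1.85])      # hypothetical
          | 
          |   slope, intercept = np.polyfit(np.log(flops), np.log(loss), 1)
          |   # extrapolate one decade further in compute
          |   print(np.exp(intercept + slope * np.log(2e22)))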
        
       | simonw wrote:
       | "Cerebras open sources seven GPT-3 models from 111 million to 13
       | billion parameters."
       | 
       | I don't understand why they describe them as GPT-3 models here as
       | opposed to calling them GPT models. Or even LLMs - but I guess
       | that acronym isn't as widely recognized.
        
         | wsgeorge wrote:
         | I think GPT-3 is used as a benchmark for performance, so saying
         | a model is on par with GPT-3 should give you an idea of what
         | you can get out of it.
         | 
         | IIRC most open source models to date - including the semi-open
         | LLaMAs - have GPT-3-like performance. Nothing gets close to
         | GPT-3.5 and beyond.
        
       | patientplatypus wrote:
       | [dead]
        
       ___________________________________________________________________
       (page generated 2023-03-28 23:00 UTC)