[HN Gopher] Cerebras-GPT: A Family of Open, Compute-Efficient, L...
___________________________________________________________________
 
Cerebras-GPT: A Family of Open, Compute-Efficient, Large Language
Models
 
Author : asb
Score  : 341 points
Date   : 2023-03-28 16:34 UTC (6 hours ago)
 
(HTM) web link (www.cerebras.net)
(TXT) w3m dump (www.cerebras.net)
 
  | ivanvas wrote:
  | Is it currently possible to fine-tune any of the foundation
  | models available on a few GB of unsupervised text?
 
  | simonw wrote:
  | Does the Chinchilla recipe still hold today? I got the impression
  | that the LLaMA paper proposed a different result, where throwing
  | far more tokens at the problem had a very meaningful impact - or
  | did I misunderstand that?
 
  | evanmays wrote:
  | There's discussion elsewhere in this thread about what Chinchilla
  | actually means. I'll only compare it to LLaMA.
  |
  | Tl;dr: Chinchilla isn't wrong, it's just useful for a different
  | goal than the LLaMA paper.
  |
  | There are three hyperparameters to tweak here: model size
  | (parameter count), number of tokens pretrained on, and amount of
  | compute available. End performance is in theory a function of
  | these three hyperparameters.
  |
  | You can think of this as an optimization problem.
  |
  | Chinchilla says: if you have a fixed amount of compute, here's
  | what size and number of tokens to train for maximum performance.
  |
  | A lot of the time, though, we have a fixed model size, because
  | size impacts inference costs and latency. LLaMA operates in this
  | territory. They chose to fix the model size instead of the amount
  | of compute.
  |
  | This could explain gaps in performance between Cerebras models of
  | size X and LLaMA models of size X. LLaMA models of size X have
  | way more compute behind them.
 
  | espadrine wrote:
  | I don't think it holds, for two reasons.
  |
  | First, it only holds for a given architecture and implementation.
  | Obviously, a different architecture will have a different
  | training slope. This is clear when comparing LSTMs with
  | Transformers, but it is also true between transformers that use
  | prenorm/SwiGLU/rotary positional embeddings and those that follow
  | Vaswani 2017.
  |
  | In terms of implementation, some algorithms yield the same result
  | with fewer operations (IO-aware ones, like FlashAttention and
  | other custom CUDA kernels, and parallelism, like PaLM, both of
  | which came after Chinchilla), which unambiguously affects the
  | FLOPs side of the Chinchilla equation. Also, faster algorithms
  | and better parallelization will reach a given loss sooner, while
  | less power-hungry setups will do so more cheaply.
  |
  | Second, even in the original Chinchilla paper, in figure 2, some
  | lines are stopped early, before reaching Pareto optimality
  | (likely because they ran out of tokens, but LLaMA makes it seem
  | that >1 epoch of training is fine).
 
  | whalesalad wrote:
  | This "AI spring" is really snowballing with the crazy nouns and
  | terminology. Alpaca, llama, and now chinchilla??
 
  | murkt wrote:
  | Chinchilla actually came before Alpaca and LLaMA.
  |
  | Every new variation of a model gets some new name, just like
  | every library gets a new name. There were all kinds of BERTs
  | before - DistilBERT, RoBERTa, SciBERT, Schmobert, Schmuber, etc.
  | Many hundreds of them, I think.
 
  | whoisnnamdi wrote:
  | Chinchilla actually came first!
 
  | jhbadger wrote:
  | As mentioned, Chinchilla is not part of this trend, and
  | chinchillas are rodents. Alpacas and llamas are South American
  | camelids (animals related to camels).
  | So if additional names are needed, I would expect them to be
  | vicuna and guanaco, as they are also in the group.
 
  | mometsi wrote:
  | I think the relevant category is "Adorable Fuzzy Critters of the
  | Andes". See also https://en.wikipedia.org/wiki/Spectacled_bear
 
  | ramesh1994 wrote:
  | The term "chinchilla" predates llama/alpaca. It doesn't directly
  | map to a specific model; rather, it refers to a family of
  | compute-optimal models.
 
  | mometsi wrote:
  | Summary: This is a company that makes AI accelerator ICs. They
  | reimplemented Chinchilla and released the model weights under a
  | permissive license.
 
  | bogwog wrote:
  | In other words, they're actually incentivized to help make LLMs
  | as accessible as possible, rather than try to keep them locked up
  | to hide them from competitors.
  |
  | Which makes me wonder if Nvidia is doing anything with LLMs too?
 
  | vintermann wrote:
  | Nvidia has certainly been pushing the envelope on image
  | generation. StyleGAN3 was really cool when it came out.
  |
  | But it is an issue that their chips are hardly optimized for
  | LLMs.
 
  | meghan_rain wrote:
  | How can a GPU be optimized for StyleGAN but not LLMs? Serious
  | question.
 
  | MacsHeadroom wrote:
  | RAM. GPT-3 is over 600GB, i.e. just about the max RAM of 8xA100s,
  | because that's all the hardware can fit.
  |
  | Stable Diffusion plus a whole chain of imagenets can make any
  | visual imagery imaginable in 2GB of RAM. Meanwhile, 2GB of RAM
  | barely runs a basic tiny text-completion NN that can't do
  | anything intelligent. Text requires a lot more parameters (and
  | more memory/RAM) than images.
 
  | brucethemoose2 wrote:
  | The Cerebras node's actual "RAM" (the 40GB of SRAM) is pretty
  | modest too, but being an _enormous_ chip with networked storage
  | pools is certainly a better situation than a bunch of A100s
  | reaching out to every other A100.
  |
  | Honestly, all the AI ASIC makers drastically underestimated the
  | RAM requirements of future models. Graphcore's 4GB and
  | Tenstorrent's 8GB per IC are kinda laughable, and it takes them
  | longer to adjust than Nvidia. And Cerebras' original pitch was
  | "fit the entire model into SRAM!"
 
  | option wrote:
  | Obviously yes - there are larger (20B) models trained on more
  | tokens that NVIDIA has released under a commercial-friendly
  | license on HF:
  |
  | https://huggingface.co/nvidia/nemo-megatron-gpt-20B
 
  | antimatter15 wrote:
  | Looking at their charts, it seems like their 6.7B model is
  | considerably worse than GPT-J, which is an existing open 6B model
  | from several years ago.
  |
  | I wish that rather than stopping training early they had run more
  | data through a small model, so we could have something more
  | competitive with LLaMA 7B.
 
  | cs-fan-101 wrote:
  | Someone posted this quote from the Cerebras Discord earlier, but
  | sharing for visibility -
  |
  | "We chose to train these models to 20 tokens per param to fit a
  | scaling law to the Pile data set. These models are optimal for a
  | fixed compute budget, not necessarily "best for use". If you had
  | a fixed parameter budget (e.g., because you wanted to fit models
  | on certain hardware) you would train on more tokens. We do that
  | for our customers that seek that performance and want to get
  | LLaMA-like quality with a commercial license"
 
  | HanClinto wrote:
  | Sounds like we should crowd-fund the cost to train and open-
  | source one of these models with LLaMA-like quality.
  |
  | I'd chip in!
 
  | brucethemoose2 wrote:
  | TBH that seems like a good job for Cerebras.
  | There are plenty of such efforts, but the organizer needs some
  | kind of significance to attract a critical mass, and an AI ASIC
  | chip designer seems like a good candidate.
  |
  | Then again, maybe they prefer a bunch of privately trained models
  | over an open one, since that sells more ASIC time?
 
  | brucethemoose2 wrote:
  | > Cerebras Discord
  |
  | This is really weird to hear out loud.
  |
  | I still think of Discord as a niche gaming chatroom, even though
  | I know that (for instance) a _wafer-scale IC design company_ is
  | hosting a Discord now.
 
  | simonw wrote:
  | You can try out some of these models on Hugging Face here:
  | https://huggingface.co/cerebras/Cerebras-GPT-1.3B
  |
  | That was the largest that had inference enabled - I'd really like
  | to try this one: https://huggingface.co/cerebras/Cerebras-GPT-13B
 
  | amilios wrote:
  | Comparing the 13B model here
  | https://huggingface.co/cerebras/Cerebras-GPT-13B to LLaMA-13B
  | https://github.com/facebookresearch/llama/blob/main/MODEL_CA...
  | you can see that in all of the reasoning tasks Cerebras-GPT lags
  | behind. Any reason to use Cerebras instead of LLaMA? Doesn't seem
  | like it.
 
  | option wrote:
  | It lags behind because, according to their blog post, it was
  | trained on <300B tokens. The LLaMAs, as far as I know, were
  | trained on more than a trillion.
 
  | gpm wrote:
  | The LLaMA paper says 1 trillion for the smaller models (7B, 13B)
  | and 1.4 trillion for the larger models (30B, 65B).
 
  | [deleted]
 
  | mdagostino wrote:
  | LLaMA is non-commercial.
 
  | potatoman22 wrote:
  | Can the LLaMA weights be used for commercial products?
 
  | espadrine wrote:
  | There are two aspects to it.
  |
  | The first one is whether they would actually sue. The optics
  | would be terrible. A similar situation occurred in the 90s when
  | the RC4 cipher's code was leaked. Everyone used the leaked code,
  | pretending that it was a new cipher called arc4random, even
  | though they had confirmation from people who licensed the cipher
  | that its output was identical. Nobody was sued, and RSA never
  | acknowledged it.
  |
  | The second one is related to the terms. The LLaMA weights
  | themselves are licensed under terms that exclude commercial
  | use:[0]
  |
  | > _You will not [...] use [...] the Software Products (or any
  | derivative works thereof, works incorporating the Software
  | Products, or any data produced by the Software), [...] for [...]
  | any commercial or production purposes._
  |
  | But the definition of derivative works is gray. AFAIK, if LLaMA
  | is distilled, there is an unsettled argument to be had that the
  | end result is not a LLaMA derivative and cannot be considered
  | copyright or license infringement, similar to how models trained
  | on blog articles and tweets are not infringing on those authors'
  | copyright or licensing. The people that make the new model may be
  | in breach of the license if they agreed to it, but maybe not the
  | people that use that new model. Otherwise, ad absurdum, a model
  | trained on the Internet will have content that was generated by
  | LLaMA in its training set, so all models trained on the Internet
  | after Feb 2023 would break the license.
  |
  | IANAL, but ultimately, Meta wins more by benefiting from what the
  | community contributes on top of their work (similar to what
  | happened with React) than by suing developers that use
  | derivatives of their open models.
  |
  | [0]: https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z...
 
  | gpm wrote:
  | Unclear, and likely jurisdiction-dependent; almost certainly not
  | if you need to operate worldwide.
  | binarymax wrote:
  | Here are the zero-shot accuracy numbers posted in the Hugging
  | Face evaluations for Cerebras-GPT 13B vs. the results of LLaMA
  | 13B in their paper:
  |
  |   Model             BoolQ  PIQA  SIQA  HellaSwag  WinoGrande  ARC-e  ARC-c  OBQA
  |   LLaMA 13B          78.1  80.1  50.4       79.2        73.0   74.8   52.7  56.4
  |   Cerebras-GPT 13B      -  76.6     -       51.3        64.6   71.4   36.7  28.6
 
  | wsgeorge wrote:
  | I guess it's something. It still goes to show how far open models
  | are behind the proprietary SOTA.
 
  | topynate wrote:
  | From their Discord:
  |
  | > It would be interesting to know why you chose those FLOPS
  | targets; unfortunately it looks like the models are quite
  | under-pretrained (260B tokens for the 13B model)
  |
  | > We chose to train these models to 20 tokens per param to fit a
  | scaling law to the Pile data set. These models are optimal for a
  | fixed compute budget, not necessarily "best for use". If you had
  | a fixed parameter budget (e.g., because you wanted to fit models
  | on certain hardware) you would train on more tokens. We do that
  | for our customers that seek that performance and want to get
  | LLaMA-like quality with a commercial license
  |
  | Which is the point made elsewhere in these comments, e.g.
  | https://news.ycombinator.com/item?id=35344192, and it also
  | usefully shows how open Cerebras are. They're pretty open, but
  | not as much as they would be if they were optimising for filling
  | in other companies' moats.
 
  | binarymax wrote:
  | Indeed, but this is zero-shot performance. Fine-tuning for a task
  | should get you pretty good results. I'm interested in seeing the
  | results of an Alpaca method applied to this Cerebras 13B model.
 
  | MacsHeadroom wrote:
  | > I'm interested in seeing the results of an Alpaca method
  |
  | You're comparing apples to oranges. The "Alpaca method" is a
  | dataset-generation method. Nothing about Alpaca's training method
  | is novel, interesting, or efficient. Alpaca used the same
  | standard training method everyone else uses: A100 clusters.
  |
  | If you mean the LoRA/PEFT training which people used to replicate
  | Alpaca, then that is also apples to oranges, because LoRA/PEFT is
  | a fine-tuning method, not a pre-training method.
 
  | [deleted]
 
  | Vetch wrote:
  | Base model performance is what's most important, and it also
  | impacts fine-tuning quality. Practically, a model that's good out
  | of the box with minimal fine-tuning is useful to more people.
  | Since they focused on being training-compute optimal for some
  | budget, expect their models to lag behind LLaMA overall. Their
  | 6.7B version should lag behind GPT-J, assuming 20 tokens per
  | parameter.
  |
  | The Pythia models are also worth checking out; they might be
  | better than or matched to the Cerebras-GPTs at each size
  | (although they warn they are not intended for deployment).
  |
  | Conclusion: the landscape of top open models remains unchanged.
 
  | rnosov wrote:
  | I might be missing something, but it looks to me like actually
  | running this "open" model requires special hardware only
  | accessible with a cloud subscription with a 60,000 USD/week
  | minimum spend[1]. Can anyone confirm whether you can run it on
  | your own hardware? If the software is open but the hardware is
  | locked, I don't see the point.
  |
  | [1] https://www.hpcwire.com/2021/09/16/cerebras-wafer-scale-engi....
  |
  | EDIT: Ok, looks like I've missed the Hugging Face repo. The
  | language they use is a bit confusing.
 
  | bubblethink wrote:
  | You can run inference on GPUs. These are just models and weights.
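For concreteness, here is a minimal sketch of running one of these
checkpoints locally, assuming only the standard Hugging Face
transformers API (this is not an official Cerebras example; the model
ID is the 1.3B checkpoint linked elsewhere in this thread, and larger
checkpoints need proportionally more memory):
 
    # Python; pip install torch transformers
    from transformers import AutoTokenizer, AutoModelForCausalLM
 
    model_id = "cerebras/Cerebras-GPT-1.3B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
 
    # Plain sampling; tune max_new_tokens/top_p to taste.
    inputs = tokenizer("Generative AI is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=50,
                             do_sample=True, top_p=0.9)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))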
  | simonw wrote:
  | The PyTorch model files are already available to download from
  | Hugging Face - the largest one looks to be 52GB. They should run
  | on any hardware that can run regular PyTorch models.
 
  | JamesCoyne wrote:
  | Slightly off-topic:
  |
  | I remember seeing news about the enormous chip Cerebras was/is
  | selling (pdf:
  | https://f.hubspotusercontent30.net/hubfs/8968533/WSE-2%20Dat...).
  |
  | Has there been any indication that the LLMs released in the last
  | few months use exotic hardware like this, or is it all "standard"
  | hardware?
 
  | wmf wrote:
  | OpenAI uses Nvidia GPUs and Google uses their TPUs.
 
  | ipsum2 wrote:
  | Everyone except Google uses Nvidia for training. Cerebras, Gaudi,
  | and other custom AI accelerators have been unable to surpass
  | Nvidia in performance/$ and performance/watt yet.
 
  | brucethemoose2 wrote:
  | You might see more training on Intel XPUs when they come out,
  | since they have such enormous RAM pools. Maybe AMD MI300s and
  | Intel Ponte Vecchio (both 128GB) in the shorter term, though I
  | think those will mostly be in HPC supercomputers rather than
  | cloud instances.
 
  | eternalban wrote:
  | > It takes substantial technical expertise to train very large
  | models on GPUs. In the recently released GPT-4 Technical Report,
  | OpenAI credits over thirty contributors just for compute
  | infrastructure and scaling.
  |
  | This is called a silver lining for some (in case you were worried
  | about GPT taking your job). Privacy requirements alone will in
  | the near term force major companies to run their own inference
  | (if not training). The expertise required is nearly identical to
  | that of running large-scale distributed computational graphs.
  |
  | This is an interesting divergence from what happened with the
  | web. The backends started out simple, before map-reduce and
  | before deconstructing databases and processing distributed logs.
  | With ML, we'll jump right into the complex backends, in tandem
  | with easy-picking early-stage edge applications (which we see
  | daily on HN).
 
  | 2bitencryption wrote:
  | This type of article (or press release, or whatever you want to
  | call it) is exactly what makes the future so interesting.
  |
  | The cat is out of the bag, the genie is out of the bottle, the
  | confetti has left the cannon[0].
  |
  | It's tempting to see a world dominated by Google Bard, ChatGPT,
  | Bing Search, etc. And no doubt, they will be huge players, with
  | services that are far more powerful than anything that can be run
  | on the edge.
  |
  | But. BUT. The things that we can do on the edge are _incredible_
  | now. Just imagine a year from now, or two. These earth-shattering
  | models, which seem to be upending a whole industry, will soon
  | have equivalents that run on the edge. Without services spying on
  | your data. Without censorship of what the model can/cannot say.
  | Because it's all local.
  |
  | When was the last time this happened? There _will_ be players who
  | publish weights for models that are free to use. The moment that
  | torrent magnet link is published, it's out in the wild. And smart
  | people _will_ package them as "one-click installers" for people
  | who aren't tech-savvy. This is already happening.
  |
  | So every time you're amazed by something GPT-4 says, remember
  | that soon this will be in your pocket.
  |
  | [0] the "confetti" idiom brought to you by GPT-4.
 
  | jazzkingrt wrote:
  | Serious question: is it typical to describe client-side computing
  | as "on the edge"?
  | I thought running something on the edge referred to running it in
  | close network proximity to the user, rather than users having
  | control and running things themselves.
 
  | dragonwriter wrote:
  | > Serious question: is it typical to describe client-side
  | computing as "on the edge"?
  |
  | Somewhat; it's consistent with, e.g., Google's "Edge TPU"
  | designation for its client-side neural processors.
  |
  | > I thought running something on the edge referred to running it
  | in close network proximity to the user
  |
  | Typically, but on the client device _is_ the limit case of "close
  | network proximity to the user", so the use is consistent.
 
  | capableweb wrote:
  | Yes, "edge computing" can refer both to computing done as close
  | to the user as possible geographically, and to computing done on
  | the device itself. If someone says "I wanna do edge computing",
  | it's not clear enough to know whether they just want servers they
  | control as close to the user as possible, or to do the computing
  | on the device itself. I think Apple would say "edge computing" is
  | on the actual device, while Cloudflare would say "edge computing"
  | is on their infrastructure, distributed to be physically closer
  | to the end user.
 
  | iamerroragent wrote:
  | I guess I've been out of the loop for a bit and didn't realize
  | that "edge computing" became a term after cloud computing took
  | off.
  |
  | It is kind of cyclical then, is it not?
  |
  | By that I mean computers used to be shared, and you would log
  | into them through a terminal.
  |
  | Then the PC came around.
  |
  | Then about 15 years ago cloud computing became the rage (really
  | an extension of, or a more sophisticated system than, the first
  | time-shared computers).
  |
  | Now we're back to local computing. I even see more self-hosting
  | and moving away from the cloud due to costs.
  |
  | All that rant is to say: it's interesting.
  |
  | Side note: getting this AI to be localized as much as possible
  | will, I imagine, be really useful in the medical industry,
  | because it helps alleviate HIPAA requirements.
 
  | nordsieck wrote:
  | > It is kind of cyclical then, is it not?
  |
  | > By that I mean computers used to be shared, and you would log
  | into them through a terminal.
  |
  | > Then the PC came around.
  |
  | > Then about 15 years ago cloud computing became the rage (really
  | an extension of, or a more sophisticated system than, the first
  | time-shared computers)
  |
  | There's a really neat article called "The Eternal Mainframe"[1]
  | that you might be interested in. It explores this idea in greater
  | depth.
  |
  | ---
  |
  | 1. http://www.winestockwebdesign.com/Essays/Eternal_Mainframe.h...
 
  | iamerroragent wrote:
  | Thanks, that was an interesting read!
  |
  | I wonder if the author's perspective has changed with regards to
  | freedom to compute.
  |
  | Social media is often used as an example of privacy invasion,
  | though I've failed to see why concerns over Facebook handling
  | your private data are worrying when they don't have a product you
  | need to have.
  |
  | Email, on the other hand, is pretty much a necessity today, so
  | privacy concerns are vital there, imo. Of course you can host
  | your own mail server, whereas you can't host your own Facebook.
 
  | wsgeorge wrote:
  | I believe this has been extended to mean "on device", which is
  | interesting. See Gerganov's article on GitHub [0]. I wrote about
  | this here [1], where I made a contrast between the core and the
  | edge. I think the term maps well to this meaning.
  | What I find more interesting is that in the classic "close
  | network proximity" sense, some parts of the world may not have
  | benefited as much from that trend, since the closest nodes of a
  | global delivery network could be several countries away.
  |
  | [0] https://github.com/ggerganov/llama.cpp/discussions/205
  |
  | [1] https://medium.com/sort-of-like-a-tech-diary/consumer-ai-is-...
 
  | TeMPOraL wrote:
  | > _I believe this has been extended to mean "on device", which is
  | interesting._
  |
  | I don't like the connotations this carries. This is almost openly
  | talking about reaching all the way into people's hardware to run
  | your software, for your benefit, on them, without their
  | knowledge, consent or control...
 
  | wsgeorge wrote:
  | I see. Hadn't considered this. Yes, I see how that might be a
  | concern.
  |
  | What I think is important in this AI spring is that we make it
  | possible for people to run their own models on their own hardware
  | too, without having to submit anything to a large, centralised
  | model for inference.
 
  | matchagaucho wrote:
  | An LLM running locally providing type-ahead completions seems
  | inevitable.
 
  | lioeters wrote:
  | Yes, yes, and yes. I'm waiting for an actually open AI that can
  | run on the edge, purely on commodity hardware like our laptops
  | and phones - it's inevitable.
  |
  | I imagine this "cat out of the bag" situation - the
  | democratization and commodification of powerful technology,
  | accessible and affordable to the public - is similar to what's
  | happening with single-board computers and microcontrollers like
  | the Raspberry Pi, Arduino, and ESP32.
  |
  | It might be similar to what happened with mobile phones, but
  | there the power was quite restricted: the (mostly) duopoly of iOS
  | and Android, with devices and apps locked down in various ways.
  | Sure, we can "jailbreak" and "root" our phones, but that's not
  | for the general public.
  |
  | Maybe solar energy production is going through a similar process,
  | with panels and batteries becoming more efficient and affordable
  | every year.
  |
  | Certainly, it reminds one of the history of personal computers,
  | and the way such a powerful general-purpose tool became
  | ubiquitous and local.
 
  | akiselev wrote:
  | After using ChatGPT 4 extensively for a few days, I think we're
  | probably only a few years away from the first generation of truly
  | conversational assistants a la Jarvis in Iron Man. Between
  | LangChain and existing voice recognition software, we're already
  | 95% of the way there; it just needs to be packaged up into a
  | UI/UX that makes sense.
  |
  | These local models are absolutely critical for that to happen,
  | though. I'm hitting daily situations where I have to reconsider
  | my use of ChatGPT because what I'm asking would leak very private
  | personal information or somehow trip its morality filter.
  |
  | Just swapped in a 2TB NVMe for a separate /home mount and
  | reinstalled Arch just so I could have enough disk space to test a
  | dozen models locally. I'm so ready!
 
  | barking_biscuit wrote:
  | Only a few years? Nobody can predict accurately in years anymore.
  | It feels more like "only a few months" away.
 
  | meghan_rain wrote:
  | I pray to the AI gods that OpenAI will fail at calibrating the
  | censorship layer and will continue to over-censor, which in turn
  | will hopefully lead to many use cases requiring local LLMs, which
  | in turn would increase the incentive to build them.
  | visarga wrote:
  | > the democratization and commodification of powerful technology
  | accessible and affordable to the public
  |
  | Empowering everyone equally, with a very low barrier to entry.
  | The spread of AI will make some skills less important and other
  | skills more important. There will be painful social changes.
  | Memorising stuff, being exact, being careful, diligent - all
  | these things can be done by an LLM. But also creative
  | recombination of ideas. For now we still hold supremacy in
  | long-form content, complex software projects and detecting
  | inconsistencies, but those could be emergent qualities in GPT-5.
 
  | sbierwagen wrote:
  | If you thought crypto miners drove up the prices of GPUs, just
  | wait until a gaming card can host a person who can get a job.
  |
  | Integer-quantized LLaMA can fit on a 4090, and isn't quite
  | human-level. If a fully aligned, reliably non-hallucinatory LLM
  | can fit on the next-generation 5-series GPUs, then expect their
  | market price to quadruple.
 
  | hiAndrewQuinn wrote:
  | I for one dream of a future without maps. I want to walk through
  | a distant forest to find an ancient, unconnected ESP32 in the
  | bark of a tree, containing a tiny specialized AI that can only
  | tell me about things relevant to the area - how far to walk
  | upstream to the nearest town. And only if I can find it and scan
  | an RFID tag to wake it up.
 
  | vinc wrote:
  | A beautiful dream!
  |
  | > I like to think (right now, please!) of a cybernetic forest
  | filled with pines and electronics, where deer stroll peacefully
  | past computers as if they were flowers with spinning blossoms.
 
  | t_minus_2 wrote:
  | The cat is out of the bag, the genie is out of the bottle, the
  | confetti has left the cannon, the ship has sailed, the horse has
  | bolted, the toothpaste is out of the tube, the beans have been
  | spilled, the train has left the station, the die is cast, the
  | bell has been rung.
 
  | cjf101 wrote:
  | Yes, this is true. But I worry about how long it will take for
  | the utility of "GPT-4" on my phone to get close enough to
  | whatever is only possible through models running on large cloud
  | platforms to make that choice relatively drawback-free.
  |
  | Is the curve of what this class of algorithms can provide
  | sigmoid? If so, then yeah, eventually researchers should be able
  | to democratize it sufficiently that choosing versions that can
  | run on private hardware is rational. But if the utility increases
  | linearly or better over time/scale, the future will belong to
  | whoever owns the biggest datacenters.
 
  | hintymad wrote:
  | I'd go one step further, if it is not happening yet: smaller
  | companies should really pool their resources to train open LLMs.
  | Say, form a consortium and work with the open source community to
  | build a ChatGPT equivalent. Companies would be crazy to assume
  | that they can hand their future to the APIs offered by a handful
  | of companies during this monumental technological paradigm shift
  | in history.
  |
  | That is, a real OpenAI with an open governing body.
 
  | yieldcrv wrote:
  | > And smart people will package them as "one click installers"
  | for people who aren't tech-savvy. This is already happening.
  |
  | Any projects I can follow? Because I haven't seen any one-click
  | installers yet that didn't begin with "first, install a package
  | manager on the command line".
 
  | slickdork wrote:
  | Not an LLM, but this one-click installer for Stable Diffusion is
  | literally a one-click installer. It impressively works.
  | https://github.com/cmdr2/stable-diffusion-ui
 
  | simon83 wrote:
  | Google: "confetti has left the cannon"
  |
  | > No results found for "confetti has left the cannon".
  |
  | I'm amazed that a "stochastic parrot" can come up with such a
  | beautiful idiom.
 
  | visarga wrote:
  | That name aged like milk. First of all, what you said. And
  | second, a parrot can make more parrots without human help;
  | language models can't make GPU chips. Insulting for both LLMs and
  | parrots.
 
  | educaysean wrote:
  | Language models exist not in the physical world but in the
  | digital one. A more apt comparison would be whether LLMs can
  | compile new (or slightly modified) LLMs, and I believe the answer
  | will soon be (if not already) a resounding yes.
 
  | visarga wrote:
  | I agree. LLMs can generate the source code for their network,
  | generate the training data, supervise a model being trained, and
  | explain step by step what they do. That's why I specifically said
  | they can't make GPUs; if they could, they would be
  | self-replicators. We'd have to consider them alive.
 
  | groestl wrote:
  | > language models can't make GPU chips
  |
  | Not yet. I'm sure somebody is taking a hard look already.
 
  | flir wrote:
  | Could be lifted from another language. Is nice though.
 
  | barking_biscuit wrote:
  | Out-of-distribution generations are a thing.
 
  | simon83 wrote:
  | I understand that all of this is based on some fundamental
  | mathematics - a couple of algorithms here, some statistical
  | analysis there. And I find it absolutely amazing that we can use
  | all of that knowledge and encode it into something that resembles
  | intelligence. This makes me wonder whether our brains and the
  | concept of intelligence are really as special and mysterious as
  | we assume.
 
  | athom wrote:
  | Try looking up "pinkie pie party cannon".
 
  | tjr wrote:
  | Not exactly the same, but from:
  |
  | https://www.instagram.com/p/CQdBiVyh5C2/?hl=en
  |
  | _Now that the cat is out of the bag, or, should I say, the
  | confetti is out of the... can?_
 
  | 2bitencryption wrote:
  | For the curious, here was the conversation I had:
  |
  | ME: What's another saying similar to "The cat is out of the bag"
  | or "The genie is out of the bottle"?
  |
  | chatgpt: Another similar saying is "the toothpaste is out of the
  | tube." Like the other two expressions, this phrase conveys the
  | idea that something has been revealed or unleashed, and it cannot
  | be undone or put back into its original state.
  |
  | ME: Can you invent a new phrase that has the same meaning, which
  | is completely new?
  |
  | chatgpt: Certainly! Here's a new phrase with a similar meaning:
  | "The confetti has left the cannon." This phrase suggests that
  | once something is out in the open or has happened, it's
  | impossible to reverse or contain it, much like confetti that has
  | been shot out of a cannon during a celebration.
  |
  | ME: What's the name for this kind of expression? Aphorism? Fable?
  |
  | chatgpt: This kind of expression is known as an idiom. Idioms are
  | phrases or expressions with a figurative meaning that is
  | different from their literal meaning. They are commonly used in
  | everyday language to convey a specific message or sentiment in a
  | more colorful or memorable way.
 
  | Tommstein wrote:
  | That is impressive, especially given that neither Google nor
  | DuckDuckGo finds any previous trace of that idiom.
  | johtso wrote:
  | Not exactly sure why it would be surprising that it can come up
  | with a convincing idiom when it can produce remarkably good
  | _poetry_.
 
  | bilsbie wrote:
  | I don't understand why people aren't more impressed with it
  | clearly understanding, and even inventing, idioms. That shows
  | some real intelligence.
 
  | fnordpiglet wrote:
  | It's because they're confused in thinking human intelligence
  | isn't learned stochastic expectation.
 
  | gary_0 wrote:
  | That's how the brain evolved, after all: take incoming input,
  | process it according to previous patterns, and output a reaction
  | to the peripheral nervous system that will enhance survival. The
  | evolutionary purpose of a cortex is to store a statistical model
  | of an animal's environment based on past stimulus. Human
  | intelligence is just a complicated permutation of these initial
  | systemic parameters.
 
  | [deleted]
 
  | johnchristopher wrote:
  | OT: I don't know about their scaling strategy for LLMs, but their
  | scaling strategy for displaying pictures is disappointing.
  |
  | (It's all blurry.)
 
  | lanshanbob wrote:
  | I think it's fixed now.
 
  | ricopags wrote:
  | Came here to point this out, though not as pithily :D
  |
  | Really, really bad mark on whoever is in charge of their web
  | marketing. Images should never look that bad - not even in
  | support material, and definitely not in marketing.
  |
  | edit: so this post is more useful: 4K resolution, using the Edge
  | browser.
 
  | Kelamir wrote:
  | Last time I viewed it, I believe it wasn't blurry. Perhaps the
  | images are now displayed in lower quality to handle the traffic?
  |
  | But I'm not sure anymore that it wasn't initially blurry...
  | Perhaps I'm hallucinating, like large language models do.
  |
  | The image currently displayed is https://www.cerebras.net/wp-
  | content/uploads/2023/03/Scaling-... ; I will see if it changes.
 
  | Kelamir wrote:
  | I can confirm, it does change. As of now, it displays one of
  | higher quality: https://www.cerebras.net/wp-
  | content/uploads/2023/03/Scaling-...
 
  | thewataccount wrote:
  | They're dynamically scaled, and something must be broken. If you
  | inspect the source you can find the raw images; here are a few:
  |
  | https://www.cerebras.net/wp-content/uploads/2023/03/Downstre...
  |
  | https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-...
  |
  | https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-...
  |
  | EDIT: Looks like it scores better with less training - up until
  | it matches GPT-J/Pythia/OPT, beyond which there doesn't appear to
  | be much benefit. It maybe scores slightly better than GPT-J,
  | which is pretty "eh"; I'm not sure GPT-J-level performance is
  | really useful for anything. NeoX 20B outperforms it in everything
  | if you don't care about the amount of training needed.
  |
  | Does the better performance for less training matter if that
  | benefit only applies while it's performing a lot worse than
  | GPT-J? It appears to lose its scaling benefits before the
  | performance is interesting enough to matter.
 
  | [deleted]
 
  | johnchristopher wrote:
  | Disabling uBlock on Firefox fixes the issue for me.
  |
  | edit: scratch that, it seems the AJAX endpoint returns 504 more
  | often than not.
 
  | thewataccount wrote:
  | Yeah, it seems hit or miss depending on the window size and
  | stuff. My guess is that whatever is supposed to decide what size
  | to request is messed up.
 
  | ftxbro wrote:
  | > Our paper, which will be available soon, will detail our
  | training methods and performance results.
  |
  | Yay, there will be a paper, let's gooooooo!
 
  | chessgecko wrote:
  | I wonder what led to such a gap between LLaMA 7B and Cerebras
  | 13B.
  | I hope they discuss it in the paper.
 
  | gpm wrote:
  | Is there a benchmark comparing the two that I missed?
  |
  | Edit: The Hugging Face page has 0-shot benchmarks which you can
  | compare against the LLaMA paper:
  |
  | https://huggingface.co/cerebras/Cerebras-GPT-13B
  |
  | https://arxiv.org/pdf/2302.13971.pdf
 
  | freeqaz wrote:
  | I'm on mobile and struggled to compare these two tables properly.
  | Would you mind posting a summary of your findings?
  |
  | Here are some values, but I don't know what they mean. LLaMA 65B
  | on the left, Cerebras 13B on the right.
  |
  | PIQA: 82.8 / 76.6, WinoGrande: 77.0 / 64.6, ARC-e: 78.9 / 71.4
 
  | gpm wrote:
  | Really short summary: LLaMA is better, even the smaller LLaMA
  | models.
  |
  |   Benchmark    Cerebras 13B  LLaMA 7B  LLaMA 13B  LLaMA 65B
  |   HellaSwag            51.3      76.1       79.2       84.2
  |   PIQA                 76.6      79.8       80.1       82.8
  |   WinoGrande           64.6      70.1       73.0       77.0
  |   ARC-e                71.4      72.8       74.8       78.9
  |   ARC-c                36.7      47.6       52.7       56.0
  |   OpenBookQA           28.6      57.2       56.4       60.2
 
  | ftxbro wrote:
  | This gap makes sense to me. The academic point of the Cerebras
  | paper is to show their nice empirical scaling law for
  | compute-optimal training, whereas the academic point of the LLaMA
  | paper was to show that you can make small models punch above
  | their weight by training them in a way that is deliberately not
  | compute-optimal. Of course, both of those publications had other
  | academic and marketing purposes as well.
  |
  | From the Cerebras blog post: "Trained using the Chinchilla
  | formula, these models provide the highest accuracy for a given
  | compute budget."
  |
  | From the LLaMA paper: "The focus of this work is to train a
  | series of language models that achieve the best possible
  | performance at various inference budgets, by training on more
  | tokens than what is typically used."
 
  | visarga wrote:
  | Of course this is great news; I hope these models can be
  | fine-tuned to be like lighter versions of ChatGPT. But I remember
  | reading in the LLaMA paper that a small model can still improve
  | when trained beyond the Chinchilla budget.
  |
  | > For instance, although Hoffmann et al. (2022) recommends
  | training a 10B model on 200B tokens, we find that the performance
  | of a 7B model continues to improve even after 1T tokens.
  |
  | Cerebras says:
  |
  | > For instance, training a small model with too much data results
  | in diminishing returns and less accuracy gains per FLOP
  |
  | But this is only a concern when you care about the training cost,
  | such as when you are a budget-limited researcher or a company
  | that doesn't deploy models at scale. When you care about the
  | total cost of deployment, then making a small model even better
  | with lots of data is a smart move. In the end, it matters more to
  | have the most efficient model at prediction time, not the most
  | efficient model to train.
 
  | skybrian wrote:
  | What's in the Pile training data they used? How much source code
  | does it include?
 
  | Garcia98 wrote:
  | I've been following open-source LLMs for a while, and at first
  | glance this doesn't seem too powerful compared to other open
  | models. Flan-Alpaca[0] is licensed under Apache 2.0, and it seems
  | to perform much better. Although I'm not sure about the
  | legalities of that licensing, since it's basically Flan-T5
  | fine-tuned using the Alpaca dataset (which is under a
  | non-commercial license).
  |
  | Nonetheless, it's exciting to see all these open models popping
  | up, and I hope that an LLM equivalent of Stable Diffusion comes
  | sooner rather than later.
  | [0]: https://github.com/declare-lab/flan-alpaca
 
  | ftxbro wrote:
  | Their goal isn't to make a powerful model. It's to show how well
  | compute-optimal models do on test loss as a function of
  | increasing model size. This function can be used, with some
  | caveats, to forecast the test loss of larger models, for which
  | compute-optimality becomes more important.
 
  | alchemist1e9 wrote:
  | Sounds like you might be the right person to ask the "big"
  | question.
  |
  | For a small organization or individual who is technically
  | competent and wants to try to do self-hosted inference:
  |
  | What open model is showing the most promise, and how do its
  | results compare to the various OpenAI GPTs?
  |
  | A simple example problem would be asking for a summary of code.
  | I've found OpenAI's GPT-3.5 and 4 to give pretty impressive
  | English descriptions of code. Running that locally in batch would
  | retain privacy, and even if slow it could just be kept running.
 
  | Garcia98 wrote:
  | Google's Flan-T5, Flan-UL2 and derivatives are so far the most
  | promising open (including commercial use) models that I have
  | tried; however, they are very "general purpose" and don't perform
  | well on specific tasks like code understanding or generation. You
  | could fine-tune Flan-T5 with a dataset that suits your specific
  | task and get much better results, as shown by Flan-Alpaca.
  |
  | Sadly, there's no open model yet that acts like a Swiss Army
  | knife and gets good-enough results across multiple use cases.
 
  | brucethemoose2 wrote:
  | FYI: Cerebras's nodes are _very_ different from your typical
  | Nvidia training nodes:
  |
  | https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...
  |
  | Each individual "chip" has _40GB of SRAM_ vs ~76MB for the Nvidia
  | H100, plus networked pools of external RAM, SSDs and such. That's
  | why the training architecture is so different.
 
  | arbuge wrote:
  | https://www.cerebras.net/product-chip/
  |
  | There's a comparison picture there of one of their chips
  | alongside a regular GPU chip. Effectively they use up the entire
  | wafer.
 
  | brucethemoose2 wrote:
  | Yeah, and that doesn't even do the nutty IO on these things
  | justice.
  |
  | A 16x CS-2 cluster like they describe is like a huge Nvidia
  | cluster in terms of throughput, but more like a single Nvidia
  | node structurally.
 
  | ipsum2 wrote:
  | Cerebras makes impressive hardware, but Nvidia still performs
  | better in every regard. The most telling factor is that Cerebras
  | claims they're too busy to run common benchmarks (e.g. MLPerf) to
  | compare against Nvidia.
 
  | alchemist1e9 wrote:
  | It's unbelievable stuff. Does anyone know how much a single box
  | costs? It looks like they are selling them.
 
  | freeqaz wrote:
  | If you have to ask, you can't afford it!
  |
  | Mostly teasing, but my guess would be $500k+, since they'll
  | likely price it at the same $ as the equivalent NVIDIA cluster
  | (or very close to it).
 
  | alchemist1e9 wrote:
  | Actually, if they are around $2M, it looks like my company can
  | afford one. Given this is just getting started, it looks
  | promising; I'm sure future generations will be more affordable.
 
  | ericd wrote:
  | On the order of several million USD for the second-gen system.
  | Last I heard, they're still at lowish volumes, selling some to
  | national labs and the like.
 
  | sbierwagen wrote:
  | The CS-1 costs "$2-3 million"; the CS-2 costs "several" million.
  |
  | A single Nvidia H100 costs around $30,000, so a GPU server with
  | every slot populated costs about $300,000.
 
  | brucethemoose2 wrote:
  | ServeTheHome claims "HGX A100 platforms, when they are sold as
  | single servers, are generally in the $130K-$180K range, even
  | leaving a very healthy margin for OEMs/resellers":
  |
  | https://www.servethehome.com/graphcore-celebrates-a-stunning...
  |
  | Not sure about the H100, but it seems to be more supply
  | constrained (hence pricier) atm.
  |
  | Now, the real question is how many HGX nodes "equal" a single
  | CS-2 node. The math here is extremely fuzzy, as the benefit of
  | such extreme node consolidation depends on the workload, and
  | while the CS-2 takes up less space, the HGX cluster will have
  | more directly accessible RAM and better turnkey support for
  | stuff, since it's Nvidia.
 
  | bubblethink wrote:
  | There is cloud pricing on the website:
  | https://www.cerebras.net/product-cloud/
 
  | alchemist1e9 wrote:
  | This is actually really important from my perspective. It looks
  | like an end user can work backwards from available inference
  | hardware, or an inference budget and required speed, then figure
  | out a viable model size, bring their own data, and then fine-tune
  | or train from scratch.
  |
  | This is getting so real so fast.
 
  | tombert wrote:
  | Has anyone tried this? I have 96GB of GPU memory; will that be
  | enough to run the biggest model?
 
  | cuuupid wrote:
  | 13B fits nicely even in a 3090 (24GB VRAM)!
 
  | Filligree wrote:
  | Only if you can use int8, which may or may not be convenient.
 
  | lxe wrote:
  | with load_in_8bit?
 
  | spi wrote:
  | I have not tried, but 96GB of GPU memory is plenty; for inference
  | there should certainly be no issue. Their biggest model has 13B
  | parameters, so you should be able to run inference (float16)
  | already with 32GB of memory.
  |
  | With 96GB of memory you should also be able to fine-tune it
  | (possibly some tricks like gradient accumulation and/or
  | checkpointing might be needed), but you have to be ready for many
  | days of computation...
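A rough rule of thumb behind the sizing estimates above - a minimal
back-of-envelope sketch using the usual bytes-per-parameter
approximation (not a figure from Cerebras; real usage adds overhead
for activations and the KV cache):
 
    # Python: approximate weights-only memory needed for inference.
    def inference_gib(params_billion: float, bytes_per_param: int) -> float:
        """Weights-only footprint in GiB for a dense model."""
        return params_billion * 1e9 * bytes_per_param / 2**30
 
    for name, size_b in [("Cerebras-GPT 13B", 13), ("LLaMA 7B", 7)]:
        for dtype, nbytes in [("float16", 2), ("int8", 1)]:
            print(f"{name} @ {dtype}: ~{inference_gib(size_b, nbytes):.0f} GiB")
 
    # Cerebras-GPT 13B @ float16 works out to ~24 GiB, consistent
    # with the ~32GB inference estimate above once overhead is
    # included; at int8 (~12 GiB) it squeezes into a 24GB 3090,
    # as noted in the replies.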
  | alchemist1e9 wrote:
  | > but you have to be ready for many days of computation...
  |
  | I was thinking: since we have API prices in tokens, and it now
  | looks like self-hosted inference on high-end GPUs works for
  | similar models, then based on electricity prices there will be a
  | self-hosted price per token. How close are these already? What is
  | the markup today over roughly the raw electricity cost that
  | OpenAI has?
 
  | wg0 wrote:
  | Noob to ML in practice: these models containing weights, all of
  | them - do they have a standard file/binary format?
 
  | examplary_cable wrote:
  | [I'm not an expert] but I believe .ckpt and .safetensors.
  |
  | The problem with .ckpt is that loading it can execute arbitrary
  | code on your machine (very unsafe), while .safetensors was made
  | by Hugging Face in order to have a safe format to store the
  | weights. I've also seen people load up the LLaMA 7B via a .bin
  | file.
 
  | eldenring wrote:
  | > Trained using the Chinchilla formula, these models provide the
  | highest accuracy for a given compute budget.
  |
  | I'm confused as to why 111-million-parameter models are trained
  | with the Chinchilla formula. Why not scale up the training data?
  | If you're training smaller models, surely optimizing performance
  | is better than optimizing total compute.
  |
  | Seems like a silly misunderstanding of the Chinchilla paper, but
  | I'm sure I'm missing something.
 
  | gamegoblin wrote:
  | True.
  | There was a good blog post published about this a few weeks ago:
  | https://finbarr.ca/llms-not-trained-enough/
  |
  | Money quote for those who don't want to read the whole thing:
  |
  | '''
  |
  | When people talk about training a Chinchilla-optimal model, this
  | is what they mean: training a model that matches their estimates
  | for optimality. They estimated the optimal model size for a given
  | compute budget, and the optimal number of training tokens for a
  | given compute budget.
  |
  | However, when we talk about "optimal" here, what is meant is
  | "what is the cheapest way to obtain a given loss level, in
  | FLOPS." In practice though, we don't care about the answer! This
  | is exactly the answer you care about if you're a researcher at
  | DeepMind/FAIR/AWS who is training a model with the goal of
  | reaching the new SOTA so you can publish a paper and get
  | promoted. If you're training a model with the goal of actually
  | deploying it, the training cost is going to be dominated by the
  | inference cost. This has two implications:
  |
  | 1) there is a strong incentive to train smaller models which fit
  | on single GPUs
  |
  | 2) we're fine trading off training time efficiency for inference
  | time efficiency (probably to a ridiculous extent).
  |
  | Chinchilla implicitly assumes that the majority of the total cost
  | of ownership (TCO) for a LLM is the training cost. In practice,
  | this is only the case if you're a researcher at a research lab
  | who doesn't support products (e.g. FAIR/Google Brain/DeepMind/
  | MSR). For almost everyone else, the amount of resources spent on
  | inference will dwarf the amount of resources spent during
  | training.
  |
  | '''
 
  | haldujai wrote:
  | While true, I think this also misses that "for almost everyone
  | else" you're probably not (or at least should not be) trying to
  | optimize zero-shot performance if you have an intended
  | high-inference use case, so I don't think Chinchilla would be all
  | that relevant.
 
  | vintermann wrote:
  | I have a suspicion that good zero-shot performance is a good
  | starting point for fine-tuning. If you have more than one
  | intended high-inference use case, or can imagine a couple of new
  | ones on the horizon, it might still be best not to target the
  | first use case directly.
 
  | haldujai wrote:
  | You're not wrong; the Chinchilla rationale is that it may be more
  | compute-efficient to obtain a given loss using larger model sizes
  | if the budget allows. As another commenter states, this ignores
  | the inference part of the equation.
  |
  | As an example, the BERT/RoBERTa family were trained for much
  | longer than Chinchilla would prescribe; you do get diminishing
  | returns, though.
  |
  | There is a point of overtraining where downstream performance is
  | impacted, but that's pretty high.
  |
  | I think part of the answer to this is also that xxx-million-
  | parameter decoder-only models don't seem to be that useful, so it
  | may not be worthwhile to optimize them for performance?
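To make the "Chinchilla formula" being debated here concrete, a
minimal sketch under the usual approximations (training compute
C ~ 6*N*D FLOPs, and the roughly 20-tokens-per-parameter ratio
Cerebras quotes elsewhere in this thread; the constants are
empirical fits, so this is illustrative only, not the paper's exact
method):
 
    # Python: compute-optimal model/data split for a FLOP budget.
    import math
 
    def chinchilla_optimal(compute_flops: float,
                           tokens_per_param: float = 20.0):
        # C = 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r))
        n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
        n_tokens = tokens_per_param * n_params
        return n_params, n_tokens
 
    n, d = chinchilla_optimal(1e22)  # e.g. a 1e22 FLOP budget
    print(f"~{n / 1e9:.1f}B params on ~{d / 1e9:.0f}B tokens")
    # -> ~9.1B params on ~183B tokens
 
The point eldenring and the blog post make is that this picks N and D
jointly for a fixed C; if N is instead fixed by your inference
hardware, nothing stops you from pushing D far past 20*N, as LLaMA
did.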
  | ftxbro wrote:
  | The point of those smaller models is the "Cerebras Scaling Law
  | for Compute-Optimal Training", which is the straight-line plot in
  | the image at the top of their webpage when you click the link.
  |
  | They want you to think it's reasonable that, because the line is
  | so straight (on a FLOPs log scale) for so long, it could be
  | tempting to extrapolate the Pile-loss consequences of continuing
  | compute-optimal training to larger models beyond their largest
  | 13B one, with the obvious caveat that the extrapolation can't
  | continue linearly much further, if for no other reason than that
  | the test loss isn't going to go below zero (it will flatten out
  | sooner than that).
  |
  | If you trained beyond compute-optimality on smaller models, it
  | would mess up their straight line and make it look like we are
  | hitting diminishing returns on test loss sooner.
 
  | simonw wrote:
  | "Cerebras open sources seven GPT-3 models from 111 million to 13
  | billion parameters."
  |
  | I don't understand why they describe them as GPT-3 models here,
  | as opposed to calling them GPT models. Or even LLMs - but I guess
  | that acronym isn't as widely recognized.
 
  | wsgeorge wrote:
  | I think GPT-3 is used as a benchmark for performance, so saying a
  | model is on par with GPT-3 should give you an idea of what you
  | can get out of it.
  |
  | IIRC most open source models to date - including the semi-open
  | LLaMAs - have GPT-3-like performance. Nothing gets close to
  | GPT-3.5 and beyond.
 
  | patientplatypus wrote:
  | [dead]
___________________________________________________________________
(page generated 2023-03-28 23:00 UTC)