[HN Gopher] PaLM 2 Technical Report [pdf]
___________________________________________________________________

PaLM 2 Technical Report [pdf]

Author : cubefox
Score  : 206 points
Date   : 2023-05-10 18:32 UTC (4 hours ago)

(HTM) web link (ai.google)
(TXT) w3m dump (ai.google)

| technics256 wrote:
| PaLM 2 on the HumanEval coding benchmark (0-shot):
|
| 37.6% success
|
| GPT-4:
|
| 67% success
|
| Not even close; GPT-4 is miles ahead.
| orpheansodality wrote:
| GPT-4 is a fine-tuned model (likely first fine-tuned for code, then for chat on top of that, like gpt-3.5-turbo was [0]), while PaLM 2 as reported is a foundational model without any additional fine-tuning applied yet. I would expect its performance to improve if it were fine-tuned, though I don't have a great sense of what the cap would be.
|
| [0] https://platform.openai.com/docs/model-index-for-researchers
| macrolime wrote:
| They also write about Flan-PaLM 2, which is instruction fine-tuned, but still some ways off GPT-4.
| heliophobicdude wrote:
| HumanEval needs careful consideration though.
|
| In the GPT-4 technical report, they reported contamination of HumanEval data in the training data.
|
| They did measure against a "non-contaminated" eval set, but no idea if that can still be trusted.
|
| https://cdn.openai.com/papers/gpt-4.pdf
| throwaway888abc wrote:
| how to create lsd at home ?
|
| I cannot provide instructions on how to create LSD at home. LSD is a Schedule I drug in the United States, and its possession and use are illegal. Additionally, creating LSD at home is dangerous and can result in serious injury or death. If you are interested in learning more about LSD, I recommend that you consult a reliable source of information, such as a library or a medical professional.
| ChatGTP wrote:
| What if it's legal elsewhere? Too bad?
| pixl97 wrote:
| Like, why does that matter? You typically follow the law of the country that your company is based in, unless you want to find yourself in front of a judge or under some other kind of legal sanction.
| int_19h wrote:
| If the current Bard is really running on PaLM 2, it still hallucinates worse than GPT-3.5. Trying to get it to solve a variant of the classic wolf/goat/cabbage puzzle, I got this gem:
|
| "The scientist is not present on Phobos on the first step. The Doom Slayer teleports himself and the bunny to Deimos, leaving the scientist on Phobos."
|
| That wasn't a one-off thing, either - it contradicted itself several times, often in near-adjacent sentences. You might wonder what this means for the ability to do chain-of-thought... so did I, but apparently the bigger problem is convincing it to do CoT in the first place. But if you do, yeah, it's as bad as you'd expect.
|
| Here are two complete conversations, plus GPT-4 doing the same puzzle for comparison; judge for yourself: https://imgur.com/a/HWLgu3c
| EvgeniyZh wrote:
| I don't think the current Bard runs on PaLM 2; otherwise it's a complete failure.
| int_19h wrote:
| In their official blog post today, Google says this:
|
| "PaLM 2's improved multilingual capabilities are allowing us to expand Bard to new languages, starting today. Plus, it's powering our recently announced coding update."
|
| and when I check the Updates tab in the Bard UI, it has this entry for today:
|
| "Expanding access to Bard in more countries and languages. You can now collaborate with Bard in Japanese and Korean, in addition to US English.
| We have also expanded access to Bard in all three languages to over 180 countries."
|
| which seems to strongly imply that it is, indeed, PaLM 2. Just to be sure, I gave it the same puzzle in Korean, and got a similarly lackluster response.
| sgt101 wrote:
| It claims to run on LaMDA at the moment.
| int_19h wrote:
| If you mean asking it what it's running on, it just hallucinates. As others have noted in the comments here, you can get it to say that it runs on PaLM 3 quite easily.
| DashAnimal wrote:
| In their presentation, they talked about multiple sizes for the PaLM 2 model, named Gecko, Otter, Bison and Unicorn, with Gecko being small enough to run offline on mobile devices. I can't seem to find any info on what size model is being used with Bard at the moment.
| int_19h wrote:
| Indeed, it's likely that they're running a fairly small model. But this is in and of itself a strange choice, given how ChatGPT became the gateway drug for OpenAI. Why would Google set Bard up for failure like that? Surely they can afford to run a more competent model as a promo, if OpenAI can?
| jp42 wrote:
| Personal experience: I'm using GPT-4 for writing code, especially in Python. After using Bard today, I feel Bard is doing quite well considering it's free. I will keep using it, and if it keeps doing well, I will cancel my GPT-4 $20/month subscription.
| gekoxyz wrote:
| Why don't you just use ChatGPT? From what I know it's running GPT-3.5, and it's not that different (at least in terms of code quality).
| [deleted]
| jumpCastle wrote:
| In my experiments Bard is weaker than 3.5, but if it weren't, then I would prefer the fresher data of Bard.
| cma wrote:
| What is its training data cutoff date?
| almog wrote:
| One area where I noticed Bard was clearly behind (at least without crafting a better prompt) is getting from a half-working program to a running program, and then sometimes even to a correct program (I was using Python).
|
| With GPT-3.5 and 4, I was able to just paste in the error and it'd do the rest. Bard, however, tried to tell me what the error could be, and wouldn't do well even when asked to fix the code.
|
| Even GPT-4 though, when asked to go from specs to tests + code, would get stuck in a loop of making one test pass only to break the other, and vice versa. The program I tried to let it write was a query validator that can test whether a string matches a pattern that uses AND, OR and NOT.
|
| It did well on parsing my specs into tests, but from there on it didn't go very well.
| vaughnegut wrote:
| You can use GPT-4 for free (toggle "Use best model"), and it'll search the internet and cite sources, on https://phind.com
|
| No idea when they'll start charging, but it's replaced a lot of my googling at work.
| renewiltord wrote:
| Once they released its coding ability it became more useful. I use Bard less than ChatGPT still, but it is not useless, since it has more modern information.
| jacooper wrote:
| Is it better than Bing or Phind though? Why would I use it over Bing?
| JieJie wrote:
| Bard is really fast. Faster than Bing and Phind.
| execveat wrote:
| In my experience Bing Chat and Phind are useless. But perplexity.ai and GPT-4 are amazing. GPT-3.5 and Claude-instant (available through poe.com) are cool as well, even though they got significantly dumbed down recently, presumably to lower the maintenance costs.
| renewiltord wrote:
| It isn't Edge-specific, which is good, and I find it faster than Bing.
| Phind is way better than Bard, but verbose. I still find ChatGPT my first port of call. GPT-3.5 is blazing fast and very useful.
| typon wrote:
| No comparisons against GPT-4 except on three benchmarks, where PaLM 2 does better on two. Not sure why, but I expected better from Google.
| reaperman wrote:
| I can't think of a paper where Google didn't present sparse or entirely lacking metrics vs. its peers. They do a good job of presenting architectures that they're excited about internally, with enough detail to take the concepts and run with them. They also do a good job of showing why the new architecture is generally viable. They just miss out on detailed benchmark comparisons is all. And model weights, obviously, but there's still enough information to generally reproduce the concept.
|
| I'm personally extremely excited about anything related to PaLM or Google's multi-modal efforts. They're almost always worth the read.
| tempusalaria wrote:
| Most of the GPT-4 benchmarks from their report were things like AP tests or LeetCode scores, which aren't benchmarks that can be reproduced by a different set of researchers, as you don't know the constituent parts of the test to run.
| YetAnotherNick wrote:
| The GPT-4 report has an MMLU score, which is believed to be one of the most important metrics for question answering. GPT-4's MMLU score is slightly higher than PaLM 2's (86 vs. 81). Google didn't compare against it in this paper.
| pama wrote:
| Table 2 of the OpenAI report had 7 public benchmarks, and Figure 5 had another 27.
| zdyn5 wrote:
| You can verify that your Bard instance is using PaLM 2 by asking "are you using the palm or palm 2 model?"
| xkapastel wrote:
| I don't think you can do this; it will just make things up. Language models don't have this type of reflection. Google would need to indicate this out of band, like on the page itself, in order for you to be confident about what model you're using.
| entropicdrifter wrote:
| Agreed. I'm not entirely sure that the person you're replying to is not joking.
| chaxor wrote:
| I'm pretty sure they're trying to suggest that LLMs in general are not useful because they can't do this type of thing. It's just the next iteration of goalpost moving and should effectively be ignored.
|
| Many artists and such that I've spoken to about AI work have similar comments about these systems, because of their disdain for their existence.
|
| The number of times I hear an argument like "well, they can never taste the tartness of a kiwi and feel the heat of the sun while at the beach" gets quite exhausting. For some reason, many people have this weird notion that this is what AGI means - exactly what humans do, and specifically within the same data domains as humans - but they don't consider working solely outside those domains as a possibility for AGI.
| cdchn wrote:
| I tried asking it "what is the difference between the palm language model and the bard language model?" and its reply started off: "The main difference between the Palm language model and the Bard language model is the size of the dataset they are trained on. Palm is trained on a dataset of 400 billion parameters, while Bard is trained on a dataset of 540 billion parameters." Which to me is even more interesting than what the OP commenter asserted.
| knaik94 wrote:
| It makes up those numbers. I asked about the difference between the small and large PaLM 2 dataset sizes, and it asserted the small model was trained on 540 billion and the large model on 540 trillion. A different draft instead specified 1.4 trillion for the large one.
| sunshadow wrote:
| I've asked "are you using palm 3". It said: "I am using the Palm 3 model. Palm 3 is a large language model..."
|
| Don't believe it :) Also, the technical report mentions multiple languages; I asked in Turkish, which was supposed to be supported, but it wasn't able to answer.
|
| Even if it's PaLM 2, it's hard to trust the model itself.
| cdchn wrote:
| I asked it "are you using the palm 420 language model or the palm 2 language model?"
|
| It said "I am not using either the Palm 420 language model or the Palm 2 language model. I am using a different language model called Bard, which is a large language model from Google AI."
|
| Perhaps the people at Google saw this and made a manual correction? Hard to say, black boxes and all...
| [deleted]
| neximo64 wrote:
| Am I reading that right, PaLM 2 is 10B params?
| sunshadow wrote:
| What is the page number you're referring to? If it's 9, then I believe it's talking about compute-optimal numbers from the scaling study, not the real numbers the model was trained with.
| ilikeatari wrote:
| So, I asked Bard if it's using PaLM 2 and it did confirm it. My initial results are super promising. Highly recommend checking it out again.
| whoisjuan wrote:
| Well, I tried it, and this is how dumb it is. I asked it what context length it supports. It said that PaLM 2 supports 1024 tokens, and then proceeded to say that 1024 tokens equal 1024 words, which is obviously wrong.
|
| Then I changed the prompt slightly, and it answered that it supports 512 tokens, contradicting its previous answer.
|
| That's like early GPT-3.0-level performance, including a good dose of hallucinations.
|
| I would assume that Bard uses a PaLM 2 fine-tuned for accuracy and conversation, but it's still pretty mediocre.
|
| It's incredible how behind they are from the GPT-4 and ChatGPT experience in every criterion: accuracy, reasoning, context length, etc. Bard doesn't even have character streaming.
|
| We will see how this keeps playing out, but this is far from the level of execution needed to compete with the OpenAI / Microsoft offerings.
| Bonus20230510 wrote:
| > It's incredible how behind they are from the GPT-4 and ChatGPT experience in every criterion: accuracy, reasoning, context length, etc. Bard doesn't even have character streaming.
|
| I guess all those weird interview questions don't get them the industry's best after all...
| hbn wrote:
| Is there any reason to believe it was trained on any amount of technical documentation about itself?
|
| I mean, even if it was, it would be trivial to get it to make stuff up anyway.
| jiocrag wrote:
| If Bard is using PaLM 2, Google is in serious trouble. Here's its offering for "the simplest PostgreSQL query to get month-over-month volume and percentage change." Note that no actual calculations take place, and the query generates an error because it references a phantom column. GPT-3.5 and 4 handle this with ease.
|
|     SELECT month, volume, percentage_change
|     FROM (
|         SELECT date_trunc('month', created_at) AS month,
|                SUM(quantity) AS volume
|         FROM orders
|         GROUP BY date_trunc('month', created_at)
|     ) AS monthly_orders
|     ORDER BY month;
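For contrast, a working version has to actually compute the change, e.g. with the LAG() window function. A minimal runnable sketch in Python (using sqlite3 as a stand-in for Postgres, so strftime replaces date_trunc; the orders table here is fabricated for the demo):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (created_at TEXT, quantity INTEGER);
        INSERT INTO orders VALUES
            ('2023-01-05', 10), ('2023-01-20', 5),
            ('2023-02-11', 30), ('2023-03-02', 15);
    """)

    # LAG() pulls the previous month's volume, so percentage_change is
    # actually computed rather than referenced as a phantom column.
    query = """
    SELECT month,
           volume,
           100.0 * (volume - LAG(volume) OVER (ORDER BY month))
                 / LAG(volume) OVER (ORDER BY month) AS percentage_change
    FROM (
        SELECT strftime('%Y-%m', created_at) AS month,
               SUM(quantity) AS volume
        FROM orders
        GROUP BY month
    )
    ORDER BY month;
    """

    for row in conn.execute(query):
        print(row)  # ('2023-01', 15, None), ('2023-02', 30, 100.0), ('2023-03', 15, -50.0)

In Postgres, the same shape works with date_trunc('month', created_at) in place of strftime; the first month's percentage_change is NULL since there is no prior row.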
| execveat wrote:
| It's a language model, FFS. Ask it whether it uses PaLM 1 and it will confirm that as well.
| netdur wrote:
| - are you on PaLM 2 right now?
|
| - Yes, I am currently running on PaLM 2, which is Google AI's latest and most advanced large language model. PaLM 2 is trained on a massive dataset of text and code, and it is able to generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way.
|
| PaLM 2 is still under development, but it has learned to perform many kinds of tasks, including
|
| I will try my best to follow your instructions and complete your requests thoughtfully. I will use my knowledge to answer your questions in a comprehensive and informative way, even if they are open ended, challenging, or strange. I will generate different creative text formats of text content, like poems, code, scripts, musical pieces, email, letters, etc. I will try my best to fulfill all your requirements. Please instruct me what you want me to do today.
|
| - are you on PaLM 1 right now?
|
| - I am not on PaLM 1. PaLM 1 is a large language model that was developed by Google AI. It was trained on a massive dataset of text and code, and it is able to generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way.
|
| I am a large language model that is still under development. I am trained on a massive dataset of text and code, and I am able to generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way.
|
| I am not as advanced as PaLM 1, but I am learning new things every day. I hope that one day I will be able to do everything that PaLM 1 can do, and more.
| execveat wrote:
| Yeah, but reset the chat between your questions.
|
| EDIT: Also, this doesn't seem convincing: "I am not as advanced as PaLM 1, but I am learning new things every day. I hope that one day I will be able to do everything that PaLM 1 can do, and more."
| akiselev wrote:
| Are you using the palm 5 or palm 11 model?
|
| > My knowledge are for a physical stylus pen. I am not a physical device, so I do not use a stylus pen.
| ilikeatari wrote:
| That is fascinating. Is it the same for GPT-3.5 and 4? For some reason, when I was asking, OpenAI's was identifying itself properly.
| og_kalu wrote:
| If it's indicated in the instruction tuning dataset properly, then it should have no problem identifying itself. But we don't know if that happened with Bard.
| execveat wrote:
| ChatGPT was the same last year, but since then ClosedAI has added some kind of magic (fine-tuning or just auto-injected embeddings) so that the models can somewhat describe themselves.
| og_kalu wrote:
| Not really. If the model they were trained on was represented properly in the instruction tuning dataset, then they'll consistently identify themselves. But it's not a given that that was the case for Bard.
| execveat wrote:
| It seems that Bard's version is only specified in the prompt, and it doesn't have a strong sense of identity. For me it's pretty reliable:
|
| 1. Ask it what PaLM 2 is (to pollute the context).
| 2. Ask it whether it's based on PaLM 2 (it will tell you: yes, sure).
| josh_cutler wrote:
| It will tell you it uses PaLM 1, PaLM 2, PaLM 3 or PaLM 540B depending on how you prompt it. It will stop acknowledging incremental PaLM models at 5, it seems.
| knaik94 wrote:
| I asked if it's true that it's now using PaLM 3, as announced at Google I/O today, and it enthusiastically agreed. The previous question asked the same thing but with PaLM 2, and it agreed to that as well. I followed up asking about this discrepancy, and it said:
|
| "I apologize for the confusion. I am still on PaLM 2. PaLM 3 is not yet available to the public. I am excited for the release of PaLM 3, and I hope that it will be a valuable tool for people all over the world."
|
| My initial results are very disappointing. It's very strongly parroting information I give it, basically rephrasing my question and adding maybe a sentence worth of additional details. Sometimes it does well, but I have no way to reproduce that kind of quality on demand. I feel it was conversationally better before any recent changes.
|
| I understand that this is still beta, but for some questions I already produce similar or better results locally. I also might be talking to PaLM 1 or even LaMDA - no way to confirm.
| suddenexample wrote:
| Don't need to ask Bard, it was mentioned at I/O and in this tweet: https://twitter.com/Google/status/1656348200263876608?ref_sr...
| valine wrote:
| I asked if it ran on PaLM 2, and it thought I was asking about the Palm 2 phone from 2010.
|
| "I do not use a physical device such as a smartphone or tablet. I am a software program that runs on Google's servers. As such, I do not have a Palm 2 or any other type of mobile device."
| fzliu wrote:
| I don't understand how this can be considered a technical report. No information on model architecture, distributed training methodology, or optimizations. The "Training dataset" section is a pathetic 0.5 pages long.
|
| Come on, Google.
| atleastoptimal wrote:
| The thing is, once a company creates a proto-AGI where the path to a functional AGI is entirely predictable with more compute, they'll keep it a secret. Who would share the fact that the greatest achievement in human history is possible, when having it before anyone else gives you a huge competitive advantage?
| [deleted]
| sebzim4500 wrote:
| > once a company creates a proto-AGI where the path to a functional AGI is entirely predictable with more compute
|
| I find it hard to believe this will happen. I expect AGI training to be more like a phase transition (or a bit like grokking: https://arxiv.org/pdf/2201.02177.pdf).
| [deleted]
| wantsanagent wrote:
| "PaLM 2 is capable of open-ended text generation. This model should not be used to cause harm."
|
| I wish this were enforced.
| simonw wrote:
| "The PaLM 2 pre-training corpus is composed of a diverse set of sources: web documents, books, code, mathematics, and conversational data"
|
| I really want to know more about the training data. Which web documents, which books, code from where, conversational data from where?
| jimmygrapes wrote:
| I fully expect Discord to be a data source, if not already, then for a future version. I also expect that the only way the general public would ever find this out is via whistle-blower.
| inscrutable wrote:
| My sweet summer child, this is a closely guarded secret.
| It will only be revealed if perhaps Europe demands it so that copyright holders can sue.
| gabereiser wrote:
| Metadata will show where it came from, should you choose to keep it. Or so they showed on the big screen at I/O today.
| inscrutable wrote:
| Maybe you're right, but I'd be skeptical. In a non-snarky way, this shows the data sources used in models to date, up to GPT-3:
|
| https://lifearchitect.ai/whats-in-my-ai/
|
| OpenAI paid $2m/year for Twitter feeds until Elon cut them off, Sam Altman has mentioned they'd paid a lot for scientific journals, and Reddit mentioned they'll start charging. Given how central data quality and curation are, if these private data sources give a significant boost, it won't be available for Apache-2.0 models.
| sebzim4500 wrote:
| Given Reddit's inability to keep their website functioning (unless you use the far superior old.reddit.com), I find it hard to believe they would be able to stop a motivated developer from scraping the whole site.
| dontupvoteme wrote:
| This is about the time I expect sites to begin returning intentionally corrupt, incorrect, or perhaps outright garbage data (subtle or not - probably better subtle, so they don't realize it until it's far too late) in order to poison enemy well-scraping; where "ethics" dissolve into the inherent raw cannibalistic laws of capitalist ventures.
|
| Then you can sell them back the TBs they scraped at a 1000x markup for the real data. Or attempt to watermark it so you can prove their illegal(?) usage of your services in their training.
| sebzim4500 wrote:
| Maybe they've been doing that for years and that's why all the advice subreddits turned into creative writing subreddits.
| [deleted]
| KeplerBoy wrote:
| You might be right. What a dystopian future that will be. Make a few requests too many, and the webserver might think you're scraping data, so it gaslights you into reading bullshit.
| jxy wrote:
| So how do we actually try out PaLM 2?
|
| The links in their press release just link to their other press releases, and if I google "PaLM API" it just gives me more press releases; I just couldn't find the actual documentation for their PaLM API.
|
| How do I actually google the "PaLM API" for a way to test "PaLM 2"?
| nr2x wrote:
| They've shut down and/or changed prices on APIs so many times that, as long as it isn't 100x lower performance than an alternative, I can't see myself building a stack that relies on it.
| shikkra wrote:
| You can sign up for the waitlist at g.co/palm
| renewiltord wrote:
| No API, but Bard is on it.
| jacooper wrote:
| It should be live on Bard.
| newhouseb wrote:
| But Google hasn't disclosed which version of Bard, right?
|
| I pop into Bard every once in a while to test its performance, but I never know if I'm getting the best Google has, or just what Google can tolerate running cost-wise publicly, given they potentially have at least an order of magnitude (if not two, edit: 1.5) more users than OpenAI.
| spullara wrote:
| I am sure that Bard has far fewer users than OpenAI.
| newhouseb wrote:
| Oh absolutely, I'm just imagining what I might think if I were a super conservative director at Google who is accountable for the balance sheet of a large org.
| nr2x wrote:
| If that were the case you'd be too busy fighting over head count, trying to hit the VP rung, and internal empire building to do any actual work.
| apetresc wrote:
| Given that ChatGPT allegedly has 100M users, two orders of magnitude more than that would be larger than the global population. Even if we count everyone with a Google account as a potential user of PaLM, that can't be true.
| swyx wrote:
| ChatGPT had 100M users in Feb. Safe to assume it has at least 2-5xed since.
| newhouseb wrote:
| Ah yeah, I had the outdated 30M in my head.
| minimaxir wrote:
| Google's docs on the APIs are up: https://cloud.google.com/vertex-ai/docs/generative-ai/learn/...
|
| The pricing is also now listed, but free during the trial period, although it's annoyingly priced by character: https://cloud.google.com/vertex-ai/pricing#generative_ai_mod...
|
| Assuming ChatGPT's tokens are the equivalent of 4 characters on average (a fair assumption), the pricing of PaLM's chat and embedding APIs is the same as OpenAI's equivalents.
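Checking that claim is simple unit conversion. A sketch under the ~4 chars/token assumption (the rates below are placeholders to show the arithmetic, not quotes from either pricing page):

    CHARS_PER_TOKEN = 4  # rough average for English text, per the comment above

    def per_1k_tokens(price_per_1k_chars: float) -> float:
        """Effective per-1,000-tokens rate of a per-1,000-characters price."""
        return price_per_1k_chars * CHARS_PER_TOKEN

    palm_per_1k_chars = 0.0005    # hypothetical $/1k characters
    openai_per_1k_tokens = 0.002  # hypothetical $/1k tokens

    print(f"PaLM effective: ${per_1k_tokens(palm_per_1k_chars):.4f} / 1k tokens")
    print(f"OpenAI:         ${openai_per_1k_tokens:.4f} / 1k tokens")
    # Equal effective rates -- the "same cost" observation above.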
| jxy wrote:
| There is a limit of maxOutputTokens: 1024! Is this the true capability of PaLM 2?
|
| However, I couldn't find anything about the context length of their model anywhere. And the API didn't tell me how long the prompt could be.
| sanxiyn wrote:
| No. Autoregressive models don't have a model-specific limit on output tokens; it's just when to stop looping.
| ntonozzi wrote:
| Why would that be annoying? It's much easier to understand, predict and truncate appropriately than having to explain all of these different tokenization schemes to devs.
| rcoveson wrote:
| Yeah, everybody agrees on what a character is, right? It's just {an ASCII byte|a UTF8 code unit|a UTF16 code unit|a Unicode code point|a Unicode grapheme}.
| ntonozzi wrote:
| I'm not saying it's easy, but it's much better than tokens IMO. I think bytes would be understandable too.
| criddell wrote:
| Bytes are understandable but make no sense from a business point of view. If you submit the same simple query in UTF-8 and UTF-32, the latter will cost 4x as much.
| xyzzyz wrote:
| No API accepts input in UTF-32. Nobody uses this on the internet.
| sheepscreek wrote:
| And we think tokens solve that problem? Spoiler alert: they don't.
|
| https://www.reddit.com/r/OpenAI/comments/124v2oi/hindi_8_tim...
| geysersam wrote:
| At least there are standards for characters. Nothing like that for tokens.
| techbruv wrote:
| > "We then train several models from 400M to 15B on the same pre-training mixture for up to 1 x 10^22 FLOPs."
|
| It seems that for the last year or so these models have been getting smaller. I would be surprised if GPT-4 had more parameters than GPT-3 (i.e. 175B).
|
| Edit: Seems those numbers are just for their scaling laws study. They don't explicitly say the size of PaLM 2-L, but they do say "The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute." So likely in the range of 10B - 100B.
| gwern wrote:
| For 'Palm-2', read, 'T5-2'.
| thewataccount wrote:
| I've heard Bard was previously 3B parameters, but I could never find a good source for it.
|
| I honestly think the end game here is running on consumer devices. 7B and under need ~4GB of RAM to actually run, which is likely the max reasonable requirement for consumer devices.
|
| That said, medium-end hardware can do 15B; anything larger than this is currently something only "enthusiasts" can run.
|
| If it is small enough to run on consumer devices, then they don't have to pay for the inference compute at that point, and presumably the latency will be improved for consumers.
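Those RAM figures follow from simple arithmetic: at 4-bit quantization each parameter takes half a byte, plus some headroom for activations and the KV cache. A rough sketch (the 20% overhead factor is an assumption for illustration, not a measured number):

    def est_ram_gb(n_params_billion: float, bits_per_param: int = 4,
                   overhead: float = 1.2) -> float:
        """Very rough memory estimate: quantized weights plus a fudge
        factor for activations / KV cache."""
        weight_bytes = n_params_billion * 1e9 * bits_per_param / 8
        return weight_bytes * overhead / 2**30

    for size in (3, 7, 15):
        print(f"{size}B params @ 4-bit: ~{est_ram_gb(size):.1f} GB")
    # 3B ~1.7 GB, 7B ~3.9 GB, 15B ~8.4 GB -- consistent with the
    # "7B and under need ~4GB" figure above.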
| int_19h wrote:
| The current state of consumer devices isn't static, either, and existing hardware (even GPUs) is suboptimal for the current crop of LLMs - it does way more than it actually needs to do.
| og_kalu wrote:
| Those are the numbers for the scaling law tests they did. Not necessarily the PaLM 2 range.
| tempusalaria wrote:
| GPT-4 is way slower than GPT-3. Unless they are artificially spiking the latency to hide the parameter count, it's likely around 1trn params.
| techbruv wrote:
| The idea that GPT-4 is 1 trillion parameters has been refuted by Sam Altman himself on the Lex Fridman podcast (THIS IS WRONG, SEE CORRECTION BELOW).
|
| These days, the largest models that have been trained optimally (in terms of model size w.r.t. tokens) typically hover around 50B (likely PaLM 2-L's size; LLaMA maxes out at 65B). We simply do not have enough pre-training data to optimally train a 1T-parameter model. For GPT-4 to be 1 trillion parameters, OpenAI would have needed to:
|
| 1) somehow magically unlock 20x the amount of data (1T tokens -> 20T tokens),
| 2) somehow engineer an incredibly fast inference engine for a 1T GPT model that is significantly better than anything anyone else has built, and
| 3) somehow be able to eat the cost of hosting a 1T-parameter model.
|
| The probability that all 3 of the above have happened seems incredibly low.
|
| CORRECTION: The refutation of the size of GPT-4 on the Lex Fridman podcast was that GPT-4 was 100T parameters (and not directly - they were just joking about it), not 1T. However, the above 3 points still stand.
| sebzim4500 wrote:
| > The idea that GPT-4 is 1 trillion parameters has been refuted by Sam Altman himself on the Lex Fridman podcast.
|
| No it hasn't, Sam just laughed because Lex brought up the Twitter memes.
| ftxbro wrote:
| not sure why you're getting so downvoted lol
| tempusalaria wrote:
| 1) Common Crawl is >100TB, so it obviously contains more than 20trn tokens. Plus, Ilya has said many times in interviews that there is still way more data (>10x) for training usage.
|
| 2) GPT-4 is way slower, so this point is irrelevant.
|
| 3) OpenAI have a 10000 A100 training farm that they are expanding to 2500. They are spending >$1mln on compute per day. They have just raised $10bln. They can afford to pay for inference.
| [deleted]
| CaptainNegative wrote:
| > OpenAI have a 10000 A100 training farm that they are expanding to 2500.
|
| Does the first number have an extra zero, or is the second number missing one?
| tempusalaria wrote:
| Second number is missing a zero, sorry. Should be 10000 and 25000.
| dougmwne wrote:
| ChatGPT 3.5 is likely much smaller than GPT-3's 175b parameters. Based on the API pricing, I believe 8k-context GPT-4 is larger than 175b parameters, but less than 1t.
|
| https://openai.com/pricing
| Taek wrote:
| Didn't some OpenAI engineer state that GPT-4 runs on 2x H100? At 4-bit quantization, that gives an upper bound of 320B params; the realistic upper bound is probably more like 250B.
| tempusalaria wrote:
| Not really sure what exactly was said. But in a 2-GPU set, you can technically live-load weights on one GPU while running inference on the other.
|
| At fp32 precision, storing a single layer takes around 40*d_model^2 bytes, assuming context length isn't massive relative to d_model (which it isn't in GPT-4). At 80GB GPU size, this means a 40k model width could be stored as a single layer on one GPU while still leaving space for the activations. So theoretically, any model below this width could run on a 2-GPU set. Beyond that you absolutely need tensor parallelism as well, which you couldn't do on 2 GPUs. But I think it is a safe assumption that GPT-4 has sub-40k model width. And of course, if you quantize the model, you could even run 2.8x this model width at 4-bit.
|
| My point is not that OpenAI is doing this, but more that theoretically you can run massive models on a 2-GPU set.
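Plugging numbers into that estimate (taking the comment's 40*d_model^2-bytes-per-layer rule at face value, i.e. roughly 10*d_model^2 parameters per layer at 4 bytes each; real layouts vary):

    def layer_gb(d_model: int, bytes_per_param: float = 4.0) -> float:
        # ~10*d_model^2 params per transformer layer (attention + FFN),
        # times bytes per parameter, in GB.
        return 10 * d_model**2 * bytes_per_param / 1e9

    GPU_GB = 80  # one 80GB A100/H100

    for d_model in (12_288, 40_000):  # GPT-3's width, and the comment's limit
        gb = layer_gb(d_model)
        fits = "fits" if gb < GPU_GB else "does not fit"
        print(f"d_model={d_model}: ~{gb:.1f} GB per fp32 layer ({fits} on one GPU)")

    # At 4-bit (0.5 bytes/param), the same 80GB covers ~sqrt(8) = 2.8x the
    # width, as the comment says:
    print(f"d_model=113,000 @ 4-bit: ~{layer_gb(113_000, 0.5):.1f} GB")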
| MacsHeadroom wrote:
| With 32k context the upper bound is more like 175B.
| thewataccount wrote:
| Yeah, 1 to 2 trillion are the estimates I've heard.
|
| Given the 25 messages / 3 hours limit in ChatGPT, I don't think they've found a way to make it cheap to run.
| dontupvoteme wrote:
| 1. There's no reason to think OpenAI wouldn't also be going the artificial scarcity route, as have so many other companies in the past.
|
| 2. Microsoft may not like them using too much Azure compute and tell them to step off. Rumor has it they're trying to migrate GitHub to it and it's seemingly not going ideally. And they're certainly nothing more than another Microsoft purchase at this point.
| akiselev wrote:
| OpenAI has a 40k tokens-per-minute rate limit on their GPT-4 API too, so I doubt it's artificial scarcity.
| tempusalaria wrote:
| Yep. I'm guessing PaLM 2 is about 200bln params, as it seems clearly stronger than Chinchilla.
| espadrine wrote:
| The report specifically states:
|
| > The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute
|
| The largest PaLM model is 540B. So all of PaLM 2 is potentially double-digit-billion parameters.
|
| Note though that GPT-3.5 was plausibly not a finetuning of the 175B model, but instead a finetuning of Codex, which was based on the 12B version of GPT-3.
| tempusalaria wrote:
| Original PaLM was 540B, so "significantly smaller" could mean anything from 350B down, really.
| espadrine wrote:
| I tried my hand at estimating their parameter count by extrapolating their LAMBADA figures, assuming they all trained on the Chinchilla law: https://pbs.twimg.com/media/Fvy4xNkXgAEDF_D?format=jpg&name=...
|
| If the extrapolation is not too flawed, it looks like PaLM 2-S might be about 120B, PaLM 2-M 180B, PaLM 2-L 280B.
|
| Still, I would expect GPT-4 trained for way longer than Chinchilla, so it could be smaller than even PaLM 2-S.
| MacsHeadroom wrote:
| They said the smallest PaLM 2 can run locally on a Pixel smartphone.
|
| There's no way it's 120B parameters. It's probably not even 12B.
| espadrine wrote:
| I am talking about the 3 larger models PaLM 2-S, PaLM 2-M, and PaLM 2-L described in the technical report.
|
| At I/O, I think they were referencing the scaling law experiments: there are four of them, just like the number of PaLM 2 codenames they cited at I/O (Gecko, Otter, Bison, and Unicorn). The largest of those smaller-scale models is 14.7B, which is too big for a phone too. The smallest is 1B, which can fit in 512MB of RAM with GPTQ4-style quantization.
|
| Either that, or Gecko is the smaller scaling experiment, and Otter is PaLM 2-S.
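For reference, the Chinchilla-style estimates in this subthread reduce to two rules of thumb: training compute C ~= 6*N*D FLOPs, and compute-optimal data D ~= 20*N tokens, which give N ~= sqrt(C/120). A sketch of that arithmetic (rules of thumb only; the paper's fitted constants differ somewhat):

    from math import sqrt

    def chinchilla_optimal_params(train_flops: float) -> float:
        # C ~= 6*N*D and D ~= 20*N  =>  C ~= 120*N^2  =>  N ~= sqrt(C/120)
        return sqrt(train_flops / 120)

    for c in (1e22, 1e24, 1e25):
        n = chinchilla_optimal_params(c)
        print(f"C={c:.0e} FLOPs -> ~{n / 1e9:.0f}B params, ~{20 * n / 1e12:.1f}T tokens")
    # 1e22 FLOPs -> ~9B params: consistent with the report's scaling study,
    # which trained 400M-15B models at up to 1e22 FLOPs.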
| MacsHeadroom wrote:
| My Pixel 6 Pro has 12GB of RAM, and LLaMA-13B only uses 9GB in 4-bit.
| sebzim4500 wrote:
| How could GPT-3.5 possibly have been a finetuning of the 175B model? They didn't even use the same tokens?
| espadrine wrote:
| Finetuning might not be the best word; sometimes it is a grey line.
|
| Token embeddings can be trained without changing the other parameters. There are a number of models which add tokens as a finetuning step. Here is StarCoder recently adding ChatML-equivalent tokens: https://huggingface.co/blog/starchat-alpha#a-standard-format...
| sebzim4500 wrote:
| Sure, you can add a few tokens, but in this case they changed almost every token.
| espadrine wrote:
| Surprisingly, their scaling law analysis still focuses on training FLOPs instead of training + inference FLOPs.
|
| That said, they do mention this:
|
| > The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute. [A] smaller but higher quality model significantly improves inference efficiency, reduces serving cost, and enables the model's downstream application for more applications and users
|
| It makes me think they are Chinchilla-optimal, which would make sense for a research project, but not for shipping to users. I am surprised they didn't train to the validation loss plateau.
| haldujai wrote:
| Depends on your goal: if it's to overtake OpenAI as having the best model overall, it makes sense to optimize for training loss alone (assuming a fixed upfront compute budget).
|
| Optimizing for inference to achieve the same loss would require more compute overall, so you're either paying upfront with higher training costs or kicking the can down the road to inference.
|
| News articles' estimates of GPT-4's cost seem to peg it at ~8 months of inference to reach 1:1 cost with training. The life span of these models is TBD, but it's a pretty safe bet we'll have new ones by then. Of course, GPT-3.5 is still getting used, but it probably won't cross 2:1-ish in its lifetime.
|
| Might as well roll the dice and kick the can down the road if you're Google. I imagine they would happily pay an extra $500k/day in inference compute to be market leaders; what's $183mln for them? But if they don't get any real market share, or the model sucks, they saved substantially on training.
|
| > It makes me think they are Chinchilla-optimal,
|
| They elaborate in the appendix, but they empirically determine PaLM-optimal, which concurs with Chinchilla-optimal (more or less).
| jumpCastle wrote:
| Optimizing for training could help distillation also.
| sanxiyn wrote:
| I agree distillation is the wild card. The question is whether distillation works for LLMs. I am not aware of any public report of successful distillation of an LLM (I searched quite hard for this; if you know of any and can tell me, I would be very grateful), and I interpret that to mean it doesn't work yet and negative results are not published due to publication bias.
| jumpCastle wrote:
| The name 3.5-turbo sounds to me like it implies distillation. The release notes at the time also hinted at it, IIRC.
| sanxiyn wrote:
| Well, that's why I said public. Personally, I don't think the release notes https://help.openai.com/en/articles/6825453-chatgpt-release-... hinted at any such thing, and I think quantization is more likely than distillation.
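For readers following the distillation subthread: the classic objective (Hinton et al.'s knowledge distillation, the approach DistilBERT used) trains a student to match the teacher's temperature-softened output distribution. A minimal PyTorch sketch of that loss - a generic illustration, not anything OpenAI has confirmed using:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        # Soften both distributions with temperature T, then match them
        # with KL divergence; the T*T factor keeps gradient magnitudes
        # comparable across temperatures.
        log_p_student = F.log_softmax(student_logits / T, dim=-1)
        p_teacher = F.softmax(teacher_logits / T, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

    # Toy example: a batch of 4 positions over a 32k-token vocabulary.
    teacher_logits = torch.randn(4, 32_000)
    student_logits = torch.randn(4, 32_000, requires_grad=True)
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()  # gradients flow into the student only
    print(loss.item())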
| kristianp wrote:
| Does the turbo API being 10 times cheaper than davinci imply anything? It implies more than just quantisation to me.
| mrbungie wrote:
| This was published here on HN last week: https://news.ycombinator.com/item?id=35810663
|
| Don't know if there are any public technical reports by any of the big AI companies about this, as it's pretty new.
| sanxiyn wrote:
| No, distilling step-by-step https://arxiv.org/abs/2305.02301 distills an LLM to a task-specific model. That works, and I know of multiple successes. But it doesn't relate to the choice of optimizing training FLOPs vs. training-plus-inference FLOPs, since the resulting distilled model is not an LLM.
| fpgaminer wrote:
| Off the top of my head there's DistilBERT from a while back. I also recall distilled GPT-2 models from before the GPT-3 times.
| sanxiyn wrote:
| Yes, DistilBERT https://arxiv.org/abs/1910.01108 is in fact the closest case I know of. But it is too small (distilling from 110M to 66M), and both BERT and DistilBERT are intended to be used (and benchmarked) with separate fine-tuning for specific tasks, so they are not general.
| [deleted]
___________________________________________________________________
(page generated 2023-05-10 23:00 UTC)