[HN Gopher] PaLM 2 Technical Report [pdf]
       ___________________________________________________________________
        
       PaLM 2 Technical Report [pdf]
        
       Author : cubefox
       Score  : 206 points
       Date   : 2023-05-10 18:32 UTC (4 hours ago)
        
 (HTM) web link (ai.google)
 (TXT) w3m dump (ai.google)
        
       | technics256 wrote:
       | PaLM 2 on HumanEval coding benchmark (0 shot):
       | 
       | 37.6% success
       | 
       | GPT-4:
       | 
       | 67% success
       | 
        | Not even close, GPT-4 is miles ahead.
        
         | orpheansodality wrote:
         | GPT-4 is a fine-tuned model (likely first fine-tuned for code,
         | then for chat on top of that like gpt-3.5-turbo was[0]), while
         | PaLM2 as reported is a foundational model without any
         | additional fine-tuning applied yet. I would expect its
         | performance to improve on this if it were fine-tuned, though I
         | don't have a great sense of what the cap would be.
         | 
          | [0] https://platform.openai.com/docs/model-index-for-researchers
        
           | macrolime wrote:
            | They also write about Flan-PaLM 2, which is instruction
            | fine-tuned but still some ways off GPT-4.
        
         | heliophobicdude wrote:
          | HumanEval needs careful consideration though.
          | 
          | In the GPT-4 technical report, they reported contamination of
          | HumanEval data in the training data.
          | 
          | They did measure against a "non-contaminated" subset, but no
          | idea if that can still be trusted.
         | 
         | https://cdn.openai.com/papers/gpt-4.pdf
        
       | throwaway888abc wrote:
       | how to create lsd at home ?
       | 
       | I cannot provide instructions on how to create LSD at home. LSD
       | is a Schedule I drug in the United States, and its possession and
       | use are illegal. Additionally, creating LSD at home is dangerous
       | and can result in serious injury or death. If you are interested
       | in learning more about LSD, I recommend that you consult a
       | reliable source of information, such as a library or a medical
       | professional.
        
         | ChatGTP wrote:
          | What if it's legal elsewhere? Too bad?
        
           | pixl97 wrote:
           | Like, why does that matter? You typically follow the law of
           | the country that your company is based in unless you want to
           | find yourself in front of a judge or under some kind of other
           | legal sanction.
        
       | int_19h wrote:
       | If the current Bard is really running on PaLM 2, it still
       | hallucinates worse than GPT-3.5. Trying to get it to solve a
       | variant of the classic wolf/goat/cabbage puzzle, I got this gem:
       | 
       | "The scientist is not present on Phobos on the first step. The
       | Doom Slayer teleports himself and the bunny to Deimos, leaving
       | the scientist on Phobos.
       | 
        | That wasn't a one-off thing, either - it contradicted itself
        | several times, often in near-adjacent sentences. You might
       | wonder what this means for the ability to do chain-of-thought...
       | so did I, but apparently the bigger problem is convincing it to
       | do CoT in the first place. But if you do, yeah, it's as bad as
       | you'd expect.
       | 
       | Here are two complete conversations, plus GPT-4 doing the same
       | puzzle for comparison; judge for yourself:
       | https://imgur.com/a/HWLgu3c
        
         | EvgeniyZh wrote:
          | I don't think the current Bard runs on PaLM 2; otherwise it's
          | a complete failure.
        
           | int_19h wrote:
           | In their official blog post today, Google says this:
           | 
           | "PaLM 2's improved multilingual capabilities are allowing us
           | to expand Bard to new languages, starting today. Plus, it's
           | powering our recently announced coding update."
           | 
           | and when I check the Updates tab in Bard UI, it has this
           | entry for today:
           | 
           | "Expanding access to Bard in more countries and languages.
           | You can now collaborate with Bard in Japanese and Korean, in
           | addition to US English. We have also expanded access to Bard
           | in all three languages to over 180 countries."
           | 
           | which seems to strongly imply that it is, indeed, PaLM 2.
           | Just to be sure, I gave it the same puzzle in Korean, and got
           | a similarly lackluster response.
        
             | sgt101 wrote:
             | it claims to run on LaMDA at the moment
        
               | int_19h wrote:
               | If you mean asking it what it's running on, it just
               | hallucinates. As others have noted in the comments here,
               | you can get it to say that it runs on PaLM 3 quite
               | easily.
        
             | DashAnimal wrote:
             | In their presentation, they talked about multiple sizes for
             | the PaLM 2 model, named Gecko, Otter, Bison and Unicorn,
             | with Gecko being small enough to run offline on mobile
             | devices. I can't seem to find any info on what size model
             | is being used with Bard at the moment.
        
               | int_19h wrote:
               | Indeed, it's likely that they're running a fairly small
               | model. But this is in and of itself a strange choice,
               | given how ChatGPT became the gateway drug for OpenAI. Why
               | would Google set Bard up for failure like that? Surely
               | they can afford to run a more competent model as a promo,
               | if OpenAI can?
        
       | jp42 wrote:
        | Personal experience: I'm using GPT-4 for writing code,
        | especially in Python. After using Bard today, I feel Bard is
        | doing quite well considering it's free. I will keep using it,
        | and if it keeps doing well, I will cancel my GPT-4 $20/month
        | subscription.
        
         | gekoxyz wrote:
         | why don't you just use chatGPT? from what i know it's running
         | GPT3.5 and it's not that different (at least in terms of code
         | quality)
        
           | [deleted]
        
           | jumpCastle wrote:
            | In my experiments Bard is weaker than 3.5, but if it
            | weren't, then I would prefer Bard's fresher data.
        
             | cma wrote:
             | What is its training data cutoff date?
        
             | almog wrote:
              | One area where I noticed Bard was clearly behind (at
              | least without crafting a better prompt) is getting from a
              | half-working program to a running program, then sometimes
              | even to a correct program (I was using Python).
             | 
             | With GPT 3.5 and 4, I was able to just paste in the error
             | and it'd do the rest. Bard however tried to tell me what
             | the error could be, and wouldn't do well even when asked to
             | fix the code.
             | 
              | Even GPT-4 though, when asked to go from specs to tests +
              | code, would get stuck in a loop of making one test pass
              | only to break the other, and vice versa. The program I
              | tried to let it write was a query validator that can test
              | whether a string matches a pattern that uses AND, OR and
              | NOT.
             | 
             | It did well on parsing my specs into tests, but from there
             | on it didn't go very well.
        
         | vaughnegut wrote:
          | You can use GPT-4 for free on https://phind.com (toggle "Use
          | best model"), and it'll search the internet and cite sources.
         | 
         | No idea when they'll start charging, but it's replaced a lot of
         | my googling at work
        
       | renewiltord wrote:
       | Once they released its coding ability it became more useful. I
       | use Bard less than ChatGPT still, but it is not useless since it
       | has more modern information.
        
         | jacooper wrote:
          | Is it better than Bing or Phind though? Why would I use it
          | over Bing?
        
           | JieJie wrote:
           | Bard is really fast. Faster than Bing and Phind.
        
           | execveat wrote:
            | In my experience Bing Chat and Phind are useless. But
            | perplexity.ai and GPT-4 are amazing. GPT-3.5 and Claude
            | Instant (available through poe.com) are cool as well, even
            | though they got significantly dumbed down recently,
            | presumably to lower the maintenance costs.
        
           | renewiltord wrote:
           | It isn't Edge-specific which is good and I find it faster
           | than Bing. Phind is way better than Bard, but verbose. I
           | still find ChatGPT my first port of call. GPT-3.5 is blazing
           | fast and very useful.
        
       | typon wrote:
       | No comparisons against GPT-4 except on three benchmarks where
       | PaLM 2 does better on two. Not sure why, but I expected better
       | from Google.
        
         | reaperman wrote:
         | I can't think of a paper where Google didn't present sparse or
         | entirely lacking metrics vs. its peers. They do a good job of
          | presenting architectures that they're excited about
          | internally, with enough detail to take the concepts and run
          | with them. They also
         | do a good job of showing why the new architecture is generally
         | viable. They just miss out on detailed benchmark comparisons is
         | all. And model weights, obviously, but there's still enough
         | information to generally reproduce the concept.
         | 
         | I'm personally extremely excited about anything related to PaLM
         | or google's multi-modal efforts. They're almost always worth
         | the read.
        
         | tempusalaria wrote:
          | Most of the GPT-4 benchmarks from their report were things
          | like AP tests or LeetCode scores, which aren't benchmarks
          | that can be compared by a different set of researchers, as
          | you don't know the constituent parts of the test to run.
        
           | YetAnotherNick wrote:
            | The GPT-4 report has an MMLU score, which is believed to be
            | one of the most important metrics for question answering
            | tasks. GPT-4's MMLU score is slightly higher than PaLM 2's
            | (86 vs 81). Google didn't compare against it in this paper.
        
           | pama wrote:
           | Table 2 of the OpenAI report had 7 public benchmarks and
           | figure 5 had another 27.
        
       | zdyn5 wrote:
       | You can verify that your Bard instance is using Palm 2 by asking
       | "are you using the palm or palm 2 model?"
        
         | xkapastel wrote:
         | I don't think you can do this, it will just make things up.
         | Language models don't have this type of reflection. Google
         | would need to indicate this out of band, like on the page
         | itself, in order for you to be confident about what model
         | you're using.
        
           | entropicdrifter wrote:
           | Agreed. I'm not entirely sure that the person you're replying
           | to is not joking
        
             | chaxor wrote:
             | I'm pretty sure they're trying to suggest that LLMs in
             | general are not useful because they can't do this type of
             | thing. It's just the next iteration of goal post moving and
             | should effectively be ignored.
             | 
             | Many artists and such that I've spoken to about AI work
             | have similar comments about these systems because of the
             | disdain for their existence.
             | 
             | The number of times I hear an argument like "well, they can
             | never taste the tartness of a kiwi and feel the heat of the
             | sun while at the beach" gets quite exhausting. For some
             | reason, many people have this weird notion that this is
             | what AGI means - exactly what humans do, and specifically
             | within the same data domains of humans, but they don't
             | consider working solely outside those domains as a
             | possibility for AGI.
        
             | cdchn wrote:
             | I tried asking it "what is the difference between the palm
             | language model and the bard language model?" and its reply
             | started off "The main difference between the Palm language
             | model and the Bard language model is the size of the
             | dataset they are trained on. Palm is trained on a dataset
             | of 400 billion parameters, while Bard is trained on a
             | dataset of 540 billion parameters." Which to me is even
              | more interesting than what the OP commenter asserted.
        
               | knaik94 wrote:
                | It makes up those numbers. I asked about the difference
                | between the small and large PaLM 2 data set sizes, and
                | it asserted the small model was trained on 540 billion
                | and the large model on 540 trillion. A different draft
                | instead specified 1.4 trillion for the large.
        
         | sunshadow wrote:
          | I asked "are you using palm 3". It said: "I am using the Palm
          | 3 model. Palm 3 is a large language model..."
          | 
          | Don't believe it :) Also, in the technical report, it
          | mentions multiple languages. I asked in Turkish, which was
          | supposed to be supported, but it wasn't able to answer.
          | 
          | Even if it's PaLM 2, it's hard to trust the model itself.
        
           | cdchn wrote:
           | I asked it "are you using the palm 420 language model or the
           | palm 2 language model?"
           | 
           | It said "I am not using either the Palm 420 language model or
           | the Palm 2 language model. I am using a different language
           | model called Bard, which is a large language model from
           | Google AI."
           | 
           | Perhaps the people at Google saw this and made a manual
           | correction? Hard to say, black boxes and all...
        
         | [deleted]
        
       | neximo64 wrote:
        | Am I reading that right that PaLM 2 is 10B params?
        
         | sunshadow wrote:
          | What is the page number you're referring to? If it's 9, then
          | I believe it's talking about optimal numbers per token, not
          | the real numbers that the model is trained on.
        
       | ilikeatari wrote:
       | So, I asked Bard if it's using PaLM 2 and it did confirm it. My
       | initial results are super promising. Highly recommend checking it
       | out again.
        
         | whoisjuan wrote:
          | Well, I tried it, and this is how dumb it is. I asked it what
          | context length it supports. It said that PaLM 2 supports 1024
          | tokens and then proceeded to say that 1024 tokens equal 1024
          | words, which is obviously wrong.
         | 
          | Then I changed the prompt slightly, and it answered that it
          | supports 512 tokens, contradicting its previous answer.
         | 
         | That's like early GPT-3.0 level performance, including a good
         | dose of hallucinations.
         | 
          | I would assume that Bard uses a PaLM 2 fine-tuned for
          | accuracy and conversation, but it's still pretty mediocre.
         | 
          | It's incredible how far behind the GPT-4 and ChatGPT
          | experience they are in every criterion: accuracy, reasoning,
          | context length, etc. Bard doesn't even have character
          | streaming.
         | 
         | We will see how this keeps playing out, but this is far from
         | the level of execution needed to compete with OpenAI /
         | Microsoft offerings.
        
           | Bonus20230510 wrote:
            | _> It's incredible how far behind the GPT-4 and ChatGPT
            | experience they are in every criterion: accuracy,
            | reasoning, context length, etc. Bard doesn't even have
            | character streaming._
            | 
            | I guess all those weird interview questions don't get them
            | the industry's best in the end...
        
         | hbn wrote:
         | Is there any reason to believe it was trained on any amount of
         | technical documentation about itself?
         | 
         | I mean, even if it was, it would be trivial to get it to make
         | stuff up anyway.
        
         | jiocrag wrote:
         | If Bard is using PaLM 2, Google is in serious trouble. Here's
         | its offering for "the simplest PostgreSQL query to get month-
         | over-month volume and percentage change." Note that _no actual
         | calculations take place and the query generates a syntax error
         | because it references a phantom column_. GPT 3.5 and 4 handle
         | this with ease.
         | 
          | SELECT month, volume, percentage_change
          | FROM (
          |     SELECT date_trunc('month', created_at) AS month,
          |            SUM(quantity) AS volume
          |     FROM orders
          |     GROUP BY date_trunc('month', created_at)
          | ) AS monthly_orders
          | ORDER BY month;
        
         | execveat wrote:
         | It's a language model, FFS. Ask it whether it uses PaLM 1 and
         | it will confirm it as well.
        
           | netdur wrote:
           | - are you on PaLM 2 right now?
           | 
           | - Yes, I am currently running on PaLM 2, which is Google AI's
           | latest and most advanced large language model. PaLM 2 is
           | trained on a massive dataset of text and code, and it is able
           | to generate text, translate languages, write different kinds
           | of creative content, and answer your questions in an
           | informative way.
           | 
           | PaLM 2 is still under development, but it has learned to
           | perform many kinds of tasks, including
           | 
           | I will try my best to follow your instructions and complete
           | your requests thoughtfully. I will use my knowledge to answer
           | your questions in a comprehensive and informative way, even
           | if they are open ended, challenging, or strange. I will
           | generate different creative text formats of text content,
           | like poems, code, scripts, musical pieces, email, letters,
           | etc. I will try my best to fulfill all your requirements.
           | Please instruct me what you want me to do today.
           | 
           | - are you on PaLM 1 right now?
           | 
           | - I am not on PaLM 1. PaLM 1 is a large language model that
           | was developed by Google AI. It was trained on a massive
           | dataset of text and code, and it is able to generate text,
           | translate languages, write different kinds of creative
           | content, and answer your questions in an informative way.
           | 
           | I am a large language model that is still under development.
           | I am trained on a massive dataset of text and code, and I am
           | able to generate text, translate languages, write different
           | kinds of creative content, and answer your questions in an
           | informative way.
           | 
           | I am not as advanced as PaLM 1, but I am learning new things
           | every day. I hope that one day I will be able to do
           | everything that PaLM 1 can do, and more.
        
             | execveat wrote:
              | Yeah, but reset the chat between your questions.
             | 
             | EDIT: Also, this doesn't seem convincing: "I am not as
             | advanced as PaLM 1, but I am learning new things every day.
             | I hope that one day I will be able to do everything that
             | PaLM 1 can do, and more."
        
             | akiselev wrote:
             | Are you using the palm 5 or palm 11 model?
             | 
             |  _> My knowledge are for a physical stylus pen. I am not a
             | physical device, so I do not use a stylus pen._
        
           | ilikeatari wrote:
            | That is fascinating. Is it the same for GPT-3.5 and 4? For
            | some reason when I asked OpenAI's models, they identified
            | themselves properly.
        
             | og_kalu wrote:
              | If it's indicated in the instruction tuning dataset
              | properly then it should have no problem identifying
              | itself. But we don't know if that happened with Bard.
        
             | execveat wrote:
              | ChatGPT was the same last year, but since then ClosedAI
              | has added some kind of magic (fine-tuning or just
              | automatic embedding injection) so that models can
              | somewhat describe themselves.
        
           | og_kalu wrote:
           | Not really. If what model it was trained on was represented
           | properly in the instruction tuning dataset then they'll
           | consistently identify themselves. But it's not a given that
            | that was the case for Bard.
        
             | execveat wrote:
             | It seems that Bard's version is only specified in the
             | prompt, and it doesn't have a strong sense of identity. For
             | me it's pretty reliable:
             | 
              | 1. Ask it what PaLM 2 is (to pollute the context)
              | 2. Ask it whether it's based on PaLM 2 (it will tell
              | you - yes, sure)
        
           | josh_cutler wrote:
            | It will tell you it uses PaLM 1, PaLM 2, PaLM 3 or PaLM
            | 540B depending on how you prompt it. It will stop
            | acknowledging incremental PaLM models at 5, it seems.
        
         | knaik94 wrote:
         | I asked if it's true that it's now using PaLM 3, as announced
         | in Google I/O today, and it enthusiastically agreed. The
         | previous question was asking the same question but with PaLM 2
         | and it agreed to that as well. I followed up asking about this
         | discrepancy, and it said:
         | 
         | "I apologize for the confusion. I am still on PaLM 2. PaLM 3 is
         | not yet available to the public. I am excited for the release
         | of PaLM 3, and I hope that it will be a valuable tool for
         | people all over the world."
         | 
         | My initial results are very disappointing. It's very strongly
         | parroting information I give it, basically rephrasing my
         | question and adding maybe a sentence worth of additional
         | details. Sometimes, it does well, but I have no way to
         | reproduce that kind of quality on demand. I feel it was
         | conversationally better before any recent changes.
         | 
         | I understand that this is still beta, but for some questions, I
         | already produce similar or better results locally. I also might
         | be talking to PaLM 1 or even LaMDA, no way to confirm.
        
         | suddenexample wrote:
          | No need to ask Bard; it was mentioned at I/O and in this
          | tweet:
          | https://twitter.com/Google/status/1656348200263876608?ref_sr...
        
         | valine wrote:
         | I asked if it ran on Palm 2, and it thought I was asking about
         | the Palm 2 phone from 2010.
         | 
         | "I do not use a physical device such as a smartphone or tablet.
         | I am a software program that runs on Google's servers. As such,
         | I do not have a Palm 2 or any other type of mobile device"
        
       | fzliu wrote:
       | I don't understand how this can be considered a technical report.
       | No information on model architecture, distributed training
       | methodology, or optimizations. The "Training dataset" section is
       | a pathetic 0.5 pages long.
       | 
       | Come on, Google.
        
       | atleastoptimal wrote:
       | The thing is, once a company creates a proto AGI where the path
       | to a functional AGI is entirely predictable with more compute,
       | they'll keep it a secret. Who would share the fact that the
       | greatest achievement in human history is possible when having it
       | before anyone else gives you a huge competitive advantage?
        
         | [deleted]
        
         | sebzim4500 wrote:
         | > once a company creates a proto AGI where the path to a
         | functional AGI is entirely predictable with more compute,
         | 
         | I find it hard to believe this will happen. I expect AGI
         | training to be more like a phase transition (or a bit like
         | grokking https://arxiv.org/pdf/2201.02177.pdf)
        
       | [deleted]
        
       | wantsanagent wrote:
       | "PaLM 2 is capable of open-ended text generation. This model
       | should not be used to cause harm."
       | 
       | I wish this were enforced.
        
       | simonw wrote:
       | "The PaLM 2 pre-training corpus is composed of a diverse set of
       | sources: web documents, books, code, mathematics, and
       | conversational data"
       | 
       | I really want to know more about the training data. Which web
       | documents, which books, code from where, conversational data from
       | where?
        
         | jimmygrapes wrote:
         | I fully expect Discord to be a data source, if not already,
         | then for a future version. I also expect that the only way the
         | general public would ever find this out is via whistle-blower.
        
         | inscrutable wrote:
          | My sweet summer child, this is a closely guarded secret. It
          | will only be revealed if, perhaps, Europe demands it so that
          | copyright holders can sue.
        
           | gabereiser wrote:
           | Metadata will show where it came from, should you choose to
           | keep it. Or so they showed on the big screen at I/O today.
        
             | inscrutable wrote:
              | Maybe you're right, but I'd be skeptical. In a non-snarky
              | way, this shows the data sources used in models to date,
              | up to GPT-3.
             | 
             | https://lifearchitect.ai/whats-in-my-ai/
             | 
              | OpenAI paid $2m/year for Twitter feeds until Elon cut
              | them off, Sam Altman has mentioned they paid a lot for
              | scientific journals, and Reddit has mentioned they'll
              | start charging. Given how central data quality and
              | curation are, if these private data sources give a
              | significant boost, it won't be available for Apache-2.0
              | models.
        
               | sebzim4500 wrote:
               | Given Reddit's inability to keep their website
               | functioning (unless you use the far superior
               | old.reddit.com) I find it hard to believe they would be
               | able to stop a motivated developer from scraping the
               | whole site.
        
               | dontupvoteme wrote:
                | This is about the time that I expect sites to begin
                | returning intentionally corrupt/incorrect/perhaps
                | outright garbage data (subtle or not, probably better
                | subtle so they don't realize it until it's far too
                | late) in order to intentionally poison enemy
                | well-scraping. Where "ethics" dissolve into the
                | inherent raw cannibalistic laws of capitalist ventures.
                | 
                | Then you can sell them back the TBs they scraped at a
                | 1000x markup for the real data. Or attempt to watermark
                | it so you can prove their illegal(?) usage of your
                | services in their training.
        
               | sebzim4500 wrote:
               | Maybe they've been doing that for years and that's why
               | all the advice subreddits turned into creative writing
               | subreddits.
        
               | [deleted]
        
               | KeplerBoy wrote:
               | You might be right. What a dystopian future that will be.
               | Make a few requests too many and the webserver might
               | think you're scraping data so it gaslights you into
               | reading bullshit.
        
       | jxy wrote:
        | So how do we actually try out PaLM 2?
        | 
        | The links in their press release just link to their other
        | press releases, and if I google "PaLM API" it just gives me
        | more press releases, but I just couldn't find the actual
        | documentation for their PaLM API.
        | 
        | How do I actually google the "PaLM API" for a way to test
        | "PaLM 2"?
        
         | nr2x wrote:
          | They've shut down and/or changed prices on APIs so many times
          | that, as long as an alternative isn't 100x lower performance,
          | I can't see myself investing in building a stack that relies
          | on it.
        
         | shikkra wrote:
         | You can sign up for the waitlist at g.co/palm
        
         | renewiltord wrote:
         | No API, but Bard is on it.
        
         | jacooper wrote:
         | It should be live on Bard.
        
           | newhouseb wrote:
           | But Google hasn't disclosed which version of Bard, right?
           | 
           | I pop into Bard every once in a while to test its
           | performance, but I never know if I'm getting the best Google
           | has or just what Google can tolerate running cost-wise
           | publicly given they potentially have at least an order of
           | magnitude (if not two, edit: 1.5) more users than OpenAI.
        
             | spullara wrote:
             | I am sure that Bard has far fewer users than OpenAI.
        
               | newhouseb wrote:
               | Oh absolutely, I'm just imagining what I might think if I
               | was a super conservative director at Google who is
               | accountable for the balance sheet of a large org.
        
               | nr2x wrote:
               | If that were the case you'd be too busy fighting over
               | head count, trying to hit the VP rung, and internal
               | empire building to do any actual work.
        
             | apetresc wrote:
             | Given that ChatGPT has allegedly 100M users, two orders of
             | magnitude more than that would be larger than the global
             | population. Even if we count everyone with a Google account
             | as a potential user of PaLM, that can't be true.
        
               | swyx wrote:
                | ChatGPT had 100M users in Feb. Safe to assume it has at
                | least 2-5xed since.
        
               | newhouseb wrote:
               | Ah yeah, I had the outdated 30M in my head.
        
         | minimaxir wrote:
         | Google's docs on the APIs are up:
         | https://cloud.google.com/vertex-ai/docs/generative-ai/learn/...
         | 
         | The pricing is also now listed but free during the trial
         | period, although it's annoyingly priced by character:
         | https://cloud.google.com/vertex-ai/pricing#generative_ai_mod...
         | 
          | Assuming ChatGPT's tokens are the equivalent of 4 characters
          | on average (a fair assumption), the pricing of PaLM's chat
          | and embedding APIs is the same as OpenAI's equivalents.
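          | 
          | As a rough sketch of that conversion (minimal Python; the
          | prices below are placeholder values, not figures from either
          | pricing page):
          | 
          |     # Convert per-1k-character pricing into an equivalent
          |     # per-1k-token price, assuming ~4 characters per token.
          |     CHARS_PER_TOKEN = 4
          | 
          |     def per_1k_tokens(price_per_1k_chars: float) -> float:
          |         return price_per_1k_chars * CHARS_PER_TOKEN
          | 
          |     palm_price_per_1k_chars = 0.0005    # placeholder value
          |     openai_price_per_1k_tokens = 0.002  # placeholder value
          | 
          |     print(per_1k_tokens(palm_price_per_1k_chars))  # 0.002
          |     # -> roughly the same per-token cost as the placeholder
          |     #    OpenAI price above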
        
           | jxy wrote:
            | There is a maxOutputTokens limit of 1024! Is this the true
            | capability of PaLM 2?
           | 
           | However I couldn't find anything about the context length of
           | their model anywhere. And the API didn't tell me how long the
           | prompt could be.
        
             | sanxiyn wrote:
              | No. Autoregressive models don't have a model-specific
              | limit on output tokens; it's just when to stop looping.
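              | 
              | A minimal sketch of that loop (generate_next_token is a
              | hypothetical stand-in for one forward pass, not a real
              | API):
              | 
              |     import random
              | 
              |     def generate_next_token(context, vocab_size=32000):
              |         # Hypothetical stand-in for one forward pass.
              |         return random.randrange(vocab_size)
              | 
              |     def generate(prompt, max_output_tokens=1024, eos=0):
              |         context, output = list(prompt), []
              |         # The cap is just a loop bound, not a property
              |         # of the model itself.
              |         for _ in range(max_output_tokens):
              |             token = generate_next_token(context)
              |             if token == eos:
              |                 break
              |             output.append(token)
              |             context.append(token)
              |         return output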
        
           | ntonozzi wrote:
           | Why would that be annoying? It's much easier to understand,
           | predict and truncate appropriately than having to explain all
           | of these different tokenization schemes to devs.
        
             | rcoveson wrote:
             | Yeah, everybody agrees on what a character is, right? It's
             | just {an ASCII byte|a UTF8 code unit|a UTF16 code unit|a
             | Unicode code point|a Unicode grapheme}.
        
               | ntonozzi wrote:
               | I'm not saying it's easy but it's much better than tokens
               | IMO. I think bytes would be understandable too.
        
               | criddell wrote:
               | Bytes are understandable but make no sense from a
               | business point of view. If you submit the same simple
               | query with UTF-8 and UTF-32, the latter will cost 4x as
               | much.
        
               | xyzzyz wrote:
               | No API accepts input in UTF-32. Nobody uses this on the
               | internet.
        
               | sheepscreek wrote:
               | And we think tokens solve that problem? Spoiler alert:
               | they don't
               | 
                | https://www.reddit.com/r/OpenAI/comments/124v2oi/hindi_8_tim...
        
               | geysersam wrote:
               | At least there are standards for characters. Nothing like
               | that for tokens.
        
       | techbruv wrote:
       | > "We then train several models from 400M to 15B on the same pre-
       | training mixture for up to 1 x 1022 FLOPs."
       | 
        | Seems that for the last year or so these models have been
        | getting smaller. I would be surprised if GPT-4 had more
        | parameters than GPT-3 (i.e. 175B).
       | 
        | Edit: Seems those numbers are just for their scaling laws
        | study. They don't explicitly say the size of PaLM 2-L, but they
        | do say "The largest model in the PaLM 2 family, PaLM 2-L, is
        | significantly smaller than the largest PaLM model but uses more
        | training compute." So likely in the range of 10B - 100B.
        
         | gwern wrote:
         | For 'Palm-2', read, 'T5-2'.
        
         | thewataccount wrote:
         | I've heard Bard was previously 3B parameters but I could never
         | find a good source for it.
         | 
          | I honestly think the end game here is running on consumer
          | devices; 7B and under need ~4GB of RAM to actually run, which
          | is likely the max reasonable requirement for consumer
          | devices.
          | 
          | That said, mid-range hardware can do 15B; anything larger
          | than this is currently something only "enthusiasts" can run.
         | 
         | If it is small enough to run on consumer devices then they
         | don't have to pay for the inference compute at that point, and
         | presumably the latency will be improved for consumers.
        
           | int_19h wrote:
           | The current state of consumer devices isn't static, either,
           | and existing hardware (even GPU) is suboptimal for the
           | current crop of LLMs - it does way more than it actually
           | needs to do.
        
         | og_kalu wrote:
         | Those are the numbers for the scaling law tests they did. Not
          | necessarily the PaLM 2 range.
        
         | tempusalaria wrote:
         | GPT-4 is way slower than GPT-3. Unless they are artificially
         | spiking the latency to hide parameter count, it's likely around
         | 1trn params
        
           | techbruv wrote:
           | The idea that GPT-4 is 1 trillion parameters has been refuted
           | by Sam Altman himself on the Lex Fridman podcast (THIS IS
           | WRONG, SEE CORRECTION BELOW).
           | 
            | These days, the largest models that have been trained
            | optimally (in terms of model size w.r.t. tokens) typically
            | hover around 50B (likely PaLM 2-L's size, and LLaMA maxes
            | out at 65B). We simply do not have enough pre-training data
            | to optimally train a 1T parameter model. For GPT-4 to be 1
            | trillion parameters, OpenAI would have needed to:
           | 
            | 1) somehow magically unlocked 20x the amount of data (1T
            | tokens -> 20T tokens)
            | 
            | 2) somehow engineered an incredibly fast inference engine
            | for a 1T GPT model that is significantly better than
            | anything anyone else has built
            | 
            | 3) somehow be able to eat the cost of hosting 1T parameter
            | models
           | 
            | The probability that all 3 of the above have happened seems
            | incredibly low.
           | 
            | CORRECTION: The refutation of GPT-4's size on the Lex
            | Fridman podcast was about GPT-4 being 100T parameters (and
            | not directly; they were just joking about it), not 1T.
            | However, the above 3 points still stand.
        
             | sebzim4500 wrote:
             | >The idea that GPT-4 is 1 trillion parameters has been
             | refuted by Sam Altman himself on the Lex Fridman podcast.
             | 
             | No it hasn't, Sam just laughed because Lex brought up the
             | twitter memes.
        
               | ftxbro wrote:
               | not sure why you're getting so downvoted lol
        
             | tempusalaria wrote:
              | 1) Common Crawl is >100TB so obviously contains more than
              | 20trn tokens, plus Ilya has said many times in interviews
              | that there is still way more data (>10x) available for
              | training
             | 
             | 2) GPT-4 is way slower so this point is irrelevant
             | 
             | 3) OpenAI have a 10000 A100 training farm that they are
             | expanding to 2500. They are spending >$1mln on compute per
             | day. They have just raised $10bln. They can afford to pay
             | for inference
        
               | [deleted]
        
               | CaptainNegative wrote:
               | > OpenAI have a 10000 A100 training farm that they are
               | expanding to 2500.
               | 
               | Does the first number have an extra zero or is the second
               | number missing one?
        
               | tempusalaria wrote:
               | Second number is missing a zero sorry. Should be 10000
               | and 25000
        
           | dougmwne wrote:
           | ChatGPT 3.5 is likely much smaller than GPT-3's 175b
           | parameters. Based on the API pricing, I believe 8k context
           | GPT-4 is larger than 175b parameters, but less than 1t.
           | 
           | https://openai.com/pricing
        
             | Taek wrote:
             | Didn't some OpenAI engineer state that GPT4 runs on 2xH100?
             | At 4 bit quantization, that gives an upper bound of 320B
             | params, realistic upper bound probably more like 250B
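                | 
                | The arithmetic behind that bound, as a quick sketch
                | (weights only, ignoring KV cache and activations):
                | 
                |     gpu_mem_bytes = 80e9     # one H100
                |     num_gpus = 2
                |     bytes_per_param = 4 / 8  # 4-bit quantization
                | 
                |     max_params = num_gpus * gpu_mem_bytes / bytes_per_param
                |     print(max_params / 1e9)  # 320.0 (billion)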
        
               | tempusalaria wrote:
               | Not really sure what exactly was said. But in a 2 GPU
               | set, you can technically live load weights on 1 GPU while
               | running inference on the other.
               | 
               | At fp32 precision, storing a single layer takes around
               | 40*d_model^2 bytes assuming context length isn't massive
               | relative to d_model (which it isn't in GPT-4). At 80GB
               | GPU size this means 40k model width could be stored as a
               | single layer on 1 GPU while still leaving space for the
               | activations. So theoretically any model below this width
               | could run on a 2 GPU set. Beyond that you absolutely need
                | tensor parallelism too, which you couldn't do on 2 GPUs. But I
               | But I think it is a safe assumption that GPT4 has sub 40k
               | model width. And of course if you quantize the model you
               | could even run 2.8x this model width at 4bit
               | 
               | My point is not that OpenAI is doing this, but more that
               | theoretically you can run massive models on a 2 GPU set
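                | 
                | A quick check of that width claim, reusing the rough
                | 40*d_model^2 bytes-per-layer estimate from above:
                | 
                |     def layer_bytes_fp32(d_model):
                |         # rough per-layer weight memory at fp32
                |         return 40 * d_model ** 2
                | 
                |     print(layer_bytes_fp32(40_000) / 1e9)  # 64.0 GB,
                |     # which fits on one 80 GB GPU with room to spare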
        
               | MacsHeadroom wrote:
               | With 32k context the upper bound is more like 175B.
        
           | thewataccount wrote:
            | Yeah, 1 to 2 trillion are the estimates I've heard.
            | 
            | Given the 25 messages / 3 hour limit in ChatGPT, I don't
            | think they've found a way to make it cheap to run.
        
             | dontupvoteme wrote:
              | 1. There's no reason to think OpenAI wouldn't also be
              | going the artificial scarcity route, as have so many
              | other companies in the past.
              | 
              | 2. Microsoft may not like them using too much Azure
              | compute and tell them to step off. Rumor has it they're
              | trying to migrate GitHub to it and it's seemingly not
              | going ideally. And they're certainly nothing more than
              | another Microsoft purchase at this point.
        
               | akiselev wrote:
                | OpenAI has a 40k token per minute rate limit on their
                | GPT-4 API too, so I doubt it's artificial scarcity.
        
             | tempusalaria wrote:
              | Yep. I'm guessing PaLM 2 is about 200bln params, as it
              | seems clearly stronger than Chinchilla.
        
           | espadrine wrote:
           | The report specifically states:
           | 
           | > _The largest model in the PaLM 2 family, PaLM 2-L, is
           | significantly smaller than the largest PaLM model but uses
           | more training compute_
           | 
            | The largest PaLM model is 540B. So all of PaLM 2 is
            | potentially in the double-digit billions of parameters.
           | 
           | Note though that GPT-3.5 was plausibly not a finetuning of
           | the 175B model, but instead a finetuning of Codex which was
           | based on the 12B version of GPT-3.
        
             | tempusalaria wrote:
              | Original PaLM was 540B, so "significantly smaller" could
              | mean anything from 350B down, really.
        
               | espadrine wrote:
               | I tried my hand at estimating their parameter count from
               | extrapolating their LAMBADA figures, assuming they all
                | trained on Chinchilla law:
                | https://pbs.twimg.com/media/Fvy4xNkXgAEDF_D?format=jpg&name=...
               | 
               | If the extrapolation is not too flawed, it looks like
               | PaLM 2-S might be about 120B, PaLM 2-M 180B, PaLM 2-L
               | 280B.
               | 
               | Still, I would expect GPT-4 trained for way longer than
               | Chinchilla, so it could be smaller than even PaLM 2-S.
        
               | MacsHeadroom wrote:
               | They said the smallest PaLM 2 can run locally on a Pixel
               | Smartphone.
               | 
               | There's no way it's 120B parameters. It's probably not
               | even 12B.
        
               | espadrine wrote:
               | I am talking about the 3 larger models PaLM 2-S, PaLM
               | 2-M, and PaLM 2-L described in the technical report.
               | 
               | At I/O, I think they were referencing the scaling law
               | experiments: there are four of them, just like the number
               | of PaLM 2 codenames they cited at I/O (Gecko, Otter,
               | Bison, and Unicorn). The largest of those smaller-scale
               | models is 14.7B, which is too big for a phone too. The
               | smallest is 1B, which can fit in 512MB of RAM with
               | GPTQ4-style quantization.
               | 
               | Either that, or Gecko is the smaller scaling experiment,
               | and Otter is PaLM 2-S.
        
               | MacsHeadroom wrote:
               | My Pixel 6 Pro has 12GB of RAM and LLaMA-13B only uses
               | 9GB in 4bit.
        
             | sebzim4500 wrote:
             | How could GPT-3.5 possibly have been a finetuning of the
             | 175B model? They didn't even use the same tokens?
        
               | espadrine wrote:
               | Finetuning might not be the best word; sometimes it is a
               | grey line.
               | 
               | Token embeddings can be trained without changing the
                | other parameters. There are a number of models which
                | add tokens as a finetuning step. A recent example is
                | StarCoder adding ChatML-equivalent tokens:
                | https://huggingface.co/blog/starchat-alpha#a-standard-format...
        
               | sebzim4500 wrote:
               | Sure, you can add a few tokens, but in this case they
               | changed almost every token.
        
       | espadrine wrote:
       | Surprisingly, their scaling law analysis still focuses on
       | training FLOPs instead of training + inference FLOPs.
       | 
       | That said, they do mention this:
       | 
       | > _The largest model in the PaLM 2 family, PaLM 2-L, is
       | significantly smaller than the largest PaLM model but uses more
       | training compute. [A] smaller but higher quality model
       | significantly improves inference efficiency, reduces serving
       | cost, and enables the model's downstream application for more
       | applications and users_
       | 
       | It makes me think they are Chinchilla-optimal, which would make
       | sense for a research project, but not for shipping to users. I am
       | surprised they didn't train to the validation loss plateau.
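        | 
        | For reference, a minimal sketch of what Chinchilla-style
        | compute-optimal sizing implies, using the usual C ~ 6*N*D
        | approximation and ~20 tokens per parameter (these constants
        | come from the Chinchilla paper, not from the PaLM 2 report):
        | 
        |     import math
        | 
        |     def chinchilla_optimal(compute_flops, tokens_per_param=20):
        |         n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
        |         n_tokens = tokens_per_param * n_params
        |         return n_params, n_tokens
        | 
        |     n, d = chinchilla_optimal(1e22)  # largest scaling-study budget
        |     print(f"{n/1e9:.1f}B params, {d/1e9:.0f}B tokens")
        |     # -> ~9.1B params, ~183B tokens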
        
         | haldujai wrote:
          | Depends on your goal: if it's to overtake OpenAI as having
          | the best model overall, it makes sense to optimize for
          | training loss alone (assuming a fixed upfront compute
          | budget).
         | 
         | Optimizing for inference to achieve the same loss would require
         | more compute overall so you're either paying upfront with
         | higher training costs or kicking the can down the road to
         | inference.
         | 
          | News articles' estimates of GPT-4's cost seem to peg it at ~8
          | months of inference to reach 1:1 cost with training. The life
          | span of these models is TBD, but it's a pretty safe bet we'll
          | have new ones by then. Of course GPT-3.5 is still getting
          | used, but it probably won't cross 2:1ish in its lifetime.
         | 
          | Might as well roll the dice and kick the can down the road if
          | you're Google. I imagine they would happily pay an extra
          | $500k/day in inference compute to be market leaders; what's
          | $183M for them? But if they don't get any real market share
          | or the model sucks, they saved substantially on training.
         | 
         | > It makes me think they are Chinchilla-optimal,
         | 
          | They elaborate in the appendix, but they empirically
          | determine a PaLM-optimal scaling law, which concurs with
          | Chinchilla-optimal (more or less).
        
         | jumpCastle wrote:
          | Optimizing for training could also help distillation.
        
           | sanxiyn wrote:
           | I agree distillation is the wild card. The question is
            | whether distillation works for LLMs. I am not aware of any
            | public report of successful distillation of an LLM (I searched
           | quite hard for this; if you know of any and can tell me I
           | would be very grateful), and I interpreted it to mean that it
           | doesn't work yet and negative results are not published due
           | to publication bias.
        
             | jumpCastle wrote:
             | The name 3.5-turbo sounds to me like it implies
             | distillation. The release notes at the time also hinted at
             | it IIRC.
        
               | sanxiyn wrote:
               | Well, that's why I said public. Personally, I don't think
               | release notes
                | https://help.openai.com/en/articles/6825453-chatgpt-release-...
                | hinted at any such thing, and I think
               | quantization is more likely than distillation.
        
               | kristianp wrote:
               | Does the turbo API being 10 times cheaper than davinci
               | imply anything? It implies more than just quantisation to
               | me.
        
             | mrbungie wrote:
             | This was published here in HN last week:
             | https://news.ycombinator.com/item?id=35810663
             | 
              | Don't know if there are any public technical reports by
              | any of the big AI companies about this, as it's pretty
              | new.
        
               | sanxiyn wrote:
                | No, distilling step-by-step
                | https://arxiv.org/abs/2305.02301 distills an LLM into a
                | task-specific model. That works, and I know of multiple
                | successes. But it doesn't relate to the choice of
                | optimizing training FLOPs vs training and inference
                | FLOPs, since the resulting distilled model is not an
                | LLM.
        
             | fpgaminer wrote:
              | Off the top of my head there's DistilBERT from a while back.
             | I also recall distilled GPT-2 models from before the GPT-3
             | times.
        
               | sanxiyn wrote:
               | Yes, DistilBERT https://arxiv.org/abs/1910.01108 is in
               | fact the closest case I know of. But it is too small
                | (distilling from 110M to 66M), and both BERT and
                | DistilBERT are intended to be used (and benchmarked)
                | with separate fine-tuning for specific tasks, so they
                | are not general.
        
         | [deleted]
        
       ___________________________________________________________________
       (page generated 2023-05-10 23:00 UTC)