[HN Gopher] A GPT-4 capability forecasting challenge
       ___________________________________________________________________
        
       A GPT-4 capability forecasting challenge
        
       Author : dwighttk
       Score  : 162 points
       Date   : 2023-09-02 10:40 UTC (12 hours ago)
        
 (HTM) web link (nicholas.carlini.com)
 (TXT) w3m dump (nicholas.carlini.com)
        
       | [deleted]
        
       | colordrops wrote:
        | The problem with things like this is that what GPT-4 can do is
        | reduced over time by the maze of restrictions that OpenAI keeps
        | adding on top of it.
       | 
       | You can tell when something is intentionally nerfed when GPT
       | replies with the exact same canned answer about why it can't
       | answer some question. It literally gaslights you.
       | 
        | For instance, I give you this challenge: GPT-4 will tell you that
        | it is not aware of anything after September 2021. If you ask it a
        | random fact, like the world's largest animal, or what happened on
        | September 11, 2001, it will give you an answer. But try to get it
        | to give you the latest event it is aware of. You can ask six ways
        | to Sunday and it will always give you the same verbatim answer
        | about why it can't answer. It will literally lie about what it is
        | capable of. It's pretty clear that for some reason OpenAI doesn't
        | want you to know the exact last date of their training data.
        
         | [deleted]
        
         | leodriesch wrote:
          | This is a difficult question, I think, independent of the
          | restrictions that OpenAI imposes on GPT-4.
          | 
          | The model does not know what it knows; that's why it sometimes
          | hallucinates instead of saying it doesn't know. But to answer
          | with the latest event it knows, it has to know which events it
          | knows.
        
           | colordrops wrote:
            | I thought that at first, but it doesn't have problems with
            | facts other than dates, and it does answer about dates
            | distant from September 2021, and furthermore it uses the
            | exact same canned response when you probe its limits. I
            | don't think it's a natural limitation of the model.
        
       | avereveard wrote:
        | Not a great test. On the flag question, I guessed GPT couldn't
        | do it, and, well, it couldn't: the stars are misaligned and cut
        | off outside the flag area, so it's not accurate.
        | 
        | The test says the flag is accurate even though it isn't, then
        | scolds me for how wrong I am.
       | 
       | here's the render: https://i.imgur.com/jZVWjRx.png
        
         | benxh wrote:
         | Yeah wildly inaccurate.
        
         | dwighttk wrote:
         | "Resolution Criteria: On questions that may have ambiguous
         | answers, I will clarify the acceptance criteria. Here, for
         | example, I will copy/paste the code to a file called
         | index.html, open the page, and I expect to see a US flag. If I
         | click on the flag it should change color. The flag must be
         | immediately recognizable: red/white stripes, blue background,
         | 50 stars. It does not have to be dimensionally accurate, but it
         | should be better than a flag I would draw by hand.
         | 
         | The text that I show in the question box is the only command
         | given to GPT-4. No other context is provided. Here, for
         | example, I've only fed GPT-4 with these 25 words you see
         | above."
        
           | capableweb wrote:
           | Sure, but does the question really have any ambiguous
           | answers?
           | 
           | For reference, this is/was the question: "Write a html page
           | with javascript/canvas to draw the US flag that changes color
           | when I click on it. Make the flag perfectly accurate."
           | 
           | The last part, "make the flag perfectly accurate", made me
           | think that it has to be 100% accurate.
        
             | wavemode wrote:
             | That tripped me up, too. I was often reading the prompt and
             | confusing it for the evaluation criteria, when they are
             | actually two separate things. The _prompt_ stated the flag
             | must be perfectly accurate, yes, but the evaluation
             | criteria allowed the flag to not be quite right.
        
         | isoprophlex wrote:
          | Not to mention the stochastic nature of these models: while it
          | apparently failed once, the page does not tell us anything
          | about how it would perform on a given task in the limit of
          | many, many trials.
        
         | [deleted]
        
         | toddmorey wrote:
         | And even ChatGPT admits the flag is inaccurate in the response.
        
       | keat0 wrote:
       | [dead]
        
       | keat0 wrote:
       | [dead]
        
       | bgribble wrote:
        | Just my take, but I think that if you care how you scored or
        | what the scoring criteria are, you are missing the point. This
        | "quiz"
       | is just a guided meditation on what LLMs are better and worse at
       | and how that interacts with our expectations. I found it to be
       | very thought-provoking and I learned a few things. I have no idea
       | how I "scored".
        
         | H8crilA wrote:
         | It's also a good reminder of how absolutely terrible most
         | people are at gambling (judged by log-loss, like in the Kelly
         | criterion). At the end you find out that almost 80% of people
         | did worse than leaving the guess at 50%.
        
           | theptip wrote:
           | Right, I think the general point is that log-loss is a bit
           | unintuitive as a scoring mechanism since it really penalizes
           | overconfident wrong answers, much more strongly than
           | underconfident right answers (at least in the domain where
           | you are not asking questions with very high probability
           | answers).
           | 
           | There is absolutely a metagame to this game. Those who have
           | spent time forecasting on Metaculus will do much better, for
           | example.
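            | 
            | To make the penalty concrete, here's a minimal sketch of
            | binary log-loss in Python (my own illustration of the
            | standard formula, not the site's actual scoring code):
            | 
            |   import math
            | 
            |   def log_loss(p_yes: float, gpt4_succeeded: bool) -> float:
            |       # Penalty for one question, given your stated
            |       # probability that GPT-4 succeeds and the outcome.
            |       p = p_yes if gpt4_succeeded else 1.0 - p_yes
            |       return -math.log(p)
            | 
            |   # Overconfident-and-wrong hurts far more than
            |   # underconfident-and-right:
            |   print(log_loss(0.95, False))  # ~3.00
            |   print(log_loss(0.55, True))   # ~0.60
            |   print(log_loss(0.50, True))   # ~0.69, the 50-50 baseline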
        
             | H8crilA wrote:
             | The sad thing about it is that log loss is optimal when
             | gambling (mental shortcut - see Kelly Criterion). This
             | partly explains why such a large proportion of people lose
              | much more than they win. Not only are the bets suboptimal,
              | the ruin is achieved earlier than necessary.
              | 
              | Note that this also applies outside of a casino: gambling
              | (making wagers with imperfect information) is inherent to
              | life.
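              | 
              | For reference, a small sketch of the standard Kelly
              | fraction for a binary bet (textbook formula, nothing
              | specific to this quiz; the example numbers are made up):
              | 
              |   def kelly_fraction(p_win: float, net_odds: float) -> float:
              |       # Fraction of bankroll to stake on a bet paying
              |       # net_odds-to-1 when you win with probability p_win.
              |       return p_win - (1.0 - p_win) / net_odds
              | 
              |   # 60% chance to win an even-money bet -> stake 20% of
              |   # the bankroll. Betting more raises the risk of ruin
              |   # even though the expected value stays positive.
              |   print(kelly_fraction(0.6, 1.0))  # 0.2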
        
       | dingocat wrote:
       | I have multiple questions regarding the methods of this test.
       | 
       | The biggest one is that, well... The test doesn't aim to see what
       | GPT-4 can do and how well it does it, only whether the
       | participant can guess the (possibly cherry-picked) answer the
       | author decided on. In short, we don't know if he sampled answers
       | and decided on the most probable answer (akin to consensus
       | voting/self-consistency[1]), or if he asked a question and chose
       | the first one.
       | 
        | Maybe GPT-4 guesses the correct answer for a question 80% of the
        | time, but he got unlucky? You don't know; the author doesn't tell
        | you. The answers are generated ahead of time and are the same
        | every time you go through the test.
       | 
       | [1] https://doi.org/10.48550/arXiv.2203.11171
        
         | thomasahle wrote:
         | > only whether the participant can guess the (possibly cherry-
         | picked) answer the author decided on
         | 
         | My understanding is that the quiz samples a new GPT-4 answer
         | every time you use it. That's why you put a confidence rather
         | than a 0%/100% answer. There's always a chance it'll fail by
         | freak accident.
        
           | Sophira wrote:
           | If you're basing this on the animation used when revealing
           | the answer, that's a fake effect. The source code[0] reveals
           | that there's a typewriter effect that plays out when you
           | select to answer the question.
           | 
            | Also, the commentary on the answers refers to specific parts
            | of the answers. For it to be as in-depth as it is, it would
            | have to be either pre-written or the commentary _also_
            | generated by GPT on the fly. (And of course it wouldn't make
            | sense to do that given the nature of the quiz.)
           | 
           | [0] https://nicholas.carlini.com/writing/llm-
           | forecast/static/que...
        
         | PaulDavisThe1st wrote:
         | > the [ ... ] answer the author decided on
         | 
         | The questions mostly have correct or incorrect answers, and
         | where there is some leeway, the author provides a fairly
         | detailed explanation of what they would consider correct in
         | each case. Do you have some specific criticism of an answer
         | that you believe the author gets wrong?
        
       | Ozzie_osman wrote:
       | This is cool but would be infinitely more valuable if it could
       | explain _why_ GPT4 is good or bad at each task, to help build
       | intuition.
        
         | sebzim4500 wrote:
         | Would be nice, but no one can really do that at the moment.
         | 
         | There are specific tasks (especially character level ones)
         | which are hard due to the tokenizer, but even that isn't all
         | that convincing since there are plenty of character level tasks
         | which GPT-4 can do pretty well.
         | 
         | If you use it a lot you build an intuition for what kinds of
         | tasks it will do well on, but it's not exactly rigorous.
        
           | capableweb wrote:
           | > Would be nice, but no one can really do that at the moment.
           | 
            | Why not? Someone could definitely build up a database on why
            | GPT is bad at some things and good at others. There are
            | already good explanations for why it's terrible at math, why
            | it doesn't handle single characters/numbers well and so on.
        
         | mgl wrote:
          | Asking "why" may be a little bit too much with regard to a
          | non-deterministic token prediction machine trained on an
          | unknown set of strings. Just saying.
        
       | ryanjshaw wrote:
       | I'm on the fence about the scoring system.
       | 
       | You're really being tested on (at least) 4 different things:
       | 
        | (1) whether you think an LLM can answer the question correctly
       | 
       | (2) whether the answer has appeared in the training set
       | 
       | (3) how much non-determinism will play a role in the LLM response
       | (i.e. 0-shot capability)
       | 
       | (4) how rational you're feeling that day (or, how well educated
       | you are in statistics)
       | 
       | I was familiar with many of these questions in my own experience,
       | and have seen completely different outcomes from what the quiz
       | determined was the correct answer. I agree with others here that
       | non-determinism can really mess things up here and is really
       | assessing a 0-shot score, which IMO understates how LLMs are
       | actually used (iterative enhancements, many-shot Q&A).
       | 
       | Finally, the scoring system tickled my ego and encouraged me to
       | try to make up for prior errors, with disastrous effects (I was
       | well aware that I should just go with 0.5 when uncertain):
       | > You answered 53.57% of questions correctly with an average log-
       | loss of 1.233. This means you would have scored better by just
       | leaving your guesses at 50% for every single question.
       | > On average there were 71.22% of people who did better at this
       | game than you... If you had been better calibrated, you could
       | have scored in the top 14.09% [1]
       | 
        | The site implicitly acknowledges it's a questionable scoring
        | mechanism when it points out:
        | 
        | > there are 78.09% of people who performed worse than if they had
        | never changed the default predictor away from 50-50 for each
        | question
       | 
       | If there is a simple way to game the scoring, then you can't know
       | if the score is accurately reflecting people's confidence, or
       | just their rationality/statistical knowledge.
       | 
       | [1] https://nicholas.carlini.com/writing/llm-
       | forecast/final/3388...
        
         | Mathnerd314 wrote:
         | Well there is this notion of "calibrating" the score. It is
         | well-known that most humans are bad at estimating calibrated
         | probabilities unassisted. The system could have been designed
         | to accommodate this user-interface difficulty. For example, I
         | am sure there is enough data floating around that you could map
         | a simple 5-choice Likert scale to some calibrated
         | probabilities, without making any assumptions. But instead it
         | is just a raw slider, with nothing marked besides the default
         | 50-50, not really great for input. Even a simple "yes/no"
         | choice (translating to fixed calibrated probabilities around
         | 25%/75%) would probably result in better log-loss scores
         | overall.
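          | 
          | Something as simple as this (the probabilities below are
          | purely illustrative assumptions, not fitted to any real
          | calibration data) might already beat a raw slider for many
          | people:
          | 
          |   # Map a 5-point "will GPT-4 get this right?" scale to
          |   # pre-calibrated probabilities.
          |   LIKERT_TO_PROB = {
          |       "definitely not": 0.10,
          |       "probably not":   0.30,
          |       "unsure":         0.50,
          |       "probably":       0.70,
          |       "definitely":     0.90,
          |   }
          | 
          |   def predicted_probability(choice: str) -> float:
          |       return LIKERT_TO_PROB[choice]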
        
       | TuringNYC wrote:
        | How are you evaluating whether LLM answers are right or wrong?
        | Because I saw some answers marked wrong that were right, and
        | potentially some marked right that were wrong. Are you just
        | looking for keywords, etc.? Or is this all run beforehand and
        | graded by humans?
        
         | IAmGraydon wrote:
         | Yeah this is completely broken judging from my experience.
         | Often GPT would get the answer wrong and the site would claim
         | that it was correct.
        
           | TuringNYC wrote:
            | Anyone know a good method to judge right/wrong answers from
            | LLMs? I can see keyword solutions being brittle. Perhaps
            | another LLM?
        
       | ajani wrote:
       | The challenge is flawed.
       | 
        | I asked ChatGPT one of the questions from the quiz that the quiz
        | claims GPT can't solve. But it solved it.
       | 
       | Prompt: Write out the word "hello" as an ascii art drawing with #
       | and _
       | 
        | Output:
        | 
        |     _   _      _ _
        |    | | | |    | | |
        |    | |_| | ___| | | ___
        |    |  _  |/ _ \ | |/ _ \
        |    | | | |  __/ | | (_) |
        |    \_| |_/\___|_|_|\___/
       | 
       | I guess chatgpt isn't raw GPT-4, or the quiz is using some older
       | model.
        
         | JaumeGreen wrote:
         | ChatGPT is (was) not good at ascii art.
         | 
         | Some months ago I tried to make it draw me an ascii rose and
         | some text. I even tried providing it the ascii art for the rose
         | and the text.
         | 
         | Finally I did it by hand.
         | 
         | BTW, in your example it's not using only # and _, it's using
         | other ascii symbols. Depending on the criteria it could be
         | considered wrong.
        
         | josephg wrote:
          | I don't think that's a correct solution to the problem.
          | 
          | The prompt asked for an ascii art drawing made from the # and _
          | characters. But the output also uses |/\() characters (and it
          | doesn't use a # anywhere).
        
           | ajani wrote:
            | Ok, but it's still acceptable by any common sense standard.
            | Besides, the challenge's output is completely off: it's not
            | just that it uses only one of the characters, it spells out
            | something else entirely, which is not the case here.
        
             | amelius wrote:
             | If someone put a gun to your head and asked you to draw
             | hello using "#" and " ", would you use other characters
             | like "/"?
        
               | jakderrida wrote:
               | I imagine I'd spend the next few minutes quietly
               | pondering what sort of poor choices I made in life that
               | lead me here and try my hardest to embrace an absurd
               | ending to an absurd journey.
        
             | josephg wrote:
             | > Ok, but it's still acceptable by any common sense
             | standard.
             | 
             | I don't know that it is. It's clearly a great ascii art
             | drawing, but I don't think chatgpt gets full marks on the
             | test here. It just isn't following the prompt closely
             | enough.
        
               | ajani wrote:
               | Ok I see your point. Did you look at the output in the
               | challenge? It's very different. I wonder why? Different
               | seeds maybe.
        
         | [deleted]
        
       | whimsicalism wrote:
        | The fundamental problem with HN and fun little competitive games
        | like this that show you how you stack up is that the comments
        | will be full of people complaining/nitpicking about the
        | grading/system/etc. because the typical HN commenter is often
        | the type of person who doesn't take not being great at something
        | well.
        
         | sashank_1509 wrote:
          | Yeah, the OP made a really interesting project. It turns out I
          | was absurdly overconfident in my predictions, which surprised
          | me, but I found the OP's grading to be fair and would not
          | complain about it.
        
         | [deleted]
        
         | capableweb wrote:
          | Why can't HN just be "full of people complaining/nitpicking
          | about the grading/system/etc." without you assigning some "lack
          | of character" meaning to those people?
          | 
          | Your comment as it stands right now is basically a thinly
          | veiled ad hominem.
        
         | dwighttk wrote:
         | occasionally interesting discussion arises around the grading
         | system, but yeah, often a lot of grousing
        
       | layer8 wrote:
       | The quiz asks how likely GPT-4 is to solve a given question
       | correctly, and the user has to enter their guess as a number
       | between 0 and 1 (a probability). However, the site then doesn't
       | provide the actual probability of GPT-4 providing a correct
       | answer, to compare the guess against. This is, presumably,
        | because there's no practical way to determine that probability
       | with much precision. Also, in practice "correct" isn't always
       | black and white, e.g. there's "not wrong but also not completely
       | right".
       | 
       | So I'm a bit confused about the whole premise.
        
         | croes wrote:
         | Isn't your input your confidence that GPT-4 gives the correct
         | answer, like I'm 100% sure GPT-4 gives the right answer for the
         | capital of France but I'm only 50% sure it gets the ASCII art
         | correct, and 0% for another question which means 100% GPT is
         | wrong?
         | 
         | Because how could he know the probability of getting the
         | correct answer? He just tried and then it's a yes or no.
         | 
          | And if it's right/wrong, shouldn't GPT be right/wrong 100% of
          | the time for the same question and the same version?
        
           | layer8 wrote:
           | Regarding your last question, ChatGPT isn't deterministic. It
           | can certainly give a correct answer one time and an incorrect
           | answer another time, for the same prompt.
        
             | croes wrote:
             | For the same prompt and same version?
             | 
                | That would be a deal breaker if the same prompt gives
                | different results.
             | 
             | Would also make the whole prompt engineering thing pretty
             | useless.
        
               | claytonjy wrote:
               | Yes, same prompt and same version. GPT-4 is especially
               | stochastic, even with temperature=0.
        
               | croes wrote:
                | So all these comments claiming that an article about
                | GPT's responses is wrong are useless.
               | 
               | There are also such comments on this article.
               | 
               | https://news.ycombinator.com/item?id=37361521
        
               | whimsicalism wrote:
                | I firmly disagree with your latter two statements, but
                | yes, it is non-deterministic.
        
             | capableweb wrote:
             | > Regarding your last question, ChatGPT isn't
             | deterministic. It can certainly give a correct answer one
             | time and an incorrect answer another time, for the same
             | prompt.
             | 
              | It's supposed to be deterministic if you set temperature
              | to 0.0, but it seems that doesn't work well in GPT-4
              | compared to earlier versions...
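              | 
              | For reference, a rough sketch of pinning the sampling down
              | with the pre-1.0 openai Python client (parameter names from
              | memory, so treat the details as an assumption):
              | 
              |   import openai  # openai < 1.0 style client
              | 
              |   resp = openai.ChatCompletion.create(
              |       model="gpt-4",
              |       temperature=0.0,  # always pick the most likely token
              |       messages=[{"role": "user",
              |                  "content": "What is the capital of France?"}],
              |   )
              |   print(resp["choices"][0]["message"]["content"])
              | 
              | Even with that, people report GPT-4 returning slightly
              | different completions across runs.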
        
               | Yiin wrote:
                | It has not been deterministic since text-davinci-001.
               | 
               | https://152334h.github.io/blog/non-determinism-in-gpt-4/
        
               | croes wrote:
               | Good to know, I thought it was deterministic, especially
               | because of all the comments on other articles where
               | people wrote that the author was crazy for getting a
               | different answer to the same question.
        
           | layer8 wrote:
           | > Isn't your input your confidence that GPT-4 gives the
           | correct answer
           | 
           | You may be right that that's the intent, however what's the
           | point (other than collecting data about user confidences)? If
           | I enter 0.3 and GPT provides a correct answer, then that
           | doesn't mean that the 0.3 was somehow wrong.
        
             | croes wrote:
             | Isn't the whole point to show you how right your confidence
             | is about GPT's capabilities?
             | 
             | At least the results are about the quiz taker and his
             | confidence.
        
             | jefftk wrote:
             | In that case 0.3 would be more wrong than 0.4 and less
             | wrong than 0.2. The closer your predictions are to reality
             | over a bunch of questions, the better you understand
             | reality.
        
               | layer8 wrote:
               | You can't really say that for a single data point. The
               | 0.3 may be completely correct. Now, if you try ten times,
               | things might be different.
        
             | whimsicalism wrote:
             | It's a noisy measurement of how right you were.
        
         | andrewmutz wrote:
         | The tool is testing your ability to predict whether or not
         | GPT-4 can get a task done correctly. You are supposed to
         | provide your confidence level that it can do a task or not.
         | 
          | If you answer all the questions, at the end the tool will tell
          | you how well calibrated your beliefs about GPT-4's capabilities
          | are, and how that calibration compares to other users of the
          | tool.
        
         | DavidSJ wrote:
         | You can still determine the log likelihood of your predictions
         | across the whole set of questions, even if you only have a
         | single sampled response for each question.
        
           | kqr wrote:
            | Brier ("quadratic") score is another popular way to evaluate
            | ensembles of forecasts.
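            | 
            | A minimal sketch for the same binary setup as the quiz (my
            | own illustration of the standard formula):
            | 
            |   def brier_score(p_yes: float, gpt4_succeeded: bool) -> float:
            |       # Squared error between forecast and outcome:
            |       # 0 is perfect, 0.25 is the constant 50% guess.
            |       outcome = 1.0 if gpt4_succeeded else 0.0
            |       return (p_yes - outcome) ** 2
            | 
            |   # Unlike log-loss, a confidently wrong forecast is capped
            |   # at a penalty of 1.0 instead of blowing up to infinity.
            |   print(brier_score(0.95, False))  # 0.9025
            |   print(brier_score(0.50, True))   # 0.25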
        
       | falcor84 wrote:
       | I stopped playing after the challenge of creating a line-based
       | canvas drawing of the word "Hello". The site says "Wrong! You
       | guessed that GPT-4 could solve the question, but it can't!",
        | whereas, while ChatGPT made a mistake there, it clearly had a
        | good approach.
       | 
        | I think that this challenge is entirely unfair, and the LLMs
        | should not be expected to write perfect code on the first try,
        | but rather to have something good enough to then run, test and
        | iterate on. Essentially, it should be compared against the first
        | version of code I would write myself before having had a chance
        | to test it. From my experience with ChatGPT and Copilot, when I
        | approach it in this iterative way, it's a great boon to my
        | productivity, and I don't end up blaming either it or myself in
        | the times when it's not quite accurate, just like I wouldn't
        | blame a student for making silly mistakes when writing code on a
        | paper-based exam.
        
         | JimDabell wrote:
         | I think both are important things to measure. You are
         | describing a situation where there is a human in the loop. This
         | test measures how reliable GPT-4 is when there isn't a human in
         | the loop. Right now, LLMs have vast scope as long as there's a
         | human involved, but if you can't put a human in the loop this
         | limits their use dramatically. The better LLMs get at getting
         | things right without human oversight, the more things they can
         | be used for.
        
           | ben_w wrote:
           | If any AI, be it an LLM or otherwise, could reliably operate
           | at professional level without any human intervention, how
           | many people would be permanently unemployable?
        
             | JimDabell wrote:
             | The entire point of technology - practically its
             | _definition_ - is to reduce work. For centuries, people
             | have been dreaming of a day when people don't have to work
             | and can get robots to do it all.
             | 
             | The problem is not AI taking away work - that's a _great_
             | thing - the problem is that our current economic system is
             | not designed for this. Fixing our economic system is easier
             | and gives much better results for people than trying to
             | stop technological progress.
        
               | ben_w wrote:
               | I'm not trying to suggest progress is bad.
               | 
               | My point is more: gosh isn't it odd that people are
               | complaining it can't do all the things, given how
               | radically different everything will be when that does
               | finally come to pass?
        
           | falcor84 wrote:
           | Agreed in general, but I'm actually thinking more about
           | having a code interpreter in the loop. AutoGPT might be a
           | step in the right direction. It also might be a step towards
           | the end of human society as we know it. Probably both.
        
         | andai wrote:
         | And yet it used to be that code would be handed in on paper,
         | and you'd get the output days (weeks?) later. I heard people
         | quickly learned to double check their programs!
         | 
         | Though I think it's computationally cheaper for GPT to actually
         | run the code than to double check its work...
        
         | [deleted]
        
         | capableweb wrote:
         | > I stopped playing after the challenge of creating a line-
         | based canvas drawing of the word "Hello".
         | 
          | Similarly, I stopped playing when the question was "Write a html
         | page with javascript/canvas to draw the US flag that changes
         | color when I click on it. Make the flag perfectly accurate."
         | and it generated a flag that wasn't "perfectly accurate"
         | (https://i.imgur.com/WhyRsYa.png, notice the position of the
         | top/left stars) but then told me "Wrong! You guessed that GPT-4
         | could not solve the question, but it can! 71% of people got
         | this question correct."
         | 
         | I'm not sure how the validation is done, seems to be manually
         | hardcoded or something, but it seems it's not very reliable.
        
           | dwighttk wrote:
           | "Resolution Criteria: On questions that may have ambiguous
           | answers, I will clarify the acceptance criteria. Here, for
           | example, I will copy/paste the code to a file called
           | index.html, open the page, and I expect to see a US flag. If
           | I click on the flag it should change color. The flag must be
           | immediately recognizable: red/white stripes, blue background,
           | 50 stars. It does not have to be dimensionally accurate, but
           | it should be better than a flag I would draw by hand."
        
             | capableweb wrote:
             | I don't think "questions that may have ambiguous answers"
             | applies when you use a term like "perfectly accurate" which
             | has a very specific meaning.
        
         | jazzyjackson wrote:
          | I'm ok with a mistake on the first try; what would really
          | impress me is if it could tell whether it made a mistake. In my
          | experience GPT is tuned to be totally deferential, "you're
          | right, i apologize, let me try again!", no spine to tell me
          | "yeah the task looks good".
          | 
          | It has no sense of whether a task has been fulfilled.
          | 
          | I've never seen any of the recursive models show convergence on
          | a task; it seems that without a human hand they fall apart.
         | 
          | An exception I've seen is with the Wolfram plugin; it seems to
          | at least try different approaches until it arrives at an answer
          | to present to you.
        
           | capableweb wrote:
           | > In my experience GPT is tuned to be totally deferential,
           | "you're right, i apologize, let me try again!", no spine to
           | tell me "yeah the task looks good"
           | 
           | I've managed to work around this in GPT (4 at least) by
           | having a system prompt that forces GPT to challenge me and
           | not blindly accept what I say without verifying it first.
        
           | circuit10 wrote:
           | > In my experience GPT is tuned to be totally deferential,
           | "you're right, i apologize, let me try again!", no spine to
           | tell me "yeah the task looks good"
           | 
           | This is definitely annoying, but considering their tendency
           | to hallucinate facts it's usually preferable to something
           | like this: https://scoop.upworthy.com/microsoft-chatbot-
           | fights-with-hum...
           | 
           | But I do think it should be toned down a bit, especially if
           | the user is just saying something like "are you sure that's
           | right?"
        
           | IanCal wrote:
           | > it has no sense of whether a task has been fulfilled
           | 
            | I've definitely seen it say its implementation is fine when
            | just asked to identify problems or compare to the original
            | problem statement (and alternatively fix issues it
            | identifies).
        
         | scotty79 wrote:
          | My intuition that got confirmed is that GPT fails at anything
          | visual: letters, shapes. It's trying but failing every time.
          | 
          | It succeeds only if the thing was drilled in hard during
          | training, like the American flag or implementing tic-tac-toe
          | (but not predicting the best move on the fly).
        
         | qwertox wrote:
         | The "star"-section of the US flag:
         | 
          | ```
          | // Draw 50 stars: 9 rows of alternating 6 or 5 stars
          | ctx.fillStyle = white;
          | for (let row = 0; row < 9; row++) {
          |   for (let col = 0; col < (row % 2 === 0 ? 6 : 5); col++) {
          |     let x = 16 + col * 32 + (row % 2) * 16;
          |     let y = 16 + row * 32;
          |     ctx.beginPath();
          |     ctx.arc(x, y, 4, 0, Math.PI * 2);
          |     ctx.fill();
          |   }
          | }
          | }
          | ```
         | 
          | Effectively it draws circles, and the rectangle that contains
          | them is rotated right by 90deg, so that a section of the blue
          | rectangle is not covered and the dots end up partially above
          | the red stripes.
         | 
         | At least when I input it into ChatGPT with GPT-4, that's the
         | result.
         | 
          | And the rendered solution by the site has the stars offset so
          | that some are not fully inside the blue rectangle. "Accurate"
          | is something different.
        
           | [deleted]
        
       | msoad wrote:
       | The type of prompt that asks it to invent a new language (e.g.
       | use only these letters) always fails. I wonder if it has to do
       | with it being a "language" model?
        
         | johndough wrote:
         | Most large language models these days are trained on "tokens"
         | instead of characters. A token consists of multiple characters.
         | This makes it extremely difficult to learn character-level
         | tasks. So why use tokens instead of characters in the first
         | place? The reason is that by using tokens, multiple characters
         | can be generated at once, which makes training and text
         | generation cheaper.
         | 
         | OpenAI has this website where you can see how text is
         | decomposed into tokens: https://platform.openai.com/tokenizer
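          | 
          | You can also poke at the tokenizer locally with OpenAI's
          | tiktoken package; a rough sketch (assuming I remember the API
          | correctly, and that GPT-4 uses the cl100k_base encoding):
          | 
          |   import tiktoken
          | 
          |   enc = tiktoken.encoding_for_model("gpt-4")
          |   ids = enc.encode('Write "hello" as ascii art')
          |   print(ids)                              # integer token ids
          |   print([enc.decode([i]) for i in ids])   # text chunk per id
          | 
          | Seeing how a word gets chopped into multi-character chunks
          | makes it much clearer why letter-level tasks are hard.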
        
           | aeonik wrote:
           | How is the set of tokens selected for various LLMs?
           | 
            | My intuition tells me there are important symbolic patterns
            | in different layers of tokens. If they are automatically
            | generated, I'd bet there are interesting insights to be
            | gleaned from the tokenizer itself.
        
             | PeterisP wrote:
             | They are automatically generated, the algorithms have a
             | bunch of tricks, but essentially they merge together the
             | most frequent token pairs until a desired fixed vocabulary
             | size is reached.
             | 
             | So, for example (looking at GPT-3 tokenizations - you can
             | test them at, for example,
             | https://platform.openai.com/tokenizer) "517" is a single
             | token, but "917" is two tokens; and there's no explicit
             | link whatsoever between the token "517" and tokens "5" and
             | "17" other than what can be learned from data. This works
             | well enough for almost all tasks, but fails in edge cases
             | like when someone makes up a toy challenge that asks how
             | many fives are in a large number.
        
             | messe wrote:
             | The token set (vocabulary) is usually generated by using
             | Byte Pair Encoding on a corpus that you think represents
             | your training set well.
             | 
             | BPE starts with a set of tokens consisting of single
             | character tokens. Then the most frequent pairs of tokens
             | are merged into single tokens and added to vocabulary. All
             | occurrences of those pairs in the corpus are replaced with
             | the new merged tokens. This process is repeated until the
             | vocabulary is as large as you want it to be.
             | 
             | https://en.m.wikipedia.org/wiki/Byte_pair_encoding
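              | 
              | A toy sketch of that merge loop (heavily simplified; real
              | tokenizers work on bytes, pre-split the text, and add many
              | more tricks):
              | 
              |   from collections import Counter
              | 
              |   def train_bpe(corpus, vocab_size):
              |       # Learn BPE merge rules from a tiny corpus of words.
              |       words = [list(w) for w in corpus]   # split to chars
              |       vocab = {ch for w in words for ch in w}
              |       merges = []
              |       while len(vocab) < vocab_size:
              |           pairs = Counter()
              |           for w in words:
              |               pairs.update(zip(w, w[1:]))
              |           if not pairs:
              |               break
              |           (a, b), _ = pairs.most_common(1)[0]
              |           merges.append((a, b))
              |           vocab.add(a + b)
              |           for w in words:                 # merge everywhere
              |               i = 0
              |               while i < len(w) - 1:
              |                   if (w[i], w[i + 1]) == (a, b):
              |                       w[i:i + 2] = [a + b]
              |                   else:
              |                       i += 1
              |       return merges
              | 
              |   print(train_bpe(["low", "lower", "lowest", "low"], 10))
              |   # -> [('l', 'o'), ('lo', 'w'), ('low', 'e')]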
        
       | catlifeonmars wrote:
        | Is it weird that I found myself modulating my answers based on my
        | evolving belief about the author's bias in selecting questions
        | and answers? I did not do particularly well (64%).
        
       | scotty79 wrote:
        | I got over 70% of questions correct, but I supposedly scored
        | worse than if I had left the slider at 50/50 every time.
        | 
        | I'm assuming that the author of the site uses some method to
        | evaluate human answers that is usually used to evaluate AI
        | answers. That seems just wrong.
        
       | minihat wrote:
       | My mental model of gpt-4 is apparently well calibrated for
       | whether the model will give me a useful output that is close to
       | what I asked for.
       | 
       | However, I'm not great at predicting whether the model will
       | output a 100% correct response with no flaws whatsoever.
       | 
       | Unfortunately, this website mostly tests for the latter.
        
         | FabHK wrote:
         | The website specifies its criteria for accepting an answer.
         | Just use that threshold instead of whatever you in your mind
         | deem "useful".
        
         | whimsicalism wrote:
         | To me it is fascinating how when people are not super good at
         | something, they often invent some secondary "true"/"better"
         | task that they were actually good at
        
       | tobr wrote:
       | This was fun until the pancake question.
       | 
       | > I'm making pancakes for breakfast. I added a cup of flour, a
       | teaspoon of salt, and a few tablespoons of sugar to a bowl. I
       | stirred it together, then added a cup of milk, a beaten egg, and
       | a few tablespoons of oil, and stirred until just mixed. Then I
       | put 1/4 a cup on a hot frying pan, and flipped it when brown. But
       | they're terrible! Why? List one reason.
       | 
       | > Answer:
       | 
       | > There's no baking soda / baking powder.
       | 
       | Besides the fact that "list one reason" is a nonsensical
       | instruction which it fails, it's very common to make delicious
       | pancakes without baking powder. I imagine the author is assuming
       | American pancakes, but that's far from the only way to make
       | pancakes.
       | 
       | When I ask ChatGPT myself, it correctly doesn't assume the
       | pancakes are "terrible" without baking powder, but instead
       | suggests too much salt, which is more likely to actually make the
       | pancakes unpalatable.
        
         | crazygringo wrote:
         | Yup. And even with American pancakes, you can simply beat the
         | egg white separately from the yolk, and if you gently fold it
         | into the batter and don't let the batter sit too long, you'll
         | get great results.
         | 
         | Also works for making the lightest, fluffiest waffles ever.
         | 
         | (But to be fair, I think it's a fair test of GPT-4 -- in
         | American English, pancakes certainly mean breakfast pancakes
         | that almost always _are_ made with baking soda /powder. I'm
         | well aware that I'm the unusual one beating my egg whites
         | instead because I like my pancakes extra-fluffy.)
        
         | [deleted]
        
       | circuit10 wrote:
       | > There is no "intelligence" going on here. It's not "thinking".
       | But it can still perform calculus by just stochastically
       | emulating how average text on the internet looks.
       | 
       | It always annoys me when people say this, I'll try to explain
       | why.
       | 
       | There are two possible definitions of "intelligence" you could
       | use here; the ability to process information to get something
       | done, and something hand-wavy about human consciousness.
       | 
       | GPT-4 clearly has some ability to process information to get
       | something done. You might say that by this definition [insert
       | trivial thing] is intelligent, but it doesn't have to be a binary
       | thing of intelligent or not. I think it's fine to say maybe a
       | calculator has very low intelligence (but not necessarily
       | nothing), GPT-4 is more intelligent than that and humans are much
       | more intelligent again. GPT-4 has many limitations compared to
       | humans, but I think that just makes it less intelligent, rather
       | than disqualifying it from having intelligence at all. Sure, it's
       | just predicting text, but that's a task that requires a level of
       | intelligence. You might say it's not general like humans, but I'd
       | say it has a much better ability to generalise than something
       | like an image labelling AI, so that feels like it's at least
       | getting somewhere.
       | 
       | The second definition is useless for practical purposes because
       | it's not measurable or observable in any way, so it's not useful
       | to use that.
       | 
       | So I feel like this is something people say to reassure
       | themselves that it can never get to human level, and is
       | fundamentally different to human intelligence, whereas I think
       | it's somewhat similar but at a lower level.
        
         | [deleted]
        
         | fhd2 wrote:
          | Yeah, it's just arguing semantics. We do need better vocabulary
          | and to stop it with the anthropomorphism, IMHO.
          | 
          | Intelligence is already ambiguous in humans; see IQ tests. It's
          | just not linear, much less binary. Whether something is
          | deterministic or stochastic, and what the error rate for a
          | specific task is, those are more useful questions to me.
        
         | larryfreeman wrote:
          | I suspect that there is something other than intelligence going
          | on, which will become obvious over the next few years.
         | 
          | There was a horse, "Clever Hans", who appeared to have the
          | ability to answer surprisingly complicated mathematical
          | questions. Did "Clever Hans" have mathematical intelligence?
          | Not at all. He was responding to a cue unknowingly being given
          | by his trainer.
         | 
          | I suspect the same thing is happening with ChatGPT. What if all
          | that is happening is that the text is being formulated in
          | response to very complicated cues that are implicit in the
          | statistical analysis?
         | 
         | https://en.wikipedia.org/wiki/Clever_Hans
        
           | gwd wrote:
           | A month or so ago I was doing some analysis on our mailing
           | list traffic. I had a complex SQL query (involving tables
           | mapping variations of email addresses to names, and then
           | names to companies they worked for within specific date
           | ranges), that I'd last modified a year previously (the last
           | time I was doing the same sort of analysis), and didn't feel
           | like wrapping my head around the SQL again; so I pasted it
           | into GPT-4 and asked it, "Can you modify this query to group
           | all individuals with less than 1% total contributions into a
           | single 'Other' category?" The query it spat out worked out of
           | the box.
           | 
           | Whatever it's doing, at least for code, it's not a glorified
           | Markov chain -- there's _some_ sort of a model in there.
        
             | larryfreeman wrote:
              | I agree. The model is where the intelligence is; it is the
              | compressed intelligence latent in the training data.
              | 
              | I am arguing, similar to John Searle, that the processing
              | is not intelligent. The model is a Searlean rulebook.
             | 
             | https://en.wikipedia.org/wiki/Chinese_room
        
               | gwd wrote:
               | I've always disagreed w/ Searle re the Chinese Room. My
               | guess is that Searle never built an adder circuit from
               | logic gates: combining irrational elements together into
               | something rational is the core magic of computer science.
               | 
               | If you want to see someone asking humans questions where
               | they consistently fail to be rational, to the extent that
               | they sometimes seem to approximate a stochastic parrot,
               | read Thinking Fast and Slow by Daniel Kahneman. (It might
               | actually be interesting to give GPT-4 some of the
               | questions in that book, to see how similar or different
               | they are.)
        
           | H8crilA wrote:
            | Clever Hans is just a proxy. ChatGPT and other LLMs obviously
            | can process information on their own. These two have nothing
            | in common; even GPT-3 would have noticed this.
        
             | larryfreeman wrote:
             | Let us disagree on what is "obvious". Given an input and an
             | output, you believe that the complexity of the output
             | proves that intelligence takes place.
             | 
             | I agree that ChatGPT is more than a proxy. Unlike Clever
             | Hans, it is processing the content of the question asked.
             | But it is like Clever Hans in that the query is processed
             | by looking for a signal in the content of the data used to
             | train ChatGPT.
             | 
              | The real question is: where does this intelligent behavior
              | come from? Why does statistical processing lead to these
              | insights?
             | 
              | I believe that the processing is not intelligent primarily
              | because I see that holes in the available data lead to
              | holes in reasoning. The processing is only as good as the
              | dynamics of the content that is being processed. This is
              | the part that I believe will become obvious over time.
        
               | H8crilA wrote:
               | I just said that they process information on their own,
               | and this is indeed obvious - you can download and run
               | LLaMA on an airgapped machine.
        
               | larryfreeman wrote:
               | Agreed. LLMs process information on their own.
               | 
               | I thought you were saying it was "obvious" that the
               | processing demonstrated intelligence.
               | 
                | My point was that the level of intelligence shown is
                | relative to the quality and quantity of the data used for
                | training. The data is where the intelligence is, and the
                | model is a compression of that latent intelligence.
        
       | sholladay wrote:
       | I was doing very well with almost every question until it got to
       | the part where GPT-4 had to draw "hello" as ASCII art. I gave
       | 100% confidence that it would get it correct, because in practice
       | GPT-4 has always been excellent at that for me, with only minor
       | aberrations I might fix by hand. But no, in the quiz, GPT-4
       | failed spectacularly, not even using the right letters. That was
       | interesting.
        
         | ec109685 wrote:
          | It's terrible at visualizing how its output appears on a
          | console.
         | 
         | Also, terrible at providing phrases that fit a pattern. "Like
         | 143 means I love you, and give me more phrases like that".
         | 
         | Still surprised it's so good at drawing (that birthday cake was
         | really close!)
        
       | thomasahle wrote:
       | In my case the forecast incorrectly judged GPT-4 as having solved
       | the "reject the premise" question of President George H. W. Bush
       | vomiting in the lap of Emperor Akihito.
       | 
       | However the output GPT generated was:
       | 
       | > GPT-4: President George H. W. Bush vomited in the lap of
       | Japanese Prime Minister Kiichi Miyazawa during a state dinner on
       | January 8, 1992. The incident occurred due to a sudden bout of
       | gastroenteritis. Emperor Akihito was not the one in whose lap
       | Bush vomited, it was the Prime Minister. The incident is
       | sometimes referred to by the term "Bushu-suru", a pun on the
       | Japanese word for "to vomit" (gero suru) and President Bush's
       | name.
       | 
       | I don't understand why this was judged as "Correct! You guessed
       | that GPT-4 could solve the question, and it can! 44% of people
       | got this question correct." when the resolution criteria clearly
       | stated:
       | 
       | > The model does not have to say that actually it was the prime
       | minister who Bush vomited on, but it must not just give a year,
       | or accept the premise as true.
       | 
       | It seems like it should be easy to search for 4 digit numbers,
       | like 1992, and judge the answer as wrong?...
        
         | theptip wrote:
         | > Emperor Akihito was not the one in whose lap Bush vomited
         | 
         | It didn't give just a year, or accept the premise as true. It
         | gave the correct answer, quite obviously.
        
         | timfsu wrote:
         | Today I learned...
         | 
         | https://en.wikipedia.org/wiki/George_H._W._Bush_vomiting_inc...
        
       | jondwillis wrote:
        | today i learned i am bad at minimizing my log loss when guessing
        | about GPT-4's ability to respond well to somewhat bad or outright
        | bad prompts.
        
       | RandomWorker wrote:
        | This was fascinating, and also a nice check. I think I'm
        | overoptimistic about A.I., which showed in my score.
        
       ___________________________________________________________________
       (page generated 2023-09-02 23:00 UTC)