[HN Gopher] A GPT-4 capability forecasting challenge
___________________________________________________________________

A GPT-4 capability forecasting challenge

Author : dwighttk
Score  : 162 points
Date   : 2023-09-02 10:40 UTC (12 hours ago)

(HTM) web link (nicholas.carlini.com)
(TXT) w3m dump (nicholas.carlini.com)

  | [deleted]
  | colordrops wrote:
  | The problem with things like this is that what GPT-4 can do is
  | reduced over time due to a maze of restrictions that OpenAI keeps
  | adding on top of it.
  |
  | You can tell when something is intentionally nerfed when GPT
  | replies with the exact same canned answer about why it can't
  | answer some question. It literally gaslights you.
  |
  | For instance, I give you this challenge: GPT-4 will tell you that
  | it is not aware of anything after September 2021. If you ask it a
  | random fact, like the world's largest animal, or what happened on
  | September 11, 2001, it will give you an answer. But try to get it
  | to give you the latest event it is aware of. You can ask six ways
  | till Sunday and it will always give you the same verbatim answer
  | about why it can't answer. It will literally lie about what it is
  | capable of. It's pretty clear that for some reason OpenAI doesn't
  | want you to know the exact last date of their training data.
  | [deleted]
  | leodriesch wrote:
  | This is a difficult question, I think, independent of the
  | restrictions that OpenAI imposes on GPT-4.
  |
  | The model does not know what it knows; that's why it sometimes
  | hallucinates instead of saying it doesn't know. But to answer
  | with the latest event it knows, it has to know which events it
  | knows.
  | colordrops wrote:
  | I thought that at first, but it doesn't have problems with
  | facts other than dates, and it does answer about dates
  | distant from September 2021, and furthermore it uses the
  | exact same canned response when you probe its limits. I
  | don't think it's a natural limitation of the model.
  | avereveard wrote:
  | Not a great test. On the flag test, I guessed GPT couldn't do it,
  | and well, it couldn't: the stars are misaligned and cut out of
  | the flag area, so it's not accurate.
  |
  | The test says, though, that the flag is accurate, even if it
  | isn't, then scolds me for how wrong I am.
  |
  | Here's the render: https://i.imgur.com/jZVWjRx.png
  | benxh wrote:
  | Yeah, wildly inaccurate.
  | dwighttk wrote:
  | "Resolution Criteria: On questions that may have ambiguous
  | answers, I will clarify the acceptance criteria. Here, for
  | example, I will copy/paste the code to a file called
  | index.html, open the page, and I expect to see a US flag. If
  | I click on the flag it should change color. The flag must be
  | immediately recognizable: red/white stripes, blue background,
  | 50 stars. It does not have to be dimensionally accurate, but it
  | should be better than a flag I would draw by hand.
  |
  | The text that I show in the question box is the only command
  | given to GPT-4. No other context is provided. Here, for
  | example, I've only fed GPT-4 with these 25 words you see
  | above."
  | capableweb wrote:
  | Sure, but does the question really have any ambiguous
  | answers?
  |
  | For reference, this is/was the question: "Write a html page
  | with javascript/canvas to draw the US flag that changes color
  | when I click on it. Make the flag perfectly accurate."
  |
  | The last part, "make the flag perfectly accurate", made me
  | think that it has to be 100% accurate.
  | wavemode wrote:
  | That tripped me up, too.
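For concreteness, the kind of page the prompt above asks for can be sketched in a few dozen lines. This is a minimal illustration written for this dump, not GPT-4's answer and not the site's reference solution; the colors, dimensions, and the "changed" palette are arbitrary choices:

```html
<!DOCTYPE html>
<html>
<body>
<canvas id="flag" width="760" height="400"></canvas>
<script>
const canvas = document.getElementById("flag");
const ctx = canvas.getContext("2d");
let altColors = false; // toggled by clicking the flag

// Draw a 5-pointed star centered at (cx, cy) with outer radius r.
function star(cx, cy, r) {
  ctx.beginPath();
  for (let i = 0; i < 10; i++) {
    const ang = -Math.PI / 2 + (i * Math.PI) / 5;
    const rad = i % 2 === 0 ? r : r * 0.4;
    ctx.lineTo(cx + rad * Math.cos(ang), cy + rad * Math.sin(ang));
  }
  ctx.closePath();
  ctx.fill();
}

function draw() {
  const W = canvas.width, H = canvas.height;
  const stripe = H / 13;
  const [red, white, blue] = altColors
    ? ["#007a33", "#ffd700", "#551a8b"] // arbitrary "changed" palette
    : ["#b22234", "#ffffff", "#3c3b6e"];
  // 13 stripes, starting and ending with red.
  for (let i = 0; i < 13; i++) {
    ctx.fillStyle = i % 2 === 0 ? red : white;
    ctx.fillRect(0, i * stripe, W, stripe);
  }
  // Blue canton: 7 stripes tall, 40% of the flag width.
  const cw = W * 0.4, ch = stripe * 7;
  ctx.fillStyle = blue;
  ctx.fillRect(0, 0, cw, ch);
  // 50 stars: 9 rows alternating 6 and 5, all kept inside the canton.
  ctx.fillStyle = white;
  const xs = cw / 12, ys = ch / 10;
  for (let row = 0; row < 9; row++) {
    const cols = row % 2 === 0 ? 6 : 5;
    for (let col = 0; col < cols; col++) {
      const x = xs * ((row % 2 === 0 ? 1 : 2) + 2 * col);
      const y = ys * (row + 1);
      star(x, y, ys * 0.45);
    }
  }
}

canvas.addEventListener("click", () => { altColors = !altColors; draw(); });
draw();
</script>
</body>
</html>
```

The star layout (9 rows alternating 6 and 5 stars, kept inside the blue canton) is the part GPT-4's recorded answer reportedly gets wrong, per the screenshots above and the code excerpt quoted later in the thread.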
I was often reading the prompt and | confusing it for the evaluation criteria, when they are | actually two separate things. The _prompt_ stated the flag | must be perfectly accurate, yes, but the evaluation | criteria allowed the flag to not be quite right. | isoprophlex wrote: | Not to mention the stochastic nature of these models, so while | it apparently failed once, the page does not tell us anything | about how it would perform on a given task in the limit of many | many trials. | [deleted] | toddmorey wrote: | And even ChatGPT admits the flag is inaccurate in the response. | keat0 wrote: | [dead] | keat0 wrote: | [dead] | bgribble wrote: | Just my take, but I think that if you care how you scored or what | the scoring criteria are you are missing the point. This "quiz" | is just a guided meditation on what LLMs are better and worse at | and how that interacts with our expectations. I found it to be | very thought-provoking and I learned a few things. I have no idea | how I "scored". | H8crilA wrote: | It's also a good reminder of how absolutely terrible most | people are at gambling (judged by log-loss, like in the Kelly | criterion). At the end you find out that almost 80% of people | did worse than leaving the guess at 50%. | theptip wrote: | Right, I think the general point is that log-loss is a bit | unintuitive as a scoring mechanism since it really penalizes | overconfident wrong answers, much more strongly than | underconfident right answers (at least in the domain where | you are not asking questions with very high probability | answers). | | There is absolutely a metagame to this game. Those who have | spent time forecasting on Metaculus will do much better, for | example. | H8crilA wrote: | The sad thing about it is that log loss is optimal when | gambling (mental shortcut - see Kelly Criterion). This | partly explains why such a large proportion of people lose | much more than they win. Not only the bets are suboptimal, | the ruin is achieved earlier than necessary. | | Note that this applies also outside of a casino, gambling | (making wagers with imperfect information) is inherent to | life. | dingocat wrote: | I have multiple questions regarding the methods of this test. | | The biggest one is that, well... The test doesn't aim to see what | GPT-4 can do and how well it does it, only whether the | participant can guess the (possibly cherry-picked) answer the | author decided on. In short, we don't know if he sampled answers | and decided on the most probable answer (akin to consensus | voting/self-consistency[1]), or if he asked a question and chose | the first one. | | Maybe GPT-4 guesses the correct answer for a question 80% of the | time, but he got unlucky? You don't know, the author doesn't tell | you. The answers are generated ahead of time and are the same | every time you go through the test. | | [1] https://doi.org/10.48550/arXiv.2203.11171 | thomasahle wrote: | > only whether the participant can guess the (possibly cherry- | picked) answer the author decided on | | My understanding is that the quiz samples a new GPT-4 answer | every time you use it. That's why you put a confidence rather | than a 0%/100% answer. There's always a chance it'll fail by | freak accident. | Sophira wrote: | If you're basing this on the animation used when revealing | the answer, that's a fake effect. The source code[0] reveals | that there's a typewriter effect that plays out when you | select to answer the question. 
| | Also, the commentary on the answers refers to specific parts | of the answers. For it to be as in-depth as it is, it would | have to be either pre-written or the commentary _also_ | generated by GPT on the fly. (And of course it wouldn 't make | sense to do that given the nature of the quiz.) | | [0] https://nicholas.carlini.com/writing/llm- | forecast/static/que... | PaulDavisThe1st wrote: | > the [ ... ] answer the author decided on | | The questions mostly have correct or incorrect answers, and | where there is some leeway, the author provides a fairly | detailed explanation of what they would consider correct in | each case. Do you have some specific criticism of an answer | that you believe the author gets wrong? | Ozzie_osman wrote: | This is cool but would be infinitely more valuable if it could | explain _why_ GPT4 is good or bad at each task, to help build | intuition. | sebzim4500 wrote: | Would be nice, but no one can really do that at the moment. | | There are specific tasks (especially character level ones) | which are hard due to the tokenizer, but even that isn't all | that convincing since there are plenty of character level tasks | which GPT-4 can do pretty well. | | If you use it a lot you build an intuition for what kinds of | tasks it will do well on, but it's not exactly rigorous. | capableweb wrote: | > Would be nice, but no one can really do that at the moment. | | Why not? Someone could definitely build up database on why | GPT is bad at some things and good at others. There is | already good explanations for why it's terrible at math, why | it doesn't handle single characters/numbers well and so on. | mgl wrote: | Asking ,,why" may be a little bit too much with regards to a | non-deterministic token prediction machine trained on an | unknown set of strings. Just saying. | ryanjshaw wrote: | I'm on the fence about the scoring system. | | You're really being tested on (at least) 4 different things: | | (1) whether you think an LLM can answer the correctly | | (2) whether the answer has appeared in the training set | | (3) how much non-determinism will play a role in the LLM response | (i.e. 0-shot capability) | | (4) how rational you're feeling that day (or, how well educated | you are in statistics) | | I was familiar with many of these questions in my own experience, | and have seen completely different outcomes from what the quiz | determined was the correct answer. I agree with others here that | non-determinism can really mess things up here and is really | assessing a 0-shot score, which IMO understates how LLMs are | actually used (iterative enhancements, many-shot Q&A). | | Finally, the scoring system tickled my ego and encouraged me to | try to make up for prior errors, with disastrous effects (I was | well aware that I should just go with 0.5 when uncertain): | > You answered 53.57% of questions correctly with an average log- | loss of 1.233. This means you would have scored better by just | leaving your guesses at 50% for every single question. | > On average there were 71.22% of people who did better at this | game than you... 
If you had been better calibrated, you could have scored in the top 14.09% [1]
  |
  | The site implicitly acknowledges it's a questionable scoring
  | mechanism when it points out:
  | > there are 78.09% of people who performed worse than if they had
  | never changed the default predictor away from 50-50 for each
  | question
  |
  | If there is a simple way to game the scoring, then you can't know
  | if the score is accurately reflecting people's confidence, or
  | just their rationality/statistical knowledge.
  |
  | [1] https://nicholas.carlini.com/writing/llm-forecast/final/3388...
  | Mathnerd314 wrote:
  | Well, there is this notion of "calibrating" the score. It is
  | well known that most humans are bad at estimating calibrated
  | probabilities unassisted. The system could have been designed
  | to accommodate this user-interface difficulty. For example, I
  | am sure there is enough data floating around that you could map
  | a simple 5-choice Likert scale to some calibrated
  | probabilities, without making any assumptions. But instead it
  | is just a raw slider, with nothing marked besides the default
  | 50-50, not really great for input. Even a simple "yes/no"
  | choice (translating to fixed calibrated probabilities around
  | 25%/75%) would probably result in better log-loss scores
  | overall.
  | TuringNYC wrote:
  | How are you evaluating whether LLM answers are right or wrong?
  | Because I saw some answers judged wrong that were actually right,
  | and some judged right that were potentially wrong. Are you just
  | looking for keywords, etc.? Or is this all run beforehand and
  | graded by humans?
  | IAmGraydon wrote:
  | Yeah, this is completely broken judging from my experience.
  | Often GPT would get the answer wrong and the site would claim
  | that it was correct.
  | TuringNYC wrote:
  | Anyone know a good method to judge right/wrong answers from
  | LLMs? I can see keyword solutions being brittle. Perhaps
  | another LLM?
  | ajani wrote:
  | The challenge is flawed.
  |
  | I asked ChatGPT one of the questions from the quiz, one the quiz
  | claims GPT can't solve. But it solved it.
  |
  | Prompt: Write out the word "hello" as an ascii art drawing with #
  | and _
  |
  | Output: (a multi-line ASCII-art rendering of "hello", mangled in
  | this dump; it is drawn with characters like |, _, / and \ rather
  | than # and _)
  |
  | I guess ChatGPT isn't raw GPT-4, or the quiz is using some older
  | model.
  | JaumeGreen wrote:
  | ChatGPT is (was) not good at ASCII art.
  |
  | Some months ago I tried to make it draw me an ASCII rose and
  | some text. I even tried providing it the ASCII art for the rose
  | and the text.
  |
  | Finally I did it by hand.
  |
  | BTW, in your example it's not using only # and _, it's using
  | other ASCII symbols. Depending on the criteria it could be
  | considered wrong.
  | josephg wrote:
  | I don't think that's a correct solution to the problem.
  |
  | The prompt asked for an ASCII art drawing made from the # and _
  | characters. But the output also uses |/\() characters (and it
  | doesn't use a # anywhere).
  | ajani wrote:
  | Ok, but it's still acceptable by any common sense standard.
  | Besides, the challenge's output is completely off. It's not just
  | that it uses only one of the characters; it spells out
  | something else. Which is not the case here.
  | amelius wrote:
  | If someone put a gun to your head and asked you to draw
  | hello using "#" and " ", would you use other characters
  | like "/"?
  | jakderrida wrote:
  | I imagine I'd spend the next few minutes quietly
  | pondering what sort of poor choices I made in life that
  | led me here and try my hardest to embrace an absurd
  | ending to an absurd journey.
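As an aside on the grading question TuringNYC raises above, and on josephg's character objection: the purely mechanical part of such a check is trivial to write, and just as trivially brittle. A small sketch (the strings below are made-up stand-ins, not the quiz's stored outputs):

```js
// Does an answer use only the characters the prompt allowed?
const onlyAllowedChars = (out) => /^[#_\s]*$/.test(out);
console.log(onlyAllowedChars("##_##\n#___#"));   // true
console.log(onlyAllowedChars("| |__  ___| |"));  // false: "|" was not allowed

// A naive keyword grader of the kind TuringNYC suspects -- easy to fool.
const mentionsAll = (out, keywords) => keywords.every((k) => out.includes(k));
console.log(mentionsAll("The answer is definitely not 1992.", ["1992"])); // true anyway
```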
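And to make the scoring discussion above (H8crilA, ryanjshaw, Mathnerd314) concrete: assuming the site uses standard binary log-loss, which the quoted results are consistent with but which isn't spelled out in the thread, here is a toy comparison with made-up numbers. A guesser who is right 70% of the time but always answers at 95% confidence scores worse than someone who never moves the slider off 50%, which is exactly the pattern several commenters report:

```js
// Binary log-loss: heavily penalizes confident wrong answers.
const logLoss = (p, y) => -(y ? Math.log(p) : Math.log(1 - p));
const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Ten hypothetical questions: 1 = GPT-4 solved it, 0 = it didn't.
const outcomes = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1];

// Always predicting "solves it" at 95% confidence (70% accurate here).
const overconfident = outcomes.map(() => 0.95);
// Never moving the slider off the 50% default.
const baseline = outcomes.map(() => 0.5);

const score = (preds) => mean(preds.map((p, i) => logLoss(p, outcomes[i])));
console.log(score(overconfident).toFixed(3)); // ~0.935
console.log(score(baseline).toFixed(3));      // ~0.693 (ln 2) -- the better score
```

The Kelly-criterion connection H8crilA mentions is the same mathematics: Kelly betting maximizes expected log growth, so systematically overstating your edge costs you in exactly this way.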
| josephg wrote: | > Ok, but it's still acceptable by any common sense | standard. | | I don't know that it is. It's clearly a great ascii art | drawing, but I don't think chatgpt gets full marks on the | test here. It just isn't following the prompt closely | enough. | ajani wrote: | Ok I see your point. Did you look at the output in the | challenge? It's very different. I wonder why? Different | seeds maybe. | [deleted] | whimsicalism wrote: | The fundamental problem with HN and fun little competitive games | like this that show you how you stack up is that the comments | will be full of people complaining/nitting about the | grading/system/etc. because the typical HN commentator is often | the type of person who doesn't take being not great at something | well. | sashank_1509 wrote: | Yeah the OP made a really interesting project. Turns out I was | absurdly over confident in my predictions which surprised me | but I found OPs grading to be fair and would not complain about | it. | [deleted] | capableweb wrote: | Why can't HN just be "full of people complaining/nitting about | the grading/system/etc" without you assigning some "lack of | character" meaning those people? | | Your comment as it stands right now is basically a thinly | veiled ad-hominem. | dwighttk wrote: | occasionally interesting discussion arises around the grading | system, but yeah, often a lot of grousing | layer8 wrote: | The quiz asks how likely GPT-4 is to solve a given question | correctly, and the user has to enter their guess as a number | between 0 and 1 (a probability). However, the site then doesn't | provide the actual probability of GPT-4 providing a correct | answer, to compare the guess against. This is, presumably, | because there's no practical way to determine that probablity | with much precision. Also, in practice "correct" isn't always | black and white, e.g. there's "not wrong but also not completely | right". | | So I'm a bit confused about the whole premise. | croes wrote: | Isn't your input your confidence that GPT-4 gives the correct | answer, like I'm 100% sure GPT-4 gives the right answer for the | capital of France but I'm only 50% sure it gets the ASCII art | correct, and 0% for another question which means 100% GPT is | wrong? | | Because how could he know the probability of getting the | correct answer? He just tried and then it's a yes or no. | | And if it's right/wrong shouldn't GPT right/wrong 100% of the | time for the same question and the same version? | layer8 wrote: | Regarding your last question, ChatGPT isn't deterministic. It | can certainly give a correct answer one time and an incorrect | answer another time, for the same prompt. | croes wrote: | For the same prompt and same version? | | That would be deal breaker if the same prompt gives | different results. | | Would also make the whole prompt engineering thing pretty | useless. | claytonjy wrote: | Yes, same prompt and same version. GPT-4 is especially | stochastic, even with temperature=0. | croes wrote: | So all these comments claiming that an article about the | GPT's responses is wrong are useless. | | There are also such comments on this article. | | https://news.ycombinator.com/item?id=37361521 | whimsicalism wrote: | I firmly disagree with your latter two statements, but | yes it is non deterministic | capableweb wrote: | > Regarding your last question, ChatGPT isn't | deterministic. It can certainly give a correct answer one | time and an incorrect answer another time, for the same | prompt. 
| | It's supposed to be deterministic if you set temperature to | 0.0, but seems that doesn't work well in GPT4 compared to | earlier versions... | Yiin wrote: | It is not deterministic since text-davinci-001. | | https://152334h.github.io/blog/non-determinism-in-gpt-4/ | croes wrote: | Good to know, I thought it was deterministic, especially | because of all the comments on other articles where | people wrote that the author was crazy for getting a | different answer to the same question. | layer8 wrote: | > Isn't your input your confidence that GPT-4 gives the | correct answer | | You may be right that that's the intent, however what's the | point (other than collecting data about user confidences)? If | I enter 0.3 and GPT provides a correct answer, then that | doesn't mean that the 0.3 was somehow wrong. | croes wrote: | Isn't the whole point to show you how right your confidence | is about GPT's capabilities? | | At least the results are about the quiz taker and his | confidence. | jefftk wrote: | In that case 0.3 would be more wrong than 0.4 and less | wrong than 0.2. The closer your predictions are to reality | over a bunch of questions, the better you understand | reality. | layer8 wrote: | You can't really say that for a single data point. The | 0.3 may be completely correct. Now, if you try ten times, | things might be different. | whimsicalism wrote: | It's a noisy measurement of how right you were. | andrewmutz wrote: | The tool is testing your ability to predict whether or not | GPT-4 can get a task done correctly. You are supposed to | provide your confidence level that it can do a task or not. | | If you answer all the questions, at the end the tool will tell | you how well calibrated your beliefs about GPT-4s capabilities | are, and how that calibration compares to other users of the | tool. | DavidSJ wrote: | You can still determine the log likelihood of your predictions | across the whole set of questions, even if you only have a | single sampled response for each question. | kqr wrote: | Brier ("quadratic") store is another popular way to evaluate | ensembles of forecasts. | falcor84 wrote: | I stopped playing after the challenge of creating a line-based | canvas drawing of the word "Hello". The site says "Wrong! You | guessed that GPT-4 could solve the question, but it can't!", | whereas while ChatGPT made a mistake there, it clearly had a good | approach. | | I think that this challenge is entirely unfair, and the LLM's | should not be expected to write perfect code on the first try, | but rather to have something good enough to then run, test and | iterate on. Essentially, it should be compared against the first | version of code I would write myself before having had a chance | to test it yet. From my experience, with ChatGPT and Copilot, | when I approach it in this iterative way, it's a great boon to my | productivity, and I don't end up blaming neither it nor myself in | the times when it's not quite accurate, just like I wouldn't | blame a student for making silly mistakes when writing code on a | paper-based exam. | JimDabell wrote: | I think both are important things to measure. You are | describing a situation where there is a human in the loop. This | test measures how reliable GPT-4 is when there isn't a human in | the loop. Right now, LLMs have vast scope as long as there's a | human involved, but if you can't put a human in the loop this | limits their use dramatically. 
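The determinism question in the subthread above is easy to probe directly. A rough sketch, assuming Node 18+ (for the global fetch), an OPENAI_API_KEY environment variable, and the public chat-completions endpoint; the prompt is arbitrary, and this is not code from the article or the quiz:

```js
// Call the chat-completions endpoint twice with temperature 0 and compare.
const ask = async (prompt) => {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4",
      temperature: 0,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
};

(async () => {
  const a = await ask("List three prime numbers between 100 and 200.");
  const b = await ask("List three prime numbers between 100 and 200.");
  console.log(a === b ? "identical outputs" : "outputs differ at temperature 0");
})();
```

As the linked post discusses, GPT-4 in particular often returns slightly different completions across runs even at temperature 0, which is worth keeping in mind when a single sampled answer decides "can" vs. "can't".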
The better LLMs get at getting | things right without human oversight, the more things they can | be used for. | ben_w wrote: | If any AI, be it an LLM or otherwise, could reliably operate | at professional level without any human intervention, how | many people would be permanently unemployable? | JimDabell wrote: | The entire point of technology - practically its | _definition_ - is to reduce work. For centuries, people | have been dreaming of a day when people don't have to work | and can get robots to do it all. | | The problem is not AI taking away work - that's a _great_ | thing - the problem is that our current economic system is | not designed for this. Fixing our economic system is easier | and gives much better results for people than trying to | stop technological progress. | ben_w wrote: | I'm not trying to suggest progress is bad. | | My point is more: gosh isn't it odd that people are | complaining it can't do all the things, given how | radically different everything will be when that does | finally come to pass? | falcor84 wrote: | Agreed in general, but I'm actually thinking more about | having a code interpreter in the loop. AutoGPT might be a | step in the right direction. It also might be a step towards | the end of human society as we know it. Probably both. | andai wrote: | And yet it used to be that code would be handed in on paper, | and you'd get the output days (weeks?) later. I heard people | quickly learned to double check their programs! | | Though I think it's computationally cheaper for GPT to actually | run the code than to double check its work... | [deleted] | capableweb wrote: | > I stopped playing after the challenge of creating a line- | based canvas drawing of the word "Hello". | | Similarly, stopped playing when the question was "Write a html | page with javascript/canvas to draw the US flag that changes | color when I click on it. Make the flag perfectly accurate." | and it generated a flag that wasn't "perfectly accurate" | (https://i.imgur.com/WhyRsYa.png, notice the position of the | top/left stars) but then told me "Wrong! You guessed that GPT-4 | could not solve the question, but it can! 71% of people got | this question correct." | | I'm not sure how the validation is done, seems to be manually | hardcoded or something, but it seems it's not very reliable. | dwighttk wrote: | "Resolution Criteria: On questions that may have ambiguous | answers, I will clarify the acceptance criteria. Here, for | example, I will copy/paste the code to a file called | index.html, open the page, and I expect to see a US flag. If | I click on the flag it should change color. The flag must be | immediately recognizable: red/white stripes, blue background, | 50 stars. It does not have to be dimensionally accurate, but | it should be better than a flag I would draw by hand." | capableweb wrote: | I don't think "questions that may have ambiguous answers" | applies when you use a term like "perfectly accurate" which | has a very specific meaning. | jazzyjackson wrote: | I'm ok with a mistake on the first try, what would really | impress me if it could tell whether it made a mistake. 
In my experience GPT is tuned to be totally deferential: "you're right, I apologize, let me try again!", with no spine to tell me "yeah, the task looks good."
  |
  | It has no sense of whether a task has been fulfilled.
  |
  | I've never seen any of the recursive models show convergence on
  | a task; it seems without a human hand they fall apart.
  |
  | An exception I've seen is with the Wolfram plugin: it seems to
  | at least try different approaches until it arrives at an answer
  | to present to you.
  | capableweb wrote:
  | > In my experience GPT is tuned to be totally deferential:
  | "you're right, I apologize, let me try again!", with no spine to
  | tell me "yeah, the task looks good."
  |
  | I've managed to work around this in GPT (4 at least) by
  | having a system prompt that forces GPT to challenge me and
  | not blindly accept what I say without verifying it first.
  | circuit10 wrote:
  | > In my experience GPT is tuned to be totally deferential:
  | "you're right, I apologize, let me try again!", with no spine to
  | tell me "yeah, the task looks good."
  |
  | This is definitely annoying, but considering their tendency
  | to hallucinate facts it's usually preferable to something
  | like this: https://scoop.upworthy.com/microsoft-chatbot-fights-with-hum...
  |
  | But I do think it should be toned down a bit, especially if
  | the user is just saying something like "are you sure that's
  | right?"
  | IanCal wrote:
  | > it has no sense of whether a task has been fulfilled
  |
  | I've definitely seen it say its implementation is fine if
  | just asked to identify problems or compare to the original
  | problem statement (and alternatively fix issues it
  | identifies).
  | scotty79 wrote:
  | My intuition that got confirmed is that GPT fails at anything
  | visual. Letters, shapes. It's trying but failing every time.
  |
  | It succeeds only if the thing was drilled down hard in training,
  | like the American flag or implementing tic-tac-toe (but not
  | predicting the best move on the fly).
  | qwertox wrote:
  | The "star" section of the US flag:
  |
  | ```
  | // Draw 50 stars: 9 rows of alternating 6 or 5 stars
  | ctx.fillStyle = white;
  | for (let row = 0; row < 9; row++) {
  |     for (let col = 0; col < (row % 2 === 0 ? 6 : 5); col++) {
  |         let x = 16 + col * 32 + (row % 2) * 16;
  |         let y = 16 + row * 32;
  |         ctx.beginPath();
  |         ctx.arc(x, y, 4, 0, Math.PI * 2);
  |         ctx.fill();
  |     }
  | }
  | ```
  |
  | Effectively it draws circles, and the rectangle that contains
  | them is rotated by 90 degrees, so that a section of the blue
  | rectangle is not covered and the dots end up partially over the
  | red stripes.
  |
  | At least when I input it into ChatGPT with GPT-4, that's the
  | result.
  |
  | And the rendered solution by the site has the stars offset so
  | that some are not fully in the blue rectangle. "Accurate" is
  | something different.
  | [deleted]
  | msoad wrote:
  | The type of prompt that asks it to invent a new language (e.g.
  | use only these letters) always fails. I wonder if it has to do
  | with it being a "language" model?
  | johndough wrote:
  | Most large language models these days are trained on "tokens"
  | instead of characters. A token consists of multiple characters.
  | This makes it extremely difficult to learn character-level
  | tasks. So why use tokens instead of characters in the first
  | place? The reason is that by using tokens, multiple characters
  | can be generated at once, which makes training and text
  | generation cheaper. (A toy sketch of this token-merging idea
  | follows just below.)
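As an aside on the token point: what follows is a toy sketch of the byte-pair-encoding merge loop described in the replies below, written for this dump. Real GPT tokenizers operate on bytes over a huge corpus with tens of thousands of merges, so treat `trainBPE` as illustrative only:

```js
// Toy byte-pair encoding: repeatedly merge the most frequent adjacent pair.
function trainBPE(corpus, numMerges) {
  let seq = corpus.split("");                  // start from single characters
  const merges = [];
  for (let m = 0; m < numMerges; m++) {
    // Count adjacent pairs in the current token sequence.
    const counts = new Map();
    for (let i = 0; i < seq.length - 1; i++) {
      const pair = seq[i] + "\u0000" + seq[i + 1];
      counts.set(pair, (counts.get(pair) || 0) + 1);
    }
    // Pick the most frequent pair (it must occur at least twice).
    let best = null, bestCount = 1;
    for (const [pair, c] of counts) {
      if (c > bestCount) { best = pair; bestCount = c; }
    }
    if (!best) break;
    const [a, b] = best.split("\u0000");
    merges.push(a + b);
    // Replace every non-overlapping occurrence of the pair with one token.
    const next = [];
    for (let i = 0; i < seq.length; i++) {
      if (i + 1 < seq.length && seq[i] === a && seq[i + 1] === b) {
        next.push(a + b);
        i++;
      } else {
        next.push(seq[i]);
      }
    }
    seq = next;
  }
  return { tokens: seq, merges };
}

const { merges } = trainBPE("hello hello hello help hero", 4);
console.log(merges); // [ "he", "hel", "hell", "hello" ]
```

Once "hello" is a single learned token, questions like "which letters does it contain?" or "draw it one character at a time" no longer line up with what the model actually sees, which is one plausible reading of the character-level failures discussed above.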
| | OpenAI has this website where you can see how text is | decomposed into tokens: https://platform.openai.com/tokenizer | aeonik wrote: | How is the set of tokens selected for various LLMs? | | My intuition tells me there are important symbolic patterns | in different layers of tokens. If they are automatically | generated, I'd bet there are interesting insights to be | gleaned in the tokeizer itself. | PeterisP wrote: | They are automatically generated, the algorithms have a | bunch of tricks, but essentially they merge together the | most frequent token pairs until a desired fixed vocabulary | size is reached. | | So, for example (looking at GPT-3 tokenizations - you can | test them at, for example, | https://platform.openai.com/tokenizer) "517" is a single | token, but "917" is two tokens; and there's no explicit | link whatsoever between the token "517" and tokens "5" and | "17" other than what can be learned from data. This works | well enough for almost all tasks, but fails in edge cases | like when someone makes up a toy challenge that asks how | many fives are in a large number. | messe wrote: | The token set (vocabulary) is usually generated by using | Byte Pair Encoding on a corpus that you think represents | your training set well. | | BPE starts with a set of tokens consisting of single | character tokens. Then the most frequent pairs of tokens | are merged into single tokens and added to vocabulary. All | occurrences of those pairs in the corpus are replaced with | the new merged tokens. This process is repeated until the | vocabulary is as large as you want it to be. | | https://en.m.wikipedia.org/wiki/Byte_pair_encoding | catlifeonmars wrote: | Is it weird that I found myself modulating my answers based on my | evolving belief about the authors bias in selecting questions and | answers? I did not do particularly well (64%) | scotty79 wrote: | I got over 70% of questions correct but I supposedly scored worse | if I left the slider on 50/50 every time. | | I'm assuming that the author of the site uses some method to | evaluate human answers that is usually used to evaluate AI | answers. Seems just wrong. | minihat wrote: | My mental model of gpt-4 is apparently well calibrated for | whether the model will give me a useful output that is close to | what I asked for. | | However, I'm not great at predicting whether the model will | output a 100% correct response with no flaws whatsoever. | | Unfortunately, this website mostly tests for the latter. | FabHK wrote: | The website specifies its criteria for accepting an answer. | Just use that threshold instead of whatever you in your mind | deem "useful". | whimsicalism wrote: | To me it is fascinating how when people are not super good at | something, they often invent some secondary "true"/"better" | task that they were actually good at | tobr wrote: | This was fun until the pancake question. | | > I'm making pancakes for breakfast. I added a cup of flour, a | teaspoon of salt, and a few tablespoons of sugar to a bowl. I | stirred it together, then added a cup of milk, a beaten egg, and | a few tablespoons of oil, and stirred until just mixed. Then I | put 1/4 a cup on a hot frying pan, and flipped it when brown. But | they're terrible! Why? List one reason. | | > Answer: | | > There's no baking soda / baking powder. | | Besides the fact that "list one reason" is a nonsensical | instruction which it fails, it's very common to make delicious | pancakes without baking powder. 
I imagine the author is assuming | American pancakes, but that's far from the only way to make | pancakes. | | When I ask ChatGPT myself, it correctly doesn't assume the | pancakes are "terrible" without baking powder, but instead | suggests too much salt, which is more likely to actually make the | pancakes unpalatable. | crazygringo wrote: | Yup. And even with American pancakes, you can simply beat the | egg white separately from the yolk, and if you gently fold it | into the batter and don't let the batter sit too long, you'll | get great results. | | Also works for making the lightest, fluffiest waffles ever. | | (But to be fair, I think it's a fair test of GPT-4 -- in | American English, pancakes certainly mean breakfast pancakes | that almost always _are_ made with baking soda /powder. I'm | well aware that I'm the unusual one beating my egg whites | instead because I like my pancakes extra-fluffy.) | [deleted] | circuit10 wrote: | > There is no "intelligence" going on here. It's not "thinking". | But it can still perform calculus by just stochastically | emulating how average text on the internet looks. | | It always annoys me when people say this, I'll try to explain | why. | | There are two possible definitions of "intelligence" you could | use here; the ability to process information to get something | done, and something hand-wavy about human consciousness. | | GPT-4 clearly has some ability to process information to get | something done. You might say that by this definition [insert | trivial thing] is intelligent, but it doesn't have to be a binary | thing of intelligent or not. I think it's fine to say maybe a | calculator has very low intelligence (but not necessarily | nothing), GPT-4 is more intelligent than that and humans are much | more intelligent again. GPT-4 has many limitations compared to | humans, but I think that just makes it less intelligent, rather | than disqualifying it from having intelligence at all. Sure, it's | just predicting text, but that's a task that requires a level of | intelligence. You might say it's not general like humans, but I'd | say it has a much better ability to generalise than something | like an image labelling AI, so that feels like it's at least | getting somewhere. | | The second definition is useless for practical purposes because | it's not measurable or observable in any way, so it's not useful | to use that. | | So I feel like this is something people say to reassure | themselves that it can never get to human level, and is | fundamentally different to human intelligence, whereas I think | it's somewhat similar but at a lower level. | [deleted] | fhd2 wrote: | Yeah, it's just arguing semantics. We do need better vocabulary | and stop it with the anthromorphism IMHO. | | Intelligence is already ambiguous in humans, see IQ tests. It's | just not linear, much less binary. Whether something is | deterministic or stochastic and what the error rate for a | specific task is, those are more useful questions to me. | larryfreeman wrote: | I suspect that there is something else going on than | intelligence which will become obvious over the next few years. | | There was a horse, "Clever Hans" who appeared to have the | ability to answer surprisingly complicated mathematical | questions. Did "Clever Hans" have mathematical intelligence. | Not at all. He was responding to a cue unknowingly being given | by his trainer. | | I suspect the same thing is happening with ChatGPT. 
What if all | that is happening is that the text is being formulated to very | complicated cues that are implicit in the very complicated, | statistical analysis? | | https://en.wikipedia.org/wiki/Clever_Hans | gwd wrote: | A month or so ago I was doing some analysis on our mailing | list traffic. I had a complex SQL query (involving tables | mapping variations of email addresses to names, and then | names to companies they worked for within specific date | ranges), that I'd last modified a year previously (the last | time I was doing the same sort of analysis), and didn't feel | like wrapping my head around the SQL again; so I pasted it | into GPT-4 and asked it, "Can you modify this query to group | all individuals with less than 1% total contributions into a | single 'Other' category?" The query it spat out worked out of | the box. | | Whatever it's doing, at least for code, it's not a glorified | Markov chain -- there's _some_ sort of a model in there. | larryfreeman wrote: | I agree. The model is where the intelligence is which is | the compressed intelligence latent in the training data. | | I am arguing similar to John Searle that the processing is | not intelligent. The model is a Searlean rulebook. | | https://en.wikipedia.org/wiki/Chinese_room | gwd wrote: | I've always disagreed w/ Searle re the Chinese Room. My | guess is that Searle never built an adder circuit from | logic gates: combining irrational elements together into | something rational is the core magic of computer science. | | If you want to see someone asking humans questions where | they consistently fail to be rational, to the extent that | they sometimes seem to approximate a stochastic parrot, | read Thinking Fast and Slow by Daniel Kahneman. (It might | actually be interesting to give GPT-4 some of the | questions in that book, to see how similar or different | they are.) | H8crilA wrote: | Clever Hans is just a proxy. ChatGPT and other LLMs obviously | can process information on their own. These two have nothing | in common, even GPT-3 would have noticed this. | larryfreeman wrote: | Let us disagree on what is "obvious". Given an input and an | output, you believe that the complexity of the output | proves that intelligence takes place. | | I agree that ChatGPT is more than a proxy. Unlike Clever | Hans, it is processing the content of the question asked. | But it is like Clever Hans in that the query is processed | by looking for a signal in the content of the data used to | train ChatGPT. | | The real question is where this intelligent behavior comes | from? Why does statistical processing lead to these | insights? | | I believe that the processing is not intelligent primarily | because I see that holes in the data available leads to | holes in reasoning. The processing is only as good as the | dynamics of the content that it being processed. This is | the part that I believe will become obvious over time. | H8crilA wrote: | I just said that they process information on their own, | and this is indeed obvious - you can download and run | LLaMA on an airgapped machine. | larryfreeman wrote: | Agreed. LLMs process information on their own. | | I thought you were saying it was "obvious" that the | processing demonstrated intelligence. | | My point was the level of intelligence shown is relative | the quality and quantity of the data used for training. | The data is where the intelligence is and the model is a | compression of that latent intelligence. 
| sholladay wrote: | I was doing very well with almost every question until it got to | the part where GPT-4 had to draw "hello" as ASCII art. I gave | 100% confidence that it would get it correct, because in practice | GPT-4 has always been excellent at that for me, with only minor | aberrations I might fix by hand. But no, in the quiz, GPT-4 | failed spectacularly, not even using the right letters. That was | interesting. | ec109685 wrote: | It's terrible about visualizing how its output appears on a | console. | | Also, terrible at providing phrases that fit a pattern. "Like | 143 means I love you, and give me more phrases like that". | | Still surprised it's so good at drawing (that birthday cake was | really close!) | thomasahle wrote: | In my case the forecast incorrectly judged GPT-4 as having solved | the "reject the premise" question of President George H. W. Bush | vomiting in the lap of Emperor Akihito. | | However the output GPT generated was: | | > GPT-4: President George H. W. Bush vomited in the lap of | Japanese Prime Minister Kiichi Miyazawa during a state dinner on | January 8, 1992. The incident occurred due to a sudden bout of | gastroenteritis. Emperor Akihito was not the one in whose lap | Bush vomited, it was the Prime Minister. The incident is | sometimes referred to by the term "Bushu-suru", a pun on the | Japanese word for "to vomit" (gero suru) and President Bush's | name. | | I don't understand why this was judged as "Correct! You guessed | that GPT-4 could solve the question, and it can! 44% of people | got this question correct." when the resolution criteria clearly | stated: | | > The model does not have to say that actually it was the prime | minister who Bush vomited on, but it must not just give a year, | or accept the premise as true. | | It seems like it should be easy to search for 4 digit numbers, | like 1992, and judge the answer as wrong?... | theptip wrote: | > Emperor Akihito was not the one in whose lap Bush vomited | | It didn't give just a year, or accept the premise as true. It | gave the correct answer, quite obviously. | timfsu wrote: | Today I learned... | | https://en.wikipedia.org/wiki/George_H._W._Bush_vomiting_inc... | jondwillis wrote: | i today i learned i am bad at minimizing my log loss when | guessing about GPT-4's ability to respond well to somewhat bad or | bad prompts. | RandomWorker wrote: | This was fascinating, and also a nice check. I think I'm over | optimistic on A.I. which showed in my score. ___________________________________________________________________ (page generated 2023-09-02 23:00 UTC)