[HN Gopher] Teaching Large Language Models to Self-Debug
___________________________________________________________________

Teaching Large Language Models to Self-Debug

Author : saurabh20n
Score  : 36 points
Date   : 2023-04-12 20:29 UTC (2 hours ago)

(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)

| ulrikhansen54 wrote:
| 'Unsupervised reinforcement learning' is how these large models and systems will ultimately end up becoming sentient. We recently tried a similar approach on a toy problem in the computer vision sphere (https://encord.com/blog/we-employed-chatgpt-as-an-ml-enginee...) with pretty decent results.

| Buttons840 wrote:
| Ah, we're starting to bootstrap.
|
| For decades in reinforcement learning we've had Q-learning, which promises to solve _any_ optimization problem _if only_ we can build a powerful enough function approximator. It can even learn off-policy, meaning it can just watch from the sidelines and find the optimal solution. It works for toy problems, and it works in theory (there are even formal proofs that it will work given infinite time and resources), and yet in practice it often becomes unstable and collapses.
|
| Supervised learning is one thing; having a model remain stable while bootstrapping through a complex environment is another. GPT is supervised learning, so far. Let's see if it can bootstrap.

| cs702 wrote:
| _In hindsight_, it's the most natural, most obvious next step to get LLMs to write better code:
|
| Explain to them how to debug and fix the code they've written.
|
| Which is pretty much what you would do with an inexperienced human software developer.
|
| Looking at this with fresh eyes, it's both _shocking_ to me that this sort of thing is even possible, and yet also _completely unsurprising_ as yet another emergent capability of LLMs.
|
| We live in interesting times.

| og_kalu wrote:
| Not too shocking for me after this paper: https://arxiv.org/abs/2211.09066
|
| You can teach GPT-3 arithmetic - https://imgur.com/a/w3DAYOi
|
| Basically 100% accuracy up to about 13-digit addition and >90% after that.
|
| What else can you teach GPT without changing weights?

| mirashii wrote:
| > 100% accuracy up to about 13 digit addition
|
| The graphs you just posted do not support that; they'd support at most 100% accuracy up to 4 digits.

| sharemywin wrote:
| it's GPT so 13=4

| og_kalu wrote:
| It's 100% at 13 and extremely close to it prior to that. Maybe "basically 100%" is better.

| matisseverduyn wrote:
| Useful, but I still wouldn't count on it.
|
| With respect to GPT etc. as a copilot, the current dialogue seems to focus on "ask GPT to generate code to do X", then "just paste in the error message to fix bugs in the code GPT generates".
|
| A.) Why is GPT generating code that results in simple compiler errors in the first place (which is why GPT probably shouldn't be used to generate any code / replace devs for real projects yet), and
|
| B.) error messages are (just guessing here) probably <1% of the actual errors in most codebases.
|
| I personally know of a few large companies laying off devs over this.
|
| IMO, the tech debt we're going to see in 6 months will probably be huge.
|
| Now is a good time to start a staffing agency of human experts who can come in and fix this type of problem (extricating massive amounts of GPT-generated code without starting from scratch), because there will be a bunch of fires to put out and those fires will be worth $.

| viscanti wrote:
| If an LLM hallucinates lines of code that can't even compile, I suppose it could also hallucinate logic issues, which are more difficult to track down.

| matisseverduyn wrote:
| Definitely. QA at a snail's pace should still be the focus here for a while, but that's not what I'm observing in the real world. Just rush, pressure, layoffs. At least this sort of behavior keeps humans employed long-term.

| david2ndaccount wrote:
| > I personally know of a few large companies laying off devs over this.
|
| They're laying people off and replacing them with ChatGPT-generated code? That seems... aggressive. Or are they laying off devs who copy-pasted GPT-generated code?

| matisseverduyn wrote:
| Replacing devs with LLMs.

| blondin wrote:
| Color me skeptical. What are those large companies that are replacing devs with LLMs?

| ratg13 wrote:
| You can't replace devs with LLMs, because someone who knows what they are doing still needs to put it all together.
|
| You can only make employees more productive. This in turn could, in theory, lessen the need for developers in the long run, but it assumes the company will not bother to use the extra bandwidth for other projects.

| broast wrote:
| I think it's more natural than you might think. For example, my company laid off a lot of people to try to be profitable, and now they pay me more but I have a smaller team with tighter deadlines. I have no choice but to use GPT for a lot of my analysis, design, and code, which I've gotten pretty used to over the past year in my hobby time.
|
| The way I see it, if you code without it, you won't compete with the speed and value.
|
| And they are not going to backfill those roles.

| sdfghswe wrote:
| My company recently hired someone who I'm absolutely convinced can't code and produces all their code by copy-pasting into/from ChatGPT. I absolutely think they should be fired; it's not even aggressive, it's just common sense. First, it means they cheated on their coding interview. Second, it means their code is consistently a pile of shit.

| Imnimo wrote:
| I'd be curious to know if having few-shot prompts that demonstrate making mistakes and then correcting them causes the model to make more initial mistakes, so that it has something to correct.
|
| As far as the model is concerned, how can it distinguish between the task being "do your best, but if you do make an error, correct it" and "make some mistakes like in this example and then fix them"?

| alecco wrote:
| Three Google researchers using OpenAI's GPT-3 code-davinci-002, interesting.

| ftxbro wrote:
| > "We evaluate SELF-DEBUGGING on code-davinci-002 in the GPT-3 model family"
|
| Putting aside the incongruity of Google researchers using the OpenAI model, I'm curious how GPT-4 would do in this situation. Probably its zero-shot attempts at coding would be better, and maybe its self-criticisms would be better too.

| civilized wrote:
| I've done several experiments (and posted results in previous HN comments) where I've given GPT puzzles or brainteasers and asked it to review aspects of its answers Socratically.
| Never telling it it got anything wrong, just "you said A, then you said B, does that make sense?"
|
| It usually does notice inconsistencies between A and B when asked this. But its ways of reconciling inconsistencies can be bizarre and suggest a very superficial understanding of concepts.
|
| For example, it once reconciled an inconsistency by saying that, yes, 2 * 2 = 4, but if you multiply both sides of that equation by a big number, that's no longer true.
|
| I will be super impressed the day we have a model that can read an arithmetic textbook and come out with reliable arithmetic skills.

| sharemywin wrote:
| In computer logic you would get an undefined result if the number was large enough.

| civilized wrote:
| It doesn't work with numbers as computer numbers, though. It works with them as decimal digit strings, just like humans do.

| Paul-Craft wrote:
| [dead]

| faizshah wrote:
| I have run into the same issue when using it for coding. It can easily debug simple code, but for tools like Bazel I went down a rabbit hole for 2 hours of letting it debug an error and fail every time; even with chain of thought, it had a very shallow understanding of the issue. Eventually I had to debug it myself.
___________________________________________________________________
(page generated 2023-04-12 23:01 UTC)
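
The loop the thread keeps circling (generate code, run it, paste the error back into the prompt, retry) can be sketched in a few lines. This is a minimal illustration under assumptions: `llm` stands in for whatever completion client is available, and the prompt wording is made up here, not taken from the paper.

    import traceback
    from typing import Callable

    def self_debug(task: str, test_snippet: str,
                   llm: Callable[[str], str], max_attempts: int = 3) -> str:
        """Ask `llm` for code, run it against `test_snippet`, and feed any
        traceback back into the prompt until the tests pass or attempts run out.

        `llm` is any prompt-in, code-out callable (an API client wrapper, a
        local model, ...). It is assumed to return Python source only.
        """
        prompt = (f"Write Python code for the following task:\n{task}\n"
                  "Return only code, no explanation.")
        for _ in range(max_attempts):
            code = llm(prompt)
            namespace: dict = {}
            try:
                exec(code, namespace)          # define the candidate code
                exec(test_snippet, namespace)  # assertions raise on failure
                return code                    # tests passed: accept this attempt
            except Exception:
                error = traceback.format_exc()
                # Self-debugging step: show the model its own code and the
                # resulting error so the next attempt can correct it.
                prompt = (f"{prompt}\n\nYour previous attempt was:\n{code}\n"
                          f"Running it produced this error:\n{error}\n"
                          "Return a corrected version, code only.")
        raise RuntimeError(f"No passing solution after {max_attempts} attempts")

A variant closer to what the paper studies is to have the model explain its own code in natural language (rubber-duck style) and condition the next attempt on that explanation, so the loop can work even when no error message or failing test is available.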