[HN Gopher] Teaching Large Language Models to Self-Debug
       ___________________________________________________________________
        
       Teaching Large Language Models to Self-Debug
        
       Author : saurabh20n
       Score  : 36 points
       Date   : 2023-04-12 20:29 UTC (2 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | ulrikhansen54 wrote:
        | 'Unsupervised reinforcement learning' is how these large models
        | and systems will ultimately become sentient. We recently
       | tried a similar approach on a toy problem in the computer vision
       | sphere (https://encord.com/blog/we-employed-chatgpt-as-an-ml-
       | enginee...) with pretty decent results.
        
       | Buttons840 wrote:
       | Ah we're starting to bootstrap.
       | 
       | For decades in reinforcement learning we've had Q learning, which
       | promises to solve _any_ optimization problem _if only_ we can
       | build a powerful enough function approximator. It can even learn
       | off-policy, meaning it can just watch from the sideline and find
        | the optimal solution. It works for toy problems, and it works in
        | theory; there are even formal proofs that it will converge given
        | infinite time and resources. And yet in practice it often becomes
        | unstable and collapses.
       | 
        | Supervised learning is one thing; having a model remain stable
        | while bootstrapping through a complex environment is another. GPT
        | is supervised learning, so far. Let's see if it can bootstrap.
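        | 
        | For reference, the tabular version of the update is tiny (a rough
        | Python sketch, names and constants mine, not from the paper); the
        | trouble starts once the lookup table is replaced by a neural
        | network function approximator:
        | 
        |     import random
        |     from collections import defaultdict
        | 
        |     # Tabular Q-learning update (illustrative sketch only):
        |     # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        |     ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1
        |     Q = defaultdict(float)   # (state, action) -> estimated value
        | 
        |     def act(state, actions):
        |         # Epsilon-greedy behavior policy: mostly exploit,
        |         # occasionally explore.
        |         if random.random() < EPSILON:
        |             return random.choice(actions)
        |         return max(actions, key=lambda a: Q[(state, a)])
        | 
        |     def update(state, action, reward, next_state, actions):
        |         # Off-policy target: bootstrap from the greedy next
        |         # action, whatever the behavior policy actually does.
        |         best_next = max(Q[(next_state, a)] for a in actions)
        |         Q[(state, action)] += ALPHA * (
        |             reward + GAMMA * best_next - Q[(state, action)])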
        
       | cs702 wrote:
        | _In hindsight_, it's the most natural, most obvious next step to
       | get LLMs to write better code:
       | 
       | Explain to them how to debug and fix the code they've written.
       | 
       | Which is pretty much what you would do with an inexperienced
       | human software developer.
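        | 
        | A rough sketch of what that loop looks like mechanically (the
        | llm() helper is a hypothetical stand-in, and these are not the
        | paper's actual prompts):
        | 
        |     import subprocess
        |     import sys
        |     import tempfile
        | 
        |     def llm(prompt: str) -> str:
        |         """Hypothetical stand-in for whatever LLM API you use."""
        |         raise NotImplementedError
        | 
        |     def run_code(code: str) -> str:
        |         """Run the candidate; return stderr ('' means success)."""
        |         with tempfile.NamedTemporaryFile(
        |                 "w", suffix=".py", delete=False) as f:
        |             f.write(code)
        |             path = f.name
        |         proc = subprocess.run([sys.executable, path],
        |                               capture_output=True, text=True)
        |         return proc.stderr
        | 
        |     def self_debug(task: str, max_rounds: int = 3) -> str:
        |         code = llm(f"Write a Python program that does:\n{task}")
        |         for _ in range(max_rounds):
        |             feedback = run_code(code)
        |             if not feedback:      # ran cleanly, accept it
        |                 return code
        |             # Feed the failure back; ask for explanation + fix.
        |             code = llm(f"This code fails with:\n{feedback}\n"
        |                        f"Explain the bug, then give a corrected "
        |                        f"version.\n\n{code}")
        |         return code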
       | 
       | Looking at this with fresh eyes, it's both _shocking_ to me that
       | this sort of thing is even possible, and yet also _completely
       | unsurprising_ as yet another emergent capability of LLMs.
       | 
       | We live in interesting times.
        
         | og_kalu wrote:
         | Not too shocking for me after this paper.
         | https://arxiv.org/abs/2211.09066
         | 
         | You can teach GPT-3 arithmetic - https://imgur.com/a/w3DAYOi
         | 
          | Basically 100% accuracy up to about 13-digit addition and >90%
          | after that.
         | 
          | What else can you teach GPT without changing its weights?
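          | 
          | Roughly the flavor of the technique, as I understand the
          | paper: spell out the algorithm digit by digit, carries and
          | all, in the few-shot examples. (An illustrative sketch, not
          | their exact prompt format.)
          | 
          |     # Illustrative "algorithmic" few-shot prompt that spells
          |     # out digit-by-digit addition with explicit carries.
          |     EXAMPLE = """\
          |     Q: 128 + 367
          |     A: Add right to left, tracking the carry.
          |        8 + 7 = 15 -> write 5, carry 1
          |        2 + 6 + 1 = 9 -> write 9, carry 0
          |        1 + 3 + 0 = 4 -> write 4, carry 0
          |        Answer: 495
          |     """
          | 
          |     def make_prompt(a: int, b: int) -> str:
          |         # Prepend the worked example so the model imitates
          |         # the written-out procedure.
          |         return f"{EXAMPLE}\nQ: {a} + {b}\nA:"
          | 
          |     print(make_prompt(90210, 31337))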
        
           | mirashii wrote:
           | > 100% accuracy up to about 13 digit addition
           | 
            | The graphs you just posted do not support that; they'd
            | support at most 100% accuracy up to 4 digits.
        
             | sharemywin wrote:
             | it's GPT so 13=4
        
             | og_kalu wrote:
              | It's 100% at 13 digits and extremely close to that before
              | it. Maybe "basically 100%" is the better phrasing.
        
       | matisseverduyn wrote:
       | Useful, but still wouldn't count on it.
       | 
        | With respect to GPT etc. as a copilot, the current dialogue seems
        | to focus on "ask GPT to generate code to do X" and then "just
        | paste in the error message to fix bugs in the code GPT generates."
       | 
        | A.) GPT is generating code that results in simple compiler errors
        | in the first place (which is why it probably shouldn't be used to
        | generate any code or replace devs on real projects yet), and
       | 
       | B.) error messages are (just guessing here) probably <1% of the
       | actual errors in most codebases.
       | 
       | I personally know of a few large companies laying off devs over
       | this.
       | 
        | IMO, the tech debt we're going to see in 6 months will probably
        | be huge. Now would be a good time to start a staffing agency of
        | human experts who can come in and fix this type of problem
        | (extricating massive amounts of GPT-generated code without
        | starting from scratch), because there will be a bunch of fires to
        | put out, and those fires will be worth $.
        
         | viscanti wrote:
         | If an LLM hallucinates lines of code that can't even compile, I
         | suppose it could also hallucinate logic issues which are more
         | difficult to track down.
        
           | matisseverduyn wrote:
            | Definitely. QA at a snail's pace should still be the focus
           | here for a while, but that's not what I'm observing in the
           | real world. Just rush, pressure, layoffs. At least this sort
           | of behavior keeps humans employed long-term.
        
         | david2ndaccount wrote:
         | > I personally know of a few large companies laying off devs
         | over this.
         | 
            | They're laying people off and replacing them with ChatGPT-
            | generated code? That seems... aggressive. Or are they laying
            | off devs who copy-pasted GPT-generated code?
        
           | matisseverduyn wrote:
           | Replacing devs with LLMs.
        
             | blondin wrote:
             | color me skeptical. what are those large companies that are
             | replacing devs with LLMs?
        
               | ratg13 wrote:
               | You can't replace devs with LLMs because someone that
               | knows what they are doing still needs to put it all
               | together.
               | 
                | You can only make employees more productive. This in
                | turn could, in theory, lessen the need for developers in
                | the long run, but that assumes the company won't bother
                | to use the extra bandwidth for other projects.
        
               | broast wrote:
               | I think it's more natural than you might think. For
               | example, my company laid off a lot of people to try to be
               | profitable, and now they pay me more but I have a smaller
                | team with tighter deadlines. I have no choice but to use
                | GPT for a lot of my analysis, design, and code, which
                | I've gotten pretty used to over the past year in my hobby
                | time.
               | 
               | The way I see it, if you code without it, you won't
               | compete with the speed and value.
               | 
                | And they are not going to backfill those roles.
        
           | sdfghswe wrote:
            | My company recently hired someone who I'm absolutely
            | convinced can't code and produces all their code by copy-
            | pasting into/from ChatGPT. I absolutely think they should be
            | fired; it's not even aggressive, it's just common sense.
            | First, it means they cheated on their coding interview.
            | Second, it means their code is consistently a pile of shit.
        
       | Imnimo wrote:
       | I'd be curious to know if having few-shot prompts that
       | demonstrate making mistakes and then correcting them causes the
       | model to make more initial mistakes so that it has something to
       | correct.
       | 
        | Like, as far as the model is concerned, how can it distinguish
        | between the task being "do your best, but if you do make an
        | error, correct it" and "make some mistakes like in this example
        | and then fix them"?
        
       | alecco wrote:
        | 3 Google researchers using OpenAI's GPT-3 code-davinci-002,
        | interesting.
        
       | ftxbro wrote:
       | > "We evaluate SELF-DEBUGGING on code-davinci-002 in the GPT-3
       | model family"
       | 
       | Putting aside the incongruity of Google researchers using the
       | OpenAI model, I'm curious how GPT-4 would do in this situation.
        | Probably its zero-shot attempts at coding would be better, and
        | maybe its self-criticisms would be better too.
        
       | civilized wrote:
       | I've done several experiments (and posted results in previous HN
       | comments) where I've given GPT puzzles or brainteasers and asked
        | it to review aspects of its answers Socratically. Never telling
        | it that it got anything wrong, just asking "you said A, then you
        | said B, does that make sense?"
       | 
       | It usually does notice inconsistencies between A and B when asked
       | this. But its ways of reconciling inconsistencies can be bizarre
       | and suggest a very superficial understanding of concepts.
       | 
       | For example, it once reconciled an inconsistency by saying that,
       | yes, 2 * 2 = 4, but if you multiply both sides of that equation
       | by a big number, that's no longer true.
       | 
       | I will be super impressed the day we have a model that can read
       | an arithmetic textbook and come out with reliable arithmetic
       | skills.
        
         | sharemywin wrote:
          | In computer arithmetic you would get an overflow (undefined
          | behavior, for signed integers in C) if the number were large
          | enough.
        
           | civilized wrote:
           | It doesn't work with numbers as computer numbers though. It
           | works with them as decimal digit strings, just like humans
           | do.
        
         | faizshah wrote:
          | I have run into the same issue when using it for coding. It can
          | easily debug simple code, but with tools like Bazel I went down
          | a rabbit hole for 2 hours letting it debug an error, and it
          | failed every time. Even with chain-of-thought prompting it had
          | a very shallow understanding of the issue. Eventually I had to
          | debug it myself.
        
       ___________________________________________________________________
       (page generated 2023-04-12 23:01 UTC)