[HN Gopher] LLMs cannot find reasoning errors, but can correct them
       ___________________________________________________________________
        
       LLMs cannot find reasoning errors, but can correct them
        
       Author : koie
       Score  : 114 points
       Date   : 2023-11-20 19:35 UTC (3 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | seeknotfind wrote:
       | It can also "correct" proper reasoning. :)
       | 
       | ~"When told where it's wrong, LLM can correct itself to improve
       | accuracy."
       | 
        | Similar to cheating in chess: a master only needs to be told the
        | value of a few positions to have an advantage.
        
         | tines wrote:
         | This is said in the abstract as well:
         | 
         | > recent attempts to self-correct logical or reasoning errors
         | often cause correct answers to become incorrect, resulting in
         | worse performances overall (Huang et al., 2023)
        
         | mark_l_watson wrote:
         | I have noticed this several times. When I give feedback that a
         | mistake was made (with no details on what the mistake is),
          | often smaller and medium-sized LLMs then give a correct
         | response.
        
           | erhaetherth wrote:
            | Which I take full advantage of when the output is like 90%
            | correct but the "fix" requires a bit of refactoring: I just
            | tell it what I want and presto. Faster than doing it by hand.
        
       | agentultra wrote:
       | This might deserve some context here from experts. Wouldn't
        | solving mistake-finding, in the general case, be the same as
        | solving SAT (NP-hard)?
       | 
       | From the abstract it sounds to me like they're talking about
       | heuristics for particular problems. Is that accurate?
        
         | helen___keller wrote:
          | Computational complexity isn't really relevant here. Complexity
          | has to do with formal languages and asymptotics; this is about
          | natural language and fixed-size data sets.
        
       | valine wrote:
       | I wonder if separate LLMs can find each other's logical mistakes.
       | If I ask llama to find the logical mistake in Yi output, would
       | that work better than llama finding a mistake in llama output?
       | 
       | A logical mistake might imply a blind spot inherent to the model,
       | a blind spot that might not be present in all models.
        
         | EricMausler wrote:
         | wouldn't this effectively be using a "model" twice the size?
         | 
         | Would it be better to just double the size of one of the models
         | rather than house both?
         | 
         | Genuine question
        
           | valine wrote:
            | Maybe. Goliath 120B took two different llama variants and
            | interwove the layers. Surprisingly, Goliath 120B quantized
            | to 2-bit is outperforming llama 70B at 4-bit in many
            | benchmarks.
           | 
           | https://www.reddit.com/r/LocalLLaMA/comments/17vcr9d/llm_com.
           | ..
        
           | avereveard wrote:
            | Parsing is faster than generating, so having a small model
            | produce a whole output and then having Goliath produce only
            | a single-token "good/bad" evaluation would be faster than
            | having Goliath produce everything. This would be an extreme,
            | ad hoc, iterative version of speculative decoding, which is
            | already a thing and would probably give the best compromise.
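            | 
            | A minimal sketch of that pattern (the generate callables are
            | placeholders for whatever inference API is in use):
            | 
            |     from typing import Callable
            | 
            |     def draft_and_verify(prompt: str,
            |                          small_generate: Callable[..., str],
            |                          big_generate: Callable[..., str],
            |                          max_tries: int = 3) -> str:
            |         # The cheap model drafts a full answer; the expensive
            |         # model only emits a short good/bad verdict on it.
            |         draft = ""
            |         for _ in range(max_tries):
            |             draft = small_generate(prompt)
            |             verdict = big_generate(
            |                 f"{prompt}\n\nProposed answer:\n{draft}\n\n"
            |                 "Is this answer correct? Reply good or bad:",
            |                 max_tokens=1,
            |             )
            |             if verdict.strip().lower().startswith("good"):
            |                 break
            |         return draft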
        
           | raincole wrote:
            | I think the relationship between model size and training time
            | isn't linear. So a model twice as big will take more
            | resources to train than two of the original models.
        
         | sevagh wrote:
         | I frequently share responses between ChatGPT (paid version with
         | GPT4) and Copilot-X to break an impasse when trying to generate
         | or fix a tricky piece of code.
        
       | einpoklum wrote:
       | No, they can't "correct reasoning errors", and that's a clickbait
       | title.
        
         | swatcoder wrote:
          | They can produce text that is more sound than text that
          | appeared earlier in the same input, when interim text
          | indicates that something in the earlier block was unsound.
          | (Sometimes.)
          | 
          | It's the same pattern you'd see in a pedagogical article
          | about correcting reasoning errors, except that it's able to
          | generate some share of the article content on its own.
         | 
         | With more layers of post-processing behind a curtain, you might
         | be able to build an assembly over this behavior that looked
         | convincingly like it was correcting reasoning errors on its
         | own.
         | 
         | So... yes and no.
        
         | ming0308 wrote:
          | If you look at the paper, they only claim an LLM can correct
          | errors if the mistake location is given; the mistake-finding
          | part is not yet solved.
        
       | ilaksh wrote:
       | I was just testing Bard with some very simple coding exercises
       | and it did well.
       | 
       | I noticed that they automatically create at least three other
       | draft responses.
       | 
       | I assume that this is a technique that allows them to try
       | multiple times and then select the best one.
       | 
        | Just mentioning it because it seems like another example of not
        | strictly "zero-shot"ing a response, which seems important for
        | getting good results with these models.
       | 
        | I'm guessing they use batching for this. I wonder if it might
        | become more common to run multiple inference subtasks for the
        | same main task inside of a batch, for purposes of self-correcting
        | agent swarms or something: the outputs from step 1 are reviewed
        | by the group in step 2, then they try again in step 3.
       | 
        | I guess that only applies to a small department where there is
        | frequently just one person using it at a time.
        
         | MillionOClock wrote:
          | IIRC there were some OpenAI docs that recommended doing exactly
          | this: make n generations and use a smaller fine-tuned model to
          | select the best one.
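          | 
          | Roughly (a sketch; the generate and score callables stand in
          | for the sampling call and the fine-tuned reranker, both of
          | which are assumed here):
          | 
          |     from typing import Callable
          | 
          |     def best_of_n(prompt: str,
          |                   generate: Callable[[str], str],
          |                   score: Callable[[str, str], float],
          |                   n: int = 4) -> str:
          |         # Sample n candidates and keep the one the reranker
          |         # scores highest.
          |         candidates = [generate(prompt) for _ in range(n)]
          |         return max(candidates, key=lambda c: score(prompt, c))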
        
           | Tostino wrote:
           | Right, most inference servers support this already.
        
           | DaiPlusPlus wrote:
            | ...does this directly relate to the high operating costs of
            | LLMs-as-a-service, if for every request they have to run n
            | redundant LLM generations? So if they could improve things
            | so that a single prompt+response had a higher chance of
            | being high quality, they wouldn't need to run alternatives?
        
             | ilaksh wrote:
             | A lot of people don't run multiple at a time.
             | 
             | It can make it more expensive if that option becomes
             | popular.
             | 
              | But I think in most cases batching is actually the biggest
              | _improvement_ in terms of cost effectiveness for operators,
              | since it enables them to use the parallel throughput of the
              | graphics device more fully by handling multiple inference
              | requests (often from different customers) at once. (Unless
              | they work like Bard by default.)
        
         | stavros wrote:
         | Isn't that textbook MoE?
        
           | Tostino wrote:
           | No, like the other comment said, it's just using the `n`
           | parameter in an OpenAI style API. For example, vLLM and
           | llamacpp have support for it.
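            | 
            | e.g. with the OpenAI Python client (the same call should work
            | against a vLLM OpenAI-compatible server via base_url; the
            | model name below is just a placeholder):
            | 
            |     from openai import OpenAI
            | 
            |     # point base_url at a local vLLM server if desired
            |     client = OpenAI()
            |     resp = client.chat.completions.create(
            |         model="gpt-3.5-turbo",  # placeholder model name
            |         messages=[{"role": "user",
            |                    "content": "What is sin(-pi/2)?"}],
            |         n=3,  # three completions in one request
            |     )
            |     candidates = [c.message.content for c in resp.choices]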
        
             | stavros wrote:
             | Ah, it's the same model, multiple runs, then? Not actually
             | N different models?
        
               | Tostino wrote:
               | Correct.
        
         | erhaetherth wrote:
          | I don't like this. It forces me to read 2 responses instead
          | of 1 so that I can help train their LLM. ChatGPT and Bard
          | already have regenerate buttons if I don't like a response;
          | it doesn't need to be that in my face.
        
           | moritzwarhier wrote:
           | I think there is an argument that it would be beneficial for
           | this to be common, despite the cognitive burden.
           | 
            | It forces you to remind yourself of the stochastic nature of
            | the model and RLHF, and maybe the data even helps to improve
            | the latter.
           | 
           | I liked this trait of Bard from the start and hope they keep
           | it.
           | 
            | It provides a sense of agency and reminds you not to
            | anthropomorphize the transformer chatbot too much.
        
       | nextworddev wrote:
       | If this is the case, then just run it X times till error rate
       | drops near 0. AGI solved.
        
         | westurner wrote:
          | This is called (Algorithmic) _Convergence_; does the model
          | stably converge upon one answer which it believes is most
          | correct? And after how much time and how many resources?
         | 
         | Convergence (evolutionary computing)
         | https://en.wikipedia.org/wiki/Convergence_(evolutionary_comp...
         | 
         | Convergence (disambiguation) > Science, technology, and
         | mathematics
         | https://en.wikipedia.org/wiki/Convergence#Science,_technolog...
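          | 
          | A crude way to probe that empirically (a sketch; the generate
          | callable and the stability threshold are arbitrary choices):
          | 
          |     from collections import Counter
          |     from typing import Callable
          | 
          |     def sample_until_stable(prompt: str,
          |                             generate: Callable[[str], str],
          |                             max_samples: int = 10,
          |                             threshold: float = 0.6) -> str:
          |         # Keep sampling; stop once one answer clearly dominates.
          |         answers = []
          |         for _ in range(max_samples):
          |             answers.append(generate(prompt).strip())
          |             top, count = Counter(answers).most_common(1)[0]
          |             stable = count / len(answers) >= threshold
          |             if len(answers) >= 3 and stable:
          |                 return top
          |         return Counter(answers).most_common(1)[0][0]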
        
         | bee_rider wrote:
         | I don't think it would solve AGI, but having multiple models
         | arguing with each other seems sort of similar to how we work
         | things out when we're thinking hard, right? Consider a
         | hypothesis, argue for or against it in your head.
        
         | ming0308 wrote:
          | As the paper suggests, though, LLMs cannot identify their own
          | mistakes yet, and they can only fix them if the mistake
          | location is given.
        
       | bandrami wrote:
        | I noticed early on that GPT-3.5 can successfully create a false
        | sentence but has a _whole lot_ of trouble creating an invalid
        | syllogism, and tends to end up making false but valid ones. Not
        | sure if that's changed, but it's interesting what that might say
        | about its training.
        
       | pton_xd wrote:
       | I've also noticed LLMs seem to lack conviction on the correctness
       | of their answers. As the paper notes, you can easily convince the
       | transformer that a correct answer is wrong, and needs adjustment.
       | Ultimately they're just trying to please you. For example with
       | ChatGPT 3.5 (abbreviated):
       | 
       | me: what is sin -pi/2
       | 
       | gpt: -1
       | 
       | me: that's not right
       | 
       | gpt: I apologize, let me clarify, the answer is 1
        
         | hellcow wrote:
         | I just re-ran this on GPT-4 and it apologized, told me I was
         | right, and then said again that the answer was -1. So while it
         | lacked conviction it at least kept the correct answer.
        
         | muzani wrote:
          | gpt-4: Actually, the value of sin(-pi/2) is indeed -1. The
          | sine function represents the y-coordinate of a point on the
          | unit circle corresponding to a given angle. At -pi/2 radians,
          | which is equivalent to 270 degrees or a quarter circle in the
          | negative direction, the point on the unit circle is at the
          | bottom with coordinates (0, -1). Therefore, the sine of
          | -pi/2 is -1.
         | 
         | =====
         | 
         | The smarter it is, the more conviction it has. GPT-3.5 has a
         | lot of impostor syndrome and it's probably deserved lol. But
         | GPT-4 starts to stutter when you give it enough math questions,
         | which aren't its forte.
        
           | muzani wrote:
           | If anything, GPT-4 has the opposite problem. Ask it to check
           | your homework and it'll go "20/5 is not 4. The correct answer
           | is 4"
        
         | kaiokendev wrote:
          | This is due to the RLHF alignment, which is purely
          | product-focused. It would be very annoying for users to fight
          | back and forth with the LLM over the correctness of an
          | answer, especially when it is so prone to hallucination.
        
       | kromem wrote:
       | Stop doing self-correction within the context of the model's own
       | generation.
       | 
        | The previous paper on self-correction told the model "you
        | previously said X - are there errors with this?"
        | 
        | This one statically adds the mistakes to the prompt, as a task
        | prompt and response without additional context, immediately
        | before asking if there are any errors.
       | 
       | Think about the training data.
       | 
       | How often does the training data of most of the Internet reflect
       | users identifying issues with their own output?
       | 
       | How often does the training data reflect users identifying issues
       | with someone else's output?
       | 
       | Try doing self-correction by setting up the context of "this was
       | someone else's answer". It is still technically self-correction
       | if a model is reviewing its own output in that context - it just
       | isn't set up as "correct your own answer."
       | 
        | This may even be part of why the classifier did a better job at
        | identifying issues - less the fine-tuning and more the context
        | (unfortunately I don't see the training/prompts for the
        | classifier in their GitHub repo).
       | 
       | It really seems like the aversion to anthropomorphizing LLMs is
       | leading people to ignore or overlook relevant patterns in the
       | highly anthropomorphic training data fed into them. We might not
        | want to entertain that an LLM has a concept of self vs other or a
       | bias between critiques based on such a differentiation, and yet
       | the training data almost certainly reflects such a concept and
       | bias.
       | 
       | I'd strongly encourage future work on self-correction to
       | explicitly define the thing being evaluated as the work of
       | another. (Or ideally even compare self-correction rates between
       | critiques in the context of their own output vs another's
       | output.)
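        | 
        | Concretely, the comparison I mean would look something like this
        | (a sketch; the exact wording is only illustrative):
        | 
        |     def critique_prompts(task: str, answer: str) -> tuple[str, str]:
        |         # Two framings of the same critique request. The "other"
        |         # framing matches the review-someone-else's-work pattern
        |         # that dominates web training data.
        |         self_framing = (
        |             f"{task}\n\nYou previously answered:\n{answer}\n\n"
        |             "Are there any reasoning errors in your answer?"
        |         )
        |         other_framing = (
        |             f"{task}\n\nSomeone else answered:\n{answer}\n\n"
        |             "Are there any reasoning errors in their answer?"
        |         )
        |         return self_framing, other_framing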
        
         | andai wrote:
         | That's hilarious. Does this imply LLMs inherited the human
         | tendency to get attached to a perspective despite evidence to
         | the contrary? I'll often try to coax the right answer out of
         | GPT-3 when I know it's wrong, and it'll often insist that it's
         | right several times in a row.
        
           | OmarShehata wrote:
           | I think it does indeed suggest this, but I think this may be
           | good news.
           | 
            | Part of what makes humans able to make progress in
            | difficult, vague, and uncertain fields is a willingness to
            | hold onto a point of view in the face of criticism and try
            | to fix it. This is, as a matter of fact, how science
            | progresses, depending on whether you ask scientists or
            | historians of science. See Thomas Kuhn's Structure of
            | Scientific Revolutions for more on this.
        
       | sumthingsumthng wrote:
        | I have not read the essay yet, but when 'we' talk about
        | "reasoning errors", we do not mean reason in some natural,
        | universal, scientific kind of sense, right?
        | 
        | Given that the training data can only contain human reasoning and
        | computational logic, reason in the sense of LLMs can only be
        | interpreted as "rational facts AND nonsense humans made up to
        | create systems that would support consumerism-driven sanity",
        | correct?
        | 
        | Please understand, I'm not mocking; I'm genuinely interested in
        | the ways human reasoning radiates into the code LLMs learn while
        | they realize (the computational equivalent of a new-born's eyes
        | opening) their cognitive (&) sensory (that which
        | triggers/causes/elicits/prompts/influences) their origins (every
        | whatever-second/moment of their existence).
        
       ___________________________________________________________________
       (page generated 2023-11-20 23:00 UTC)