[HN Gopher] LLMs cannot find reasoning errors, but can correct them
___________________________________________________________________
 
LLMs cannot find reasoning errors, but can correct them
 
Author : koie
Score : 114 points
Date : 2023-11-20 19:35 UTC (3 hours ago)
 
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
 
| seeknotfind wrote:
| It can also "correct" proper reasoning. :)
|
| ~"When told where it's wrong, LLM can correct itself to improve accuracy."
|
| Similar to cheating in chess - a master only needs to be told the value of a few positions to have an advantage.
| tines wrote:
| This is said in the abstract as well:
|
| > recent attempts to self-correct logical or reasoning errors often cause correct answers to become incorrect, resulting in worse performances overall (Huang et al., 2023)
| mark_l_watson wrote:
| I have noticed this several times. When I give feedback that a mistake was made (with no details on what the mistake is), often smaller and medium-size LLMs then give a correct response.
| erhaetherth wrote:
| Which I take full advantage of when the output is like 90% correct but the "fix" requires a bit of refactoring: I just tell it what I want and presto. Faster than doing it by hand.
| agentultra wrote:
| This might deserve some context here from experts. Wouldn't solving mistake finding, in the general case, be the same as solving SAT (NP-hard)?
|
| From the abstract it sounds to me like they're talking about heuristics for particular problems. Is that accurate?
| helen___keller wrote:
| Computational complexity isn't really related here. Complexity has to do with formal languages and asymptotics; this is about natural language and fixed-size data sets.
| valine wrote:
| I wonder if separate LLMs can find each other's logical mistakes. If I ask llama to find the logical mistake in Yi output, would that work better than llama finding a mistake in llama output?
|
| A logical mistake might imply a blind spot inherent to the model, a blind spot that might not be present in all models.
| EricMausler wrote:
| Wouldn't this effectively be using a "model" twice the size?
|
| Would it be better to just double the size of one of the models rather than house both?
|
| Genuine question.
| valine wrote:
| Maybe. Goliath 120B took two different llama variants and interwove the layers. Surprisingly, Goliath 120B quantized to 2-bit is outperforming llama 70B 4-bit in many benchmarks.
|
| https://www.reddit.com/r/LocalLLaMA/comments/17vcr9d/llm_com...
| avereveard wrote:
| Parsing is faster than generating, so having a small model produce a whole output and then having Goliath only produce a "good/bad" single-token evaluation would be faster than having Goliath produce everything. This would be the extreme, ad hoc, and iterative version of speculative decoding, which is already a thing and would probably give the best compromise.
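A minimal sketch of the draft-and-verify setup described in the comment above, assuming an OpenAI-compatible server; the model names, base_url, and prompt wording are illustrative placeholders, not anything taken from the paper or the thread.

    # Sketch: a small model drafts the full answer; a large model only
    # emits a one-token good/bad verdict, which is cheaper than having
    # it generate the whole answer itself.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    def draft_and_verify(question: str) -> str:
        # Cheap model produces the full candidate answer.
        draft = client.chat.completions.create(
            model="small-model",  # placeholder name
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content

        # Expensive model is asked only for a single-word verdict.
        verdict = client.chat.completions.create(
            model="large-model",  # placeholder name
            messages=[{
                "role": "user",
                "content": f"Question: {question}\nAnswer: {draft}\n"
                           "Reply with exactly one word: good or bad.",
            }],
            max_tokens=1,
        ).choices[0].message.content.strip().lower()

        return draft if verdict == "good" else "needs retry"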
| raincole wrote:
| I think the relationship between model size and training time isn't linear. So if you want a model twice as big, it'll take more resources to train than two of the original models.
| sevagh wrote:
| I frequently share responses between ChatGPT (paid version with GPT-4) and Copilot-X to break an impasse when trying to generate or fix a tricky piece of code.
| einpoklum wrote:
| No, they can't "correct reasoning errors", and that's a clickbait title.
| swatcoder wrote:
| They can produce text that is more sound than prior text that appeared earlier in the same input, when interim text indicates that something in the earlier block was unsound. (Sometimes.)
|
| It's the same pattern you'd see in a pedagogical article about correcting reasoning errors, except that it's able to generate some share of the article content on its own.
|
| With more layers of post-processing behind a curtain, you might be able to build an assembly over this behavior that looked convincingly like it was correcting reasoning errors on its own.
|
| So... yes and no.
| ming0308 wrote:
| If you look at the paper, they only claim the LLM can correct the errors if the mistake location is given. And the mistake-finding part is not yet solved.
| ilaksh wrote:
| I was just testing Bard with some very simple coding exercises and it did well.
|
| I noticed that they automatically create at least three other draft responses.
|
| I assume that this is a technique that allows them to try multiple times and then select the best one.
|
| Just mentioning it because it seems like another example of not strictly "zero-shot"ing a response, which seems important for getting good results with these models.
|
| I'm guessing they use batching for this. I wonder if it might become more common to run multiple inference subtasks for the same main task inside of a batch, for purposes of self-correcting agent swarms or something. The outputs from step one are reviewed by the group in step 2, then they try again in step 3.
|
| I guess that only applies for a small department where there is frequently just one person using it at a time.
| MillionOClock wrote:
| IIRC there were some OpenAI docs that recommended doing exactly this: make n generations and use a smaller fine-tuned model to select the best one.
| Tostino wrote:
| Right, most inference servers support this already.
| DaiPlusPlus wrote:
| ...does this directly relate to the high operating costs of LLMs-as-a-service, if for every request they have to run n-many redundant LLM requests? So if they could improve things so that a single prompt/request+response has a higher chance of being high-quality, they wouldn't need to run alternatives?
| ilaksh wrote:
| A lot of people don't run multiple at a time.
|
| It can make it more expensive if that option becomes popular.
|
| But I think in most cases batching is actually the biggest _improvement_ in terms of cost effectiveness for operators, since it enables them to use the parallel throughput of the graphics device more fully by handling multiple inference requests (often from different customers) at once. (Unless they work like Bard by default.)
| stavros wrote:
| Isn't that textbook MoE?
| Tostino wrote:
| No, like the other comment said, it's just using the `n` parameter in an OpenAI style API. For example, vLLM and llamacpp have support for it.
| stavros wrote:
| Ah, it's the same model, multiple runs, then? Not actually N different models?
| Tostino wrote:
| Correct.
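A rough sketch of the best-of-n pattern discussed in the sub-thread above: one request with the `n` parameter against an OpenAI-style API, followed by a selection step. The model name, prompt, and judge call are illustrative assumptions; a smaller fine-tuned ranker could replace the second call.

    # Sketch: request n candidates in one call, then pick the best one.
    # Works against OpenAI's API or OpenAI-compatible servers (e.g. vLLM).
    from openai import OpenAI

    client = OpenAI()  # or OpenAI(base_url=...) for a local server

    prompt = "Summarize the benefits of batched inference in one sentence."

    # One request, several candidate completions generated in the same batch.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        n=4,
        temperature=0.9,
    )
    candidates = [c.message.content for c in resp.choices]

    # Selection step: a fine-tuned ranker could go here; a second LLM call
    # acting as judge is the simplest stand-in.
    numbered = "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
    pick = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "Candidates:\n" + numbered +
                              "\nReply with only the number of the best candidate."}],
    ).choices[0].message.content.strip()

    best_index = int(pick) if pick.isdigit() else 0  # fall back to the first draft
    print(candidates[best_index])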
| erhaetherth wrote:
| I don't like this. It forces me to read 2 prompts instead of 1 so that I can help train their LLM. ChatGPT and Bard already have regenerate buttons if I don't like their response; it doesn't need to be that in my face.
| moritzwarhier wrote:
| I think there is an argument that it would be beneficial for this to be common, despite the cognitive burden.
|
| It forces you to remind yourself of the stochastic nature of the model and RLHF; maybe the data even helps to improve the latter.
|
| I liked this trait of Bard from the start and hope they keep it.
|
| It provides a sense of agency and reminds you not to anthropomorphize the transformer chatbot too much.
| nextworddev wrote:
| If this is the case, then just run it X times till the error rate drops near 0. AGI solved.
| westurner wrote:
| This is called (Algorithmic) _Convergence_; does the model stably converge upon one answer which it believes is most correct? After how much time, and how many resources?
|
| Convergence (evolutionary computing): https://en.wikipedia.org/wiki/Convergence_(evolutionary_comp...
|
| Convergence (disambiguation) > Science, technology, and mathematics: https://en.wikipedia.org/wiki/Convergence#Science,_technolog...
| bee_rider wrote:
| I don't think it would solve AGI, but having multiple models arguing with each other seems sort of similar to how we work things out when we're thinking hard, right? Consider a hypothesis, argue for or against it in your head.
| ming0308 wrote:
| As the paper suggested, LLMs cannot identify their own mistakes yet, though. And they can only fix their mistakes if the mistake location is given.
| bandrami wrote:
| I noticed early on that GPT-3.5 can successfully create a false sentence but has a _whole lot_ of trouble creating an invalid syllogism, and tends to end up making false but valid ones. Not sure if that's changed, but it's interesting what that might say about its training.
| pton_xd wrote:
| I've also noticed LLMs seem to lack conviction about the correctness of their answers. As the paper notes, you can easily convince the transformer that a correct answer is wrong and needs adjustment. Ultimately they're just trying to please you. For example, with ChatGPT 3.5 (abbreviated):
|
| me: what is sin -pi/2
|
| gpt: -1
|
| me: that's not right
|
| gpt: I apologize, let me clarify, the answer is 1
| hellcow wrote:
| I just re-ran this on GPT-4 and it apologized, told me I was right, and then said again that the answer was -1. So while it lacked conviction, it at least kept the correct answer.
| muzani wrote:
| gpt-4: Actually, the value of \\(\sin(-\pi/2)\\) is indeed \\(-1\\). The sine function represents the y-coordinate of a point on the unit circle corresponding to a given angle. At \\(-\pi/2\\) radians, which is equivalent to 270 degrees or a quarter circle in the negative direction, the point on the unit circle is at the bottom with coordinates (0, -1). Therefore, the sine of \\(-\pi/2\\) is \\(-1\\).
|
| =====
|
| The smarter it is, the more conviction it has. GPT-3.5 has a lot of impostor syndrome and it's probably deserved lol. But GPT-4 starts to stutter when you give it enough math questions, which aren't its forte.
| muzani wrote:
| If anything, GPT-4 has the opposite problem. Ask it to check your homework and it'll go "20/5 is not 4. The correct answer is 4"
| kaiokendev wrote:
| This is due to the RLHF alignment, which is only product-focused. It would be very annoying for users to fight back and forth with the LLM on the correctness of the answer, especially when it is so prone to hallucination.
| kromem wrote:
| Stop doing self-correction within the context of the model's own generation.
|
| The previous paper on self-correction told the model "you previously said X - are there errors with this?"
|
| This one has the mistakes statically added to the prompt, in a task prompt and response without additional context, immediately before asking if it has any errors.
|
| Think about the training data.
|
| How often does the training data of most of the Internet reflect users identifying issues with their own output?
|
| How often does the training data reflect users identifying issues with someone else's output?
|
| Try doing self-correction by setting up the context of "this was someone else's answer". It is still technically self-correction if a model is reviewing its own output in that context - it just isn't set up as "correct your own answer."
|
| This may even be part of why the classifier did a better job at identifying issues - less the fine-tuning and more the context (unfortunately I don't see the training/prompts for the classifier in their GitHub repo).
|
| It really seems like the aversion to anthropomorphizing LLMs is leading people to ignore or overlook relevant patterns in the highly anthropomorphic training data fed into them. We might not want to entertain that an LLM has a concept of self vs. other, or a bias between critiques based on such a differentiation, and yet the training data almost certainly reflects such a concept and bias.
|
| I'd strongly encourage future work on self-correction to explicitly define the thing being evaluated as the work of another. (Or ideally even compare self-correction rates between critiques in the context of their own output vs. another's output.)
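A small sketch of the reframing proposed in the comment above: the same answer is critiqued either as the model's own prior output or as "someone else's" answer. The prompt wording and model name are assumptions for illustration, not the prompts used in the paper.

    # Sketch: critique the same answer under two framings ("your previous
    # answer" vs. "another assistant's answer") to compare mistake-finding.
    # Prompt wording and model name are illustrative only.
    from openai import OpenAI

    client = OpenAI()

    def critique(question: str, answer: str, as_other: bool) -> str:
        framing = ("Another assistant gave the following answer. "
                   if as_other else
                   "You previously gave the following answer. ")
        resp = client.chat.completions.create(
            model="gpt-4",  # placeholder model name
            messages=[{"role": "user",
                       "content": (framing +
                                   "Point out any reasoning errors in it.\n\n"
                                   f"Question: {question}\nAnswer: {answer}")}],
        )
        return resp.choices[0].message.content

    # Running both framings over the same outputs would test whether the
    # "someone else's answer" setup surfaces more genuine mistakes.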
| andai wrote:
| That's hilarious. Does this imply LLMs inherited the human tendency to get attached to a perspective despite evidence to the contrary? I'll often try to coax the right answer out of GPT-3 when I know it's wrong, and it'll often insist that it's right several times in a row.
| OmarShehata wrote:
| I think it does indeed suggest this, but I think this may be good news.
|
| Part of what makes humans able to make progress in difficult, vague, and uncertain fields is a willingness to hold onto a point of view in the face of criticism to try & fix it. This is, as a matter of fact, how science progresses, depending on whether you ask scientists or historians of science. See Thomas Kuhn's Structure of Scientific Revolutions for more on this.
| sumthingsumthng wrote:
| I have not read the essay yet, but when 'we' talk about "reasoning errors", we do not mean reason in some natural, universal, scientific kind of sense, right?
|
| Given that the training data can only contain human reasoning and computational logic, reason in the sense of LLMs can only be interpreted as "rational facts AND nonsense humans made up to create systems that would support consumerism-driven sanity", correct?????
|
| Please understand, I'm not mocking; I'm genuinely interested in the ways human reasoning radiates into the code LLMs learn while they realize (the computational equivalent of a new-born's eyes opening) their cognitive (&) sensory (that which triggers/causes/elicits/prompts/influences) origins (every whatever-second/moment of their existence).
___________________________________________________________________
(page generated 2023-11-20 23:00 UTC)