[HN Gopher] How RLHF Works
___________________________________________________________________

How RLHF Works

Author : natolambert
Score  : 122 points
Date   : 2023-06-21 14:21 UTC (8 hours ago)

(HTM) web link (www.interconnects.ai)
(TXT) w3m dump (www.interconnects.ai)

| RicDan wrote:
| The problem with this is that it leads the algorithm to target
| outputs that merely sound good to humans. That's why it's bad and
| won't help us; it should also be able to say "sorry, I don't know
| that", but for that it needs to actually be smart.

| m00x wrote:
| It can be weighted to be more honest when it doesn't know, if
| those answers are the ones picked by the labelers.

| dr_dshiv wrote:
| Need smarter labelers.

| cubefox wrote:
| Honesty/truthfulness is indeed a difficult problem with any kind
| of fine-tuning. There is no way to incentivize the model to say
| what it believes to be true rather than what human raters would
| regard as true. Future models could become actively deceptive.

| noam_compsci wrote:
| Not very good. I just want a step-by-step, ultra-high-level
| explanation:
| 1. Build a model.
| 2. Run it ten times.
| 3. Get humans to do xyz until result abc.

| pestatije wrote:
| RLHF - reinforcement learning from human feedback

| 1bent wrote:
| Thank you!

| cylon13 wrote:
| A notable improvement over the GLHF strategy for interacting
| with GPT models.

| lcnPylGDnU4H9OF wrote:
| (In case anybody's confused by the gaming-culture reference:
| https://en.wiktionary.org/wiki/glhf - "Good Luck Have Fun")

| H8crilA wrote:
| This says nothing about how RLHF works, but a lot about what the
| results can be.

| SleekEagle wrote:
| You can check here for an explanation (with some helpful
| figures):
| https://www.assemblyai.com/blog/the-full-story-of-large-lang...

| inciampati wrote:
| Yes! I came to make the same comment.
|
| It's got a catchy title, but it leaves much unresolved.

| victor106 wrote:
| Anyone here know where we can find more resources on RLHF?
|
| There's been a lot written about transformer models etc., but I
| wasn't able to find much about RLHF.

| rounakdatta wrote:
| There's also this exhaustive post from the one and only Chip
| Huyen: https://huyenchip.com/2023/05/02/rlhf.html

| SleekEagle wrote:
| My colleague wrote a couple of pieces that talk about RLHF:
|
| 1. https://www.assemblyai.com/blog/the-full-story-of-large-lang...
|    (you can scroll to "What RLHF actually does to an LLM" if
|    you're already familiar with LLMs)
|
| 2. https://www.assemblyai.com/blog/how-chatgpt-actually-works/

| hansvm wrote:
| It's not the first paper on the topic IIRC, but OpenAI's
| InstructGPT paper [0] is decent and references enough other
| material to get started.
|
| The key idea is that they're able to start with large amounts of
| relatively garbage unsupervised data (the internet), then use
| the resulting model to cheaply generate decent amounts of better
| data (ranking generated content rather than spending the
| man-hours to actually write good content). The other details
| aren't too important.
|
| [0] https://arxiv.org/abs/2203.02155
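To make the ranking idea in hansvm's comment concrete, here is a
minimal sketch of the reward-modeling step: a scorer trained on
pairs of responses where a human labeler picked a winner, using the
pairwise loss from the InstructGPT paper. The tiny network and the
random "embeddings" are illustrative stand-ins, not the actual
OpenAI setup:

    # Reward-model sketch: learn to score the human-preferred
    # response above the rejected one. The scorer here is a toy
    # MLP standing in for an LLM with a scalar reward head.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RewardModel(nn.Module):
        def __init__(self, dim: int = 16):
            super().__init__()
            self.score = nn.Sequential(
                nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.score(x).squeeze(-1)  # one scalar reward each

    model = RewardModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Toy "embeddings" of (prompt + chosen response) and
    # (prompt + rejected response); real inputs are token sequences.
    chosen, rejected = torch.randn(64, 16), torch.randn(64, 16)

    for step in range(100):
        # Pairwise preference loss: -log sigmoid(r_chosen - r_rejected)
        margin = model(chosen) - model(rejected)
        loss = -F.logsigmoid(margin).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

Once trained, this scorer stands in for the human raters: the RL
stage (PPO in InstructGPT) updates the language model to maximize
the learned reward, typically with a KL penalty against the original
model so the policy doesn't drift into degenerate text that games
the scorer.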
| senko wrote:
| Blog post from Hugging Face: https://huggingface.co/blog/rlhf
|
| Webinar on the same topic (from the same HF folks):
| https://www.youtube.com/watch?v=2MBJOuVq380&t=496s
|
| RLHF as used by OpenAI in InstructGPT (the predecessor to
| ChatGPT): https://arxiv.org/abs/2203.02155 (an academic paper,
| so much denser than the above two resources)

| samstave wrote:
| It will be interesting when we have AIs doing RLHF on other AIs,
| based on having been RLHF'd themselves - an iterative AI model
| reinforcement loop...
|
| We talk of 'hallucinations', but what we won't get is AI
| malfeasance identified by AI - RLHF trickery/lying?

| z3c0 wrote:
| This is essentially the premise behind Generative Adversarial
| Networks, and if you've seen the results, they're astounding.
| They're much better for specialized tasks than their generalized
| GPT counterparts.

| samstave wrote:
| Please expand on this?

| p1esk wrote:
| Original RLHF paper: https://arxiv.org/abs/1706.03741

| abscind wrote:
| Any reason RLHF isn't just a band-aid on "not having enough
| data"?

| trade_monkey wrote:
| RLHF is a band-aid on not having enough data that fits your own
| biases and the answers you want the model to give.

| wmwmwm wrote:
| Does anyone have any insight into why reinforcement learning is
| (maybe) required/historically favoured? There was an interesting
| paper recently suggesting that you can use a preference-learning
| objective directly and get a similar or better result without
| the RL machinery - but I lack the right intuition to know
| whether RLHF offers some additional magic! Here's the "Direct
| Preference Optimization" paper: https://arxiv.org/abs/2305.18290

| fardo wrote:
| > Does anyone have any insight into why reinforcement learning
| is (maybe) required/historically favoured?
|
| At the concept stage, it has attractive similarities to the way
| people learn in real life (rewarded for success, punished for
| failure), and although we know that similarities to nature don't
| guarantee better results than the alternatives (for example, the
| modern airplane does not "flap" its wings the way a bird does),
| natural solutions will continually be looked to as a starting
| point and a tool to try on new problems.
|
| Additionally, RL gives you a good start on problems where it's
| unclear how to begin. In spaces where it's not obvious what to
| optimize besides taking actions and seeing how they do against
| some metric, reinforcement learning often provides a good mental
| and code framework for attacking the problem.
|
| > There was a paper recently suggesting that you can use a
| preference learning objective directly
|
| On a very quick skim, that paper argues that rather than giving
| rewards or punishments based on preferences, you can just build
| a predictive classifier for the kinds of responses humans
| prefer. It seems interesting, though I wonder to what extent you
| still occasionally have to do reinforcement learning to generate
| relevant data for evaluating the classifier.

| gradys wrote:
| My intuition on this:
|
| Maximum likelihood training -> faithfully represent the training
| data
|
| Reinforcement learning -> seek out the most preferred answer you
| can
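To make the DPO discussion above concrete, here is a minimal sketch
of the Direct Preference Optimization loss from the linked paper
(https://arxiv.org/abs/2305.18290). The function name and the toy
tensors are illustrative assumptions; in practice each
log-probability is the sum of token log-probs for a response under
the trained policy or the frozen reference model:

    # DPO sketch: optimize preferences directly, with no reward
    # model and no PPO loop.
    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp: torch.Tensor,
                 policy_rejected_logp: torch.Tensor,
                 ref_chosen_logp: torch.Tensor,
                 ref_rejected_logp: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        # How much more the policy likes each answer than the frozen
        # reference model does (the paper's "implicit reward").
        chosen_margin = policy_chosen_logp - ref_chosen_logp
        rejected_margin = policy_rejected_logp - ref_rejected_logp
        # Push the chosen answer's margin above the rejected one's.
        return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

    # Toy usage with four preference pairs.
    loss = dpo_loss(torch.randn(4), torch.randn(4),
                    torch.randn(4), torch.randn(4))

This is gradys's contrast in code: plain maximum likelihood would
just imitate the training data, whereas the preference objective
pushes probability mass toward the preferred answer, with the
reference-model terms playing the role that the KL penalty plays in
RLHF proper - all without training a separate reward model or
running PPO.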