[HN Gopher] How RLHF Works
       ___________________________________________________________________
        
       How RLHF Works
        
       Author : natolambert
       Score  : 122 points
       Date   : 2023-06-21 14:21 UTC (8 hours ago)
        
 (HTM) web link (www.interconnects.ai)
 (TXT) w3m dump (www.interconnects.ai)
        
       | [deleted]
        
       | RicDan wrote:
        | The problem with this is that it leads to the algorithm
        | targeting outputs that sound good to humans. That's why it's
        | bad and won't help us; it should also incorporate "sorry, don't
        | know that", but for that it needs to actually be smart.
        
         | m00x wrote:
          | It can be weighted to be more honest when it doesn't know, if
          | those answers are the ones the labelers pick.
        
           | dr_dshiv wrote:
           | Need smarter labelers
        
         | cubefox wrote:
         | Honesty/truthfulness is indeed a difficult problem with any
         | kind of fine-tuning. There is no way to incentivize the model
         | to say what it believes to be true rather than what human
         | raters would regard as true. Future models could become
         | actively deceptive.
        
       | noam_compsci wrote:
        | Not very good. I just want a step-by-step, ultra-high-level
        | explanation: 1. Build a model. 2. Run it ten times. 3. Get
        | humans to do xyz until result abc.
        
       | pestatije wrote:
       | RLHF - reinforcement learning from human feedback
        
         | 1bent wrote:
         | Thank you!
        
         | cylon13 wrote:
         | A notable improvement over the GLHF strategy for interacting
         | with GPT models.
        
           | lcnPylGDnU4H9OF wrote:
           | (In case anybody's confused by the gaming culture reference:
           | https://en.wiktionary.org/wiki/glhf. "Good Luck Have Fun")
        
       | H8crilA wrote:
        | This says nothing about how RLHF works, but a lot about what
        | the results can be.
        
         | SleekEagle wrote:
         | You can check here for an explanation (with some helpful
         | figures) https://www.assemblyai.com/blog/the-full-story-of-
         | large-lang...
        
         | inciampati wrote:
         | Yes! I came to make the same comment.
         | 
          | It's got a catchy title, but it leaves much unresolved.
        
       | victor106 wrote:
       | Anyone here know where we can find more resources on RLHF?
       | 
       | There's been a lot written about transformer models etc., but I
       | wasn't able to find much about RLHF.
        
         | rounakdatta wrote:
          | There's also this exhaustive post from the one and only Chip Huyen:
         | https://huyenchip.com/2023/05/02/rlhf.html
        
         | SleekEagle wrote:
         | My colleague wrote a couple of pieces that talk about RLHF:
         | 
         | 1. https://www.assemblyai.com/blog/the-full-story-of-large-
         | lang... (you can scroll to "What RLHF actually does to an LLM"
         | if you're already familiar with LLMs)
         | 
         | 2. https://www.assemblyai.com/blog/how-chatgpt-actually-works/
        
         | hansvm wrote:
         | It's not the first paper on the topic IIRC, but OpenAI's
         | InstructGPT paper [0] is decent and references enough other
         | material to get started.
         | 
          | The key idea is that they're able to start with large amounts
          | of relatively garbage unsupervised data (the internet), train
          | a model on it, and then use that model to cheaply generate
          | decent amounts of better data (labelers rank generated content
          | rather than spending the man-hours to actually write good
          | content). The other details aren't too important.
         | 
         | [0] https://arxiv.org/abs/2203.02155
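          | 
          | A rough sketch of that "rank instead of write" idea, turning
          | one labeler ranking of K sampled completions into pairwise
          | comparisons for reward-model training (plain Python,
          | hypothetical field names, not OpenAI's actual pipeline):
          | 
          |   from itertools import combinations
          | 
          |   # One labeling task: a prompt plus K sampled completions,
          |   # ranked best-to-worst by a human (made-up structure).
          |   task = {
          |       "prompt": "Explain RLHF in one sentence.",
          |       "ranked": [  # best first
          |           "Fine-tuning against human preference judgments.",
          |           "It is reinforcement learning.",
          |           "idk",
          |       ],
          |   }
          | 
          |   def ranking_to_pairs(task):
          |       # K ranked completions -> K*(K-1)/2 (chosen, rejected)
          |       # pairs, the unit of reward-model training data.
          |       return [
          |           {"prompt": task["prompt"],
          |            "chosen": better, "rejected": worse}
          |           for better, worse in combinations(task["ranked"], 2)
          |       ]
          | 
          |   for pair in ranking_to_pairs(task):
          |       print(pair["chosen"], ">", pair["rejected"])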
        
         | senko wrote:
         | Blog post from Huggingface: https://huggingface.co/blog/rlhf
         | 
         | Webinar on the same topic (from same HF folks):
         | https://www.youtube.com/watch?v=2MBJOuVq380&t=496s
         | 
         | RLHF as used by OpenAI in InstructGPT (predecessor to ChatGPT):
         | https://arxiv.org/abs/2203.02155 (academic paper, so much
         | denser than the above two resources)
        
           | samstave wrote:
            | It will be interesting when we have AI doing RLHF on other
            | AIs, based on itself having been RLHF'd, in an iterative AI
            | model reinforcement loop...
            | 
            | We talk of 'hallucinations', but what we won't get is AI
            | malfeasance (trickery/lying) identified by AI-run RLHF?
        
             | z3c0 wrote:
             | This is essentially the premise behind Generative
             | Adversarial Networks, and if you've seen the results,
             | they're astounding. They're much better for specialized
             | tasks than their generalized GPT counterparts.
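              | 
              | For the curious, a bare-bones sketch of that adversarial
              | setup (PyTorch, toy 1-D data, nothing production-grade):
              | one network generates, the other judges, and each trains
              | against the other's output.
              | 
              |   import torch
              |   import torch.nn as nn
              | 
              |   # Toy GAN: G maps noise -> fake samples, D scores
              |   # real vs. fake; each trains against the other.
              |   G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(),
              |                     nn.Linear(16, 1))
              |   D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(),
              |                     nn.Linear(16, 1))
              |   opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
              |   opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
              |   bce = nn.BCEWithLogitsLoss()
              |   ones = torch.ones(64, 1)
              |   zeros = torch.zeros(64, 1)
              | 
              |   for step in range(1000):
              |       real = torch.randn(64, 1) * 2 + 3  # "real" data
              |       fake = G(torch.randn(64, 8))
              | 
              |       # D: label real samples 1, generated samples 0.
              |       d_loss = (bce(D(real), ones) +
              |                 bce(D(fake.detach()), zeros))
              |       opt_d.zero_grad()
              |       d_loss.backward()
              |       opt_d.step()
              | 
              |       # G: fool D into scoring fakes as real.
              |       g_loss = bce(D(fake), ones)
              |       opt_g.zero_grad()
              |       g_loss.backward()
              |       opt_g.step()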
        
               | samstave wrote:
               | Please expand on this?
        
       | p1esk wrote:
       | Original RLHF paper: https://arxiv.org/abs/1706.03741
        
       | abscind wrote:
        | Any reason RLHF isn't just a band-aid on "not having enough
        | data"?
        
         | trade_monkey wrote:
          | RLHF is a band-aid on not having enough data that fits your
          | own biases and the answers you want the model to give.
        
       | wmwmwm wrote:
       | Does anyone have any insight into why reinforcement learning is
       | (maybe) required/historically favoured? There was an interesting
       | paper recently suggesting that you can use a preference learning
       | objective directly and get a similar/better result without the RL
       | machinery - but I lack the right intuition to know whether RLHF
        | offers some additional magic! Here's the "Direct Preference
        | Optimization" paper: https://arxiv.org/abs/2305.18290
        
         | fardo wrote:
         | > Does anyone have any insight into why reinforcement learning
         | is (maybe) required/historically favoured?
         | 
          | Conceptually, it has attractive similarities to the way people
          | learn in real life (rewarded for success, punished for
          | failure), and although we know similarities to nature don't
          | guarantee better results than alternatives (for example, a
          | modern airplane does not "flap" its wings the way a bird
          | does), natural solutions will continually be looked to as a
          | starting point and a tool to try on new problems.
         | 
          | Additionally, RL gives you a good start on problems where it's
          | unclear how to begin. In spaces where it's not clear where to
          | start optimizing besides taking actions and seeing how they do
          | when judged against some metric, reinforcement learning often
          | provides a good mental and code framework for attacking these
          | problems.
         | 
         | >There was a paper recently suggesting that you can use a
         | preference learning objective directly
         | 
          | From a very quick skim, it looks like that paper is arguing
          | that, rather than giving rewards or punishments based on
          | preferences, you can just build a predictive classifier for
          | the kinds of responses humans prefer. It seems interesting,
          | though I wonder to what extent you still have to occasionally
          | do that reinforcement learning to generate relevant data for
          | evaluating the classifier.
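          | 
          | That "predictive classifier for the kinds of responses humans
          | prefer" is essentially the reward model in standard RLHF. A
          | minimal sketch of its pairwise objective (PyTorch, with a
          | tiny stand-in scorer instead of an LLM with a scalar head):
          | 
          |   import torch
          |   import torch.nn as nn
          |   import torch.nn.functional as F
          | 
          |   # Stand-in reward model: a tiny MLP over fixed-size
          |   # text embeddings rather than a full language model.
          |   rm = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
          |                      nn.Linear(64, 1))
          |   opt = torch.optim.Adam(rm.parameters(), lr=1e-4)
          | 
          |   def reward_loss(chosen_emb, rejected_emb):
          |       # Pairwise objective: the chosen response should
          |       # score higher than the rejected one.
          |       margin = rm(chosen_emb) - rm(rejected_emb)
          |       return -F.logsigmoid(margin).mean()
          | 
          |   # Toy batch of 32 (chosen, rejected) embedding pairs.
          |   loss = reward_loss(torch.randn(32, 128),
          |                      torch.randn(32, 128))
          |   opt.zero_grad()
          |   loss.backward()
          |   opt.step()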
        
         | gradys wrote:
         | My intuition on this:
         | 
         | Maximum likelihood training -> faithfully represent training
         | data
         | 
         | Reinforcement learning -> seek out the most preferred answer
         | you can
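          | 
          | In loss terms, the difference looks roughly like this
          | (PyTorch, token-level; the rewards here are random stand-ins
          | for whatever a learned reward model would assign):
          | 
          |   import torch
          |   import torch.nn.functional as F
          | 
          |   logits = torch.randn(4, 10, 50257)  # (batch, seq, vocab)
          |   tokens = torch.randint(0, 50257, (4, 10))
          | 
          |   # Maximum likelihood: make the training data as probable
          |   # as possible, whatever its quality.
          |   mle_loss = F.cross_entropy(logits.reshape(-1, 50257),
          |                              tokens.reshape(-1))
          | 
          |   # RL (REINFORCE-style): make highly rewarded samples more
          |   # probable, so the model seeks out preferred answers
          |   # instead of just imitating the data.
          |   logp = torch.log_softmax(logits, dim=-1)
          |   seq_logp = logp.gather(-1, tokens.unsqueeze(-1))
          |   seq_logp = seq_logp.squeeze(-1).sum(-1)
          |   rewards = torch.randn(4)             # stand-in scores
          |   rl_loss = -(rewards * seq_logp).mean()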
        
       ___________________________________________________________________
       (page generated 2023-06-21 23:01 UTC)