[HN Gopher] Smarter summaries with finetuning GPT-3.5 and chain ...
       ___________________________________________________________________
        
       Smarter summaries with finetuning GPT-3.5 and chain of density
        
       Author : ivanleomk
       Score  : 134 points
       Date   : 2023-11-13 16:12 UTC (6 hours ago)
        
 (HTM) web link (jxnl.github.io)
 (TXT) w3m dump (jxnl.github.io)
        
       | huac wrote:
       | nice work! generating good example data is the most important
       | part of finetuning.
       | 
        | imo summarization is also a fairly simple task -- I wouldn't be
        | surprised if a fine-tuned open source model (e.g. Llama 13B or
        | Mistral 7B) got to similar performance.
        
         | jxnlco wrote:
          | for sure! the one thing i was surprised by was how little data
          | gpt-3.5 needed. would love for a company to see how the scaling
          | laws work for those smaller models.
        
         | robbomacrae wrote:
          | I find that bart-large (410M parameters) [0] does a fine job at
          | summarizing. In Summer AI I alternate between a copy of that
          | bart-large that is continually trained on feedback and GPT-3.5,
          | and honestly I don't have a preference between the results.
          | 
          | However, thanks to this article I might revisit the
          | summarization techniques used and try a fine-tuned 3.5.
          | 
          | It would be great to see these techniques compared to GPT-4
          | Turbo.
         | 
         | [0]: https://huggingface.co/facebook/bart-large-cnn
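          | 
          | A minimal sketch of that bart-large-cnn setup via the
          | transformers pipeline (article_text and the length limits are
          | illustrative, not what Summer AI actually uses):
          | 
          |     from transformers import pipeline
          | 
          |     article_text = open("article.txt").read()  # any long text
          |     summarizer = pipeline("summarization",
          |                           model="facebook/bart-large-cnn")
          |     # truncation=True keeps inputs within the model's length limit
          |     out = summarizer(article_text, max_length=130, min_length=30,
          |                      do_sample=False, truncation=True)
          |     print(out[0]["summary_text"])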
        
       | themonk911 wrote:
       | Gotta admit I spent some time thinking this was a new technique
       | called 'chain of _destiny_ ' and was reading through the article
       | trying to work out what kind of fate-based prompt engineering was
       | happening.
        
         | intelVISA wrote:
         | Did the exact same thing :)
        
         | mpalmer wrote:
         | https://m.youtube.com/watch?v=jGxuWWGo8AY&t=9
        
         | rzzzt wrote:
         | It's a forgotten Wolfenstein sequel!
        
       | Der_Einzige wrote:
       | One of the fun parts of AI is finding out that abstractive
       | summarization is "easy", but extractive summarization (which is
       | what humans do far more often in practice) is still very hard.
       | Partly because most datasets assume sentence level extractive
       | summarization, which is often not how humans summarize documents.
       | 
        | There's still tons of low-hanging fruit in summarization work.
        | I'm not aware of significant follow-up work to pointer networks
        | besides pointer-generator networks, which these days are
        | considered old news. Pointer-based architectures are the ideal
        | fit for word-level extractive summarizers, yet the very best
        | extractive summarization systems today are usually nothing more
        | than sentence selectors using some kind of embeddings and cosine
        | similarity.
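        | 
        | A naive sketch of that kind of sentence selector, assuming the
        | sentence-transformers library (model name and top-k are arbitrary
        | choices, not a recommendation):
        | 
        |     import numpy as np
        |     from sentence_transformers import SentenceTransformer
        | 
        |     def extractive_summary(sentences, k=3):
        |         model = SentenceTransformer("all-MiniLM-L6-v2")
        |         embs = model.encode(sentences)       # one vector per sentence
        |         doc = embs.mean(axis=0)              # crude document vector
        |         sims = embs @ doc / (np.linalg.norm(embs, axis=1)
        |                              * np.linalg.norm(doc))
        |         top = sorted(np.argsort(sims)[-k:])  # keep original order
        |         return [sentences[i] for i in top]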
       | 
       | Happy to see such success with abstractive summaries, but the
       | kind that myself and most other humans are interested in is still
       | far from solved.
        
         | msp26 wrote:
         | Could you point me to more reading on extractive summarisation?
         | A lot of what I see feels out of date compared to what should
         | be possible now with LLMs.
        
       | esafak wrote:
       | Those repeated calls sound like a good way to rack up a bill and
       | incur a high latency.
        
         | jxnlco wrote:
          | right, which is why finetuning on the final densified summary
          | is a great cost save that still preserves quality
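          | 
          | roughly: run the full chain of density once per article offline,
          | then finetune on (article, final dense summary) pairs. a sketch
          | with the OpenAI client (file name, system prompt and the `pairs`
          | list are placeholders):
          | 
          |     import json
          |     from openai import OpenAI
          | 
          |     client = OpenAI()
          | 
          |     # pairs = [(article_text, final_dense_summary), ...]
          |     with open("train.jsonl", "w") as f:
          |         for article, summary in pairs:
          |             f.write(json.dumps({"messages": [
          |                 {"role": "system",
          |                  "content": "Write a dense, entity-rich summary."},
          |                 {"role": "user", "content": article},
          |                 {"role": "assistant", "content": summary},
          |             ]}) + "\n")
          | 
          |     file = client.files.create(file=open("train.jsonl", "rb"),
          |                                purpose="fine-tune")
          |     client.fine_tuning.jobs.create(training_file=file.id,
          |                                    model="gpt-3.5-turbo")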
        
       | jph00 wrote:
       | Minor correction: the article describes Chain of Density as
       | "First introduced by Salesforce's AI Research wing" -- however
       | the 1st author (who is a PhD student) and senior author are both
       | at Columbia; only one of the 5 authors is at Salesforce.
        
         | hackernewds wrote:
          | prepare to see all these companies "invent" these techniques.
          | fwiw people believe OpenAI "invented" chatgpt, whereas the
          | inventors of the transformer model were at a competing company
          | (Google Brain) during that research and have since gone on to
          | found competing companies of their own.
        
           | vinni2 wrote:
            | The novelty of chatgpt was instruction tuning of transformers
            | using reinforcement learning from human feedback, plus finding
            | the right dataset and annotations for it. Before this,
            | transformers were good for some tasks but not so good at
            | generating text. Even though OpenAI didn't invent
            | transformers, they did invent the technique needed to make
            | chatgpt possible.
        
         | jxnlco wrote:
         | I'll fix this now!
        
       | sandGorgon wrote:
        | has anyone finetuned gpt-3.5 or llama, etc. using their private
        | data? what is the best practice for generating training data?
        | 
        | one way i have heard is to send a chunk of data to gpt-4 and ask
        | for questions to be generated. unsure of other ways. what has
        | worked well?
        
         | vjb2tq4dws wrote:
          | here is an example of how to generate synthetic data that you
          | can adapt for your case:
         | https://dzlab.github.io/2023/09/22/palm-synthetic-data/
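          | 
          | A bare-bones sketch of the chunk-to-questions idea from the
          | parent comment, using the OpenAI client (model choice, prompt
          | wording and n are assumptions):
          | 
          |     from openai import OpenAI
          | 
          |     client = OpenAI()
          | 
          |     def questions_for_chunk(chunk, n=5):
          |         # Ask a stronger model for questions the chunk answers;
          |         # the (question, chunk) pairs become training examples.
          |         resp = client.chat.completions.create(
          |             model="gpt-4",
          |             messages=[{"role": "user", "content":
          |                 f"Write {n} questions that the following text "
          |                 f"answers, one per line:\n\n{chunk}"}],
          |         )
          |         return resp.choices[0].message.content.splitlines()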
        
           | just_boost_it wrote:
           | Is this proven to work? ML models are usually trained to
           | learn a model of the environment by giving them environment
           | data. I would have expected feeding it model outputs just
           | trains it to learn a model of the model creating the data.
           | 
           | Without seeing some kind of demonstration otherwise, my
           | feeling is that it would be like regressing stock price on
           | inflation, then trying to generate more data using the
           | regression model and random inflation numbers. All you'd
           | learn is the model that you put in to generate the data.
        
             | valine wrote:
             | I'd think of it less like teaching the model something new,
             | and more like enforcing a behavior the model can already
             | output. Any decent raw model can output function names and
             | parameters with prompt engineering. To do function calling,
             | you need the model to output function names reliably for a
             | wide variety of prompts. That's where the fine-tuning comes
             | in.
        
               | just_boost_it wrote:
               | I could very easily believe that if I saw proof, but it
               | just feels a bit wrong to train a model on model outputs.
               | 
                | Even in the main article here, the model did better with
                | fewer fine-tuning examples. To us, the auto-generated
               | examples might look different enough and might look good
               | enough, but they were all generated algorithmically.
               | Feeding more examples in might easily be leading it to
               | focus on some artifact of the embeddings or generating
               | model that we just don't perceive.
        
               | visarga wrote:
               | > it just feels a bit wrong to train a model on model
               | outputs
               | 
               | If you have a small student model and a large teacher it
               | makes sense, the student is better off after this
               | distillation.
               | 
               | If you have a way to filter out low quality synthetic
               | examples then it would be useful to generate a bunch more
               | and take the best.
               | 
               | If your LLM is an agent, then it can generate feedback
               | signals from the environment. Even a human-AI chat is a
               | form of environment for the model. Every human response
               | can be evaluated as positive or negative reward.
               | 
                | More fundamentally, organic datasets are very unbalanced;
                | LLMs need more complex reasoning chains than what is
               | usually available. There are some exceptions - in
               | scientific papers, manuals and code you get very complex
               | reasoning chains. But not in general. This issue can be
               | fixed with synthetic data.
               | 
               | And even in principle, if you have a model at level N and
               | want to make a dataset at level N+1, then you need to
               | boost your model. You can give it more tokens, more
               | attempts or more tools.
        
         | SubiculumCode wrote:
          | If it's a small amount of data, it seems RAG pipelines are
          | better. That's about all I know.
        
       | tobbe2064 wrote:
        | Am i reading it right that they fine-tune a model using 20
        | examples and 5 epochs? That seems really weird to me.
        
         | isoprophlex wrote:
         | Can't overfit when your learning rate is zero! _insert smart
         | thinking meme_
        
         | riku_iki wrote:
          | LLMs are few-shot learners; that's why many people put
          | examples into the prompt. This is the next step.
        
           | ed wrote:
            | I don't believe few-shot performance dictates how quickly you
            | can fine-tune.
            | 
            | Most fine-tunes will have much larger datasets (I am under
            | the impression you want tens of thousands of examples for
            | most runs).
           | 
           | So I'm similarly impressed 20 examples would make such a big
           | difference.
           | 
           | But also note entity density decreases as example count
           | increases. This is counterintuitive -- maybe something else
           | is going on here?
        
       ___________________________________________________________________
       (page generated 2023-11-13 23:00 UTC)