[HN Gopher] Smarter summaries with finetuning GPT-3.5 and chain ...
___________________________________________________________________

Smarter summaries with finetuning GPT-3.5 and chain of density

Author : ivanleomk
Score  : 134 points
Date   : 2023-11-13 16:12 UTC (6 hours ago)

(HTM) web link (jxnl.github.io)
(TXT) w3m dump (jxnl.github.io)

  | huac wrote:
  | nice work! generating good example data is the most important
  | part of finetuning.
  |
  | imo summarization is also a fairly simple task -- I wouldn't be
  | surprised if a fine-tuned open-source model (e.g. Llama 13B /
  | Mistral 7B) would get to similar performance.
  | jxnlco wrote:
  | for sure! the one thing i was surprised by was how little data
  | gpt3.5 needed. would love for a company to try how the scaling
  | laws work out for those smaller models.
  | robbomacrae wrote:
  | I find that bart-large (410M parameters) [0] does a fine job at
  | summarizing. In Summer AI I alternate between a copy of that
  | bart-large, continually trained on feedback, and GPT-3.5, and
  | honestly I don't have a preference between the results.
  |
  | However, thanks to this article I might revisit the
  | summarization techniques used and try a fine-tuned 3.5.
  |
  | It would be great to see these techniques compared to GPT-4
  | Turbo.
  |
  | [0]: https://huggingface.co/facebook/bart-large-cnn
  | themonk911 wrote:
  | Gotta admit I spent some time thinking this was a new technique
  | called 'chain of _destiny_' and was reading through the article
  | trying to work out what kind of fate-based prompt engineering
  | was happening.
  | intelVISA wrote:
  | Did the exact same thing :)
  | mpalmer wrote:
  | https://m.youtube.com/watch?v=jGxuWWGo8AY&t=9
  | rzzzt wrote:
  | It's a forgotten Wolfenstein sequel!
  | Der_Einzige wrote:
  | One of the fun parts of AI is finding out that abstractive
  | summarization is "easy", but extractive summarization (which is
  | what humans do far more often in practice) is still very hard --
  | partly because most datasets assume sentence-level extractive
  | summarization, which is often not how humans summarize
  | documents.
  |
  | There's still tons of low-hanging fruit in summarization work.
  | I'm not aware of significant follow-up work to pointer networks
  | besides pointer-generator networks, which these days are
  | considered old news. Pointer-based architectures are the ideal
  | system for word-level extractive summarizers, yet the very best
  | extractive summarization systems today are usually nothing more
  | than sentence selectors using some kind of embeddings and cosine
  | similarity.
  |
  | Happy to see such success with abstractive summaries, but the
  | kind that I and most other humans are interested in is still far
  | from solved.
  | msp26 wrote:
  | Could you point me to more reading on extractive summarisation?
  | A lot of what I see feels out of date compared to what should be
  | possible now with LLMs.
  | esafak wrote:
  | Those repeated calls sound like a good way to rack up a bill and
  | incur high latency.
  | jxnlco wrote:
  | right, which is why finetuning on the last round is a great cost
  | save that preserves quality
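
As context for the cost discussion above, here is a minimal sketch
of the chain-of-density loop, assuming the openai Python client
(v1); the prompts are illustrative, not the article's exact ones.
Each densifying pass is a separate API call, which is where the
bill and the latency come from; the finetune then teaches GPT-3.5
to emit the final round's dense summary in a single call.

    # Illustrative chain-of-density loop: start from a sparse
    # summary, then repeatedly fold in missing entities without
    # letting the summary grow longer.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    def chain_of_density(article, rounds=3, model="gpt-4"):
        prompt = ("Write a short, entity-sparse summary of this "
                  "article:\n\n" + article)
        summary = None
        for _ in range(rounds):
            if summary is not None:
                prompt = (
                    "Article:\n" + article +
                    "\n\nCurrent summary:\n" + summary +
                    "\n\nFind 1-3 informative entities from the "
                    "article that are missing from the summary, "
                    "then rewrite the summary to include them "
                    "WITHOUT increasing its length."
                )
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            summary = response.choices[0].message.content
        return summary
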
  | jph00 wrote:
  | Minor correction: the article describes Chain of Density as
  | "First introduced by Salesforce's AI Research wing" -- however,
  | the first author (who is a PhD student) and the senior author
  | are both at Columbia; only one of the five authors is at
  | Salesforce.
  | hackernewds wrote:
  | prepared to see all these companies "invent" these techniques.
  | fwiw people believe OpenAI "invented" chatgpt, whereas the
  | inventors of the transformer were all at Google Brain during
  | that research and have since gone on to found competing
  | companies.
  | vinni2 wrote:
  | The novelty of chatgpt was instruction tuning of transformers
  | using reinforcement learning from human feedback, plus finding
  | the right dataset and annotations for it. Before this,
  | transformers were good for some tasks but not so good at
  | generating text. Even though OpenAI didn't invent transformers,
  | they did invent the technique needed to make chatgpt possible.
  | jxnlco wrote:
  | I'll fix this now!
  | sandGorgon wrote:
  | has anyone finetuned gpt-3.5 or llama, etc. using their private
  | data? what is the best practice to generate training data?
  |
  | one way i have heard of is to send a chunk of data to gpt-4 and
  | ask for questions to be generated. unsure of other ways. what
  | has worked well?
  | vjb2tq4dws wrote:
  | here is an example of how to generate synthetic data that you
  | can adapt for your case:
  | https://dzlab.github.io/2023/09/22/palm-synthetic-data/
  | just_boost_it wrote:
  | Is this proven to work? ML models are usually trained to learn a
  | model of the environment by giving them environment data. I
  | would have expected that feeding in model outputs just trains it
  | to learn a model of the model creating the data.
  |
  | Without seeing some kind of demonstration otherwise, my feeling
  | is that it would be like regressing stock price on inflation,
  | then trying to generate more data using the regression model and
  | random inflation numbers. All you'd learn is the model that you
  | put in to generate the data.
  | valine wrote:
  | I'd think of it less like teaching the model something new, and
  | more like enforcing a behavior the model can already output. Any
  | decent raw model can output function names and parameters with
  | prompt engineering. To do function calling, you need the model
  | to output function names reliably for a wide variety of prompts.
  | That's where the fine-tuning comes in.
  | just_boost_it wrote:
  | I could very easily believe that if I saw proof, but it just
  | feels a bit wrong to train a model on model outputs.
  |
  | Even in the main article here, the model did better with fewer
  | fine-tuning examples. To us, the auto-generated examples might
  | look different enough and good enough, but they were all
  | generated algorithmically. Feeding more examples in might easily
  | be leading it to focus on some artifact of the embeddings or of
  | the generating model that we just don't perceive.
  | visarga wrote:
  | > it just feels a bit wrong to train a model on model outputs
  |
  | If you have a small student model and a large teacher it makes
  | sense: the student is better off after this distillation.
  |
  | If you have a way to filter out low-quality synthetic examples,
  | then it would be useful to generate a bunch more and take the
  | best.
  |
  | If your LLM is an agent, then it can generate feedback signals
  | from the environment. Even a human-AI chat is a form of
  | environment for the model: every human response can be evaluated
  | as positive or negative reward.
  |
  | More fundamentally, organic datasets are very unbalanced; LLMs
  | need more complex reasoning chains than are usually available.
  | There are some exceptions -- in scientific papers, manuals and
  | code you get very complex reasoning chains -- but not in
  | general. This issue can be fixed with synthetic data.
  |
  | And even in principle, if you have a model at level N and want
  | to make a dataset at level N+1, then you need to boost your
  | model. You can give it more tokens, more attempts, or more
  | tools.
  | SubiculumCode wrote:
  | If it's a small amount of data, it seems RAG pipelines are
  | better, is all I think I know.
  | tobbe2064 wrote:
  | Am I reading it right that they fine-tune a model using 20
  | examples and 5 epochs? That seems really weird to me
  | isoprophlex wrote:
  | Can't overfit when your learning rate is zero! _insert smart
  | thinking meme_
  | riku_iki wrote:
  | LLMs are few-shot learners; that's why many people put examples
  | into the prompt. This is the next step.
  | ed wrote:
  | I don't believe few-shot performance dictates how quickly you
  | can fine-tune.
  |
  | Most fine-tunes will have much larger datasets (I am under the
  | impression you want tens of thousands of examples for most
  | runs).
  |
  | So I'm similarly impressed that 20 examples would make such a
  | big difference.
  |
  | But also note entity density decreases as example count
  | increases. This is counterintuitive -- maybe something else is
  | going on here?
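
On the mechanics of the 20-example, 5-epoch run discussed above:
with the OpenAI fine-tuning API this amounts to a JSONL upload
plus one job-creation call. A minimal sketch, assuming the openai
Python client (v1) and a list of (article, dense summary) pairs
such as a chain-of-density loop produces; the system prompt and
file name are illustrative.

    # Sketch: distill chain-of-density outputs into a GPT-3.5
    # finetune. `pairs` holds (article, final dense summary)
    # tuples gathered upstream.
    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    def launch_finetune(pairs, path="summaries.jsonl"):
        # Each training example is a full chat: a fixed system
        # prompt, the article as the user turn, and the dense
        # summary as the assistant target.
        with open(path, "w") as f:
            for article, dense_summary in pairs:
                f.write(json.dumps({"messages": [
                    {"role": "system",
                     "content": "Write a dense, entity-rich summary."},
                    {"role": "user", "content": article},
                    {"role": "assistant", "content": dense_summary},
                ]}) + "\n")
        uploaded = client.files.create(file=open(path, "rb"),
                                       purpose="fine-tune")
        return client.fine_tuning.jobs.create(
            training_file=uploaded.id,
            model="gpt-3.5-turbo",
            # 5 passes over the ~20 examples discussed above
            hyperparameters={"n_epochs": 5},
        )
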
___________________________________________________________________
(page generated 2023-11-13 23:00 UTC)