[HN Gopher] Mistral 7B Fine-Tune Optimized ___________________________________________________________________ Mistral 7B Fine-Tune Optimized Author : tosh Score : 119 points Date : 2023-12-20 19:50 UTC (3 hours ago) (HTM) web link (openpipe.ai) (TXT) w3m dump (openpipe.ai) | nickthegreek wrote: | Anytime I see a claim that "our 7B model is better than GPT-4" I | basically stop reading. If you are going to make that claim, give | me several easily digestible examples of it taking place. | achille wrote: | They can absolutely outperform gpt4 for specific use cases. | nickthegreek wrote: | I am very open to believing that. I'd love to see some | examples. | GaggiX wrote: | Well, it's pretty easy to find examples online; this one | uses Llama 2, not even Mistral or fancy techniques: | https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehe... | turnsout wrote: | I agree, I think they need an example or two on that blog | post to back up the claim. I'm ready to believe it, but I | need something more than "diverse customer tasks" to | understand what we're talking about. | shiftpgdn wrote: | They're quite close in arena format: | https://chat.lmsys.org/?arena | TOMDM wrote: | To be clear, Mixtral is very competitive; Mistral, while | certainly way better than most 7B models, performs far | worse than ChatGPT 3.5 Turbo. | shiftpgdn wrote: | Apologies, that's what I get for skimming through the | thread. | bugglebeetle wrote: | You can fine-tune a small model yourself and see. GPT-4 is | an amazing general model, but won't perform the best at | every task you throw at it, out of the box. I have a fine-tuned Mistral 7B model that outperforms GPT-4 on a specific | type of structured data extraction. Maybe if I fine-tuned | GPT-4 it could beat it, but that costs a lot of money for | what I can now do locally for the cost of electricity. | TOMDM wrote: | Yeah, a 7B foundation model is of course going to be worse | when expected to perform on every task.
| | But finetuning on just a few tasks? | | Depending on the task, it's totally reasonable to expect that | a 7B model might eke out a win against stock GPT4. Especially | if there's domain knowledge in the finetune, and the given | task is light on demand for logical skills. | holoduke wrote: | Not for translations. Did a lot of experimenting with different | local models. None come even a bit close to the capabilities | of chatgpt. Most local models just output plain wrong | information. I am still hoping one day it will be possible. | For our business that would be a huge opportunity. | gmuslera wrote: | "... with my definition of better" should be the default | interpretation whenever you see the word better anywhere. | filterfiber wrote: | In their second sentence they have the most honest response | I've seen so far at least: "averaged across 4 diverse | customer tasks, fine-tunes based on our new model are | _slightly_ stronger than GPT-4, as measured by GPT-4 itself." | hospitalJail wrote: | Some things to note about gpt4: | | >Sometimes it will spit out terrible, horrid answers. I believe | this might be due to time of the day/too many users. They limit | tokens. | | >Sometimes it will lie because it has alignment | | >Sometimes I feel like it tests things on me | | So, yes, you are right, gpt4 is overall better, but I find | myself using local models because I stopped trusting gpt4. | moffkalast wrote: | How are local models better in terms of trust? GPT-4 is the | only model I've seen actually tuned to say no when it doesn't | have the information being asked for. Though I do agree it | used to run better earlier this year. | | The best open source has to offer is Mixtral, which will | confidently make up a biography of a person it's never heard | of before or write a script with nonexistent libraries. | mattkevan wrote: | I once asked Llama whether it'd heard of me.
It came back | with such a startlingly detailed and convincing biography | of someone almost but not quite entirely unlike me that I | began to wonder if there was some kind of Sliding Doors | alternate reality thing going on. | | Some of the things it said I'd done were genuinely good | ideas, and I might actually go and do them at some point. | | ChatGPT just said no. | crooked-v wrote: | Don't forget that ChatGPT 4 also has seasonal depression [1]. | | [1]: | https://twitter.com/RobLynch99/status/1734278713762549970 | | (Though with that said, the seasonal issue might be common to | any LLM with training data annotated by time of year.) | tomrod wrote: | Looks like they utilized the Bradley-Terry model, but that's | not one I'm super familiar with. | | https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model | huac wrote: | the BTL model is just a way to infer 'true' skill levels | given some list of head-to-head comparisons. the head-to-head | comparisons / rankings are the most important!!!! and in this | case, the rankings come from GPT-4 itself. so take any | subsequent score with all the grains of salt you can muster. | | their methodology also appears to be 'try 12 different models | and hope 1 of them wins out.' multiple hypothesis adjustments | come to mind here :) | brucethemoose2 wrote: | IDK about GPT4 specifically, but I have recently witnessed a | case where small finetuned 7Bs greatly outperformed larger | models (Mixtral Instruct, Llama 70B finetunes) in a few very | specific tasks. | | There is nothing unreasonable about this. However, I do dislike | it when that information is presented in a fishy way, implying | that it "outperforms GPT4" without any qualification. | jug wrote: | What I think they're claiming is that it's a base model aimed | at further fine-tuning, which when tuned might perform | better than GPT-4 on certain tasks. | | It's an argument they make at least as much to market fine-tuning as their own model.
| | This is not a generic model that outperforms another generic | model (GPT-4). | | That can of course have useful applications, because the | resource/cost is then comparatively minuscule for certain | business use cases. | thorum wrote: | Anecdotally, I finetuned Mistral 7B for a specific (and | slightly unusual) natural language processing task just a few | days ago. GPT-4 can do the task, but it needs a long, complex | prompt and only gets it right about 80-90% of the time - the | finetuned model performs significantly better with fewer | tokens. (In fact it does so well that I suspect I could get | good results with an even smaller model.) | oceanplexian wrote: | I have a fine-tuned version of Mistral doing a really simple | task and spitting out some JSON. I'm getting equivalent | performance to GPT-4 on that specialized task. It's lower | latency, outputs more tokens/sec, and is more reliable, | private, and completely free. | | I don't think we will have an open-source GPT-4 for a long | time, so this is sorta clickbait, but for small, | specialized tasks, tuned on high-quality data, we are already | in the "Linux" era of OSS models. They can do real, practical | work. | mistercheph wrote: | https://chat.lmsys.org/?arena | | Try a few blinds: mixtral 8x7b-instruct and gpt-4 are 50-50 for | me, and it outperforms 3.5 almost every time, and you can run | inference on it with a modern cpu and 64 GB of RAM on a | personal device lmfao. and the instruct finetuning has had | nowhere near the $$$ and rlhf that openai has. It's not a done | deal, but people will be able to run models better than today's | SOTA on <$1000 hardware in <3 months; I hope for their own sake | that OpenAI is moving fast. | kcorbitt wrote: | (Post author here). Totally fair concern. I'll find some | representative examples on a sample task we've done some fine-tuning on and add them to the post.
| | EDIT: Ok, so the prompt and outputs are long enough that adding | them to the post directly would be kind of onerous. But I | didn't want to leave you waiting, so I copied an example into a | Notion doc you can see here: https://opipe.notion.site/PII-Redaction-Example-ebfd29939d25... | xrd wrote: | We've tried to sell variants of the open source models to our | existing enterprise customers. | | I think the adage about "a solution needs to be 10x other | solutions to make someone switch" applies here. | | Saying something performs slightly better than the industry | standard offerings (OpenAI) means that OpenAI is going to laugh | all the way to the bank. Everyone will just use their APIs over | anything else. | | I'm excited about the LLM space and I can barely keep up with the | model names, much less all the techniques for fine tuning. A | customer is going to have an even worse time. | | No one will ever get fired for buying OpenAI (now that IBM is | dead, and probably sad Watson never made a dent). | | I do use Mistral for all my personal projects, but I'm not sure | that is going to have the same effect on the industry as open | source software did in the past. | turnsout wrote: | There's a lot of truth to this, but I have seen clients get | really interested in local models--mostly due to cost and/or | confidentiality. For example, some healthcare clients will | never upload medical records to OpenAI, regardless of the | enterprise agreement. | esafak wrote: | OpenAI is nothing like IBM in its heyday. I bet a very healthy | proportion of companies will not share their data with OpenAI. | I saw some numbers on this a while back, but I don't have the link | handy. Trust has to be earned. | bugglebeetle wrote: | The problem is that the platform offering here is overly | complicated to get started with and quite limited. 2000 dataset | entries for $50 a month, when I can do 10x as many as that on | Colab for free with axolotl or unsloth? Yeah, no thanks.
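The PII-redaction example kcorbitt links above is typical of the narrow, structured tasks where commenters report small fine-tunes matching GPT-4. As a rough illustration of the input/output shape such a task has (a toy rule-based stand-in, not OpenPipe's actual model or method):

```python
import re

# Toy stand-in for a PII-redaction task: swap emails and phone numbers for
# placeholder tags. A fine-tuned model handles the fuzzy cases (names,
# addresses) that simple patterns miss; this only sketches the target
# output format such a fine-tune would be trained to produce.

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact(text: str) -> str:
    """Replace recognizable PII spans with bracketed placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(redact("mail a.b@example.com or call 555-123-4567"))
# -> mail [EMAIL] or call [PHONE]
```

The regex part is the easy 80%; the argument in the thread is that a small fine-tune closes the remaining gap (free-form names, addresses) more cheaply and privately than prompting GPT-4.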
| wavemode wrote: | > I think the adage about "a solution needs to be 10x other | solutions to make someone switch" applies here. | | Cheaper and faster is also better. The cheapest version of | GPT-4 costs $0.01/$0.03 per 1K input/output tokens [1]. Mistral | AI is charging 0.14EUR/0.42EUR per ONE MILLION input/output | tokens for their 7B model [2]. It's night and day. | | If people can start fine-tuning a 7B model to do the same work | they were doing with GPT-4, they will 100% switch. | | [1]: https://help.openai.com/en/articles/7127956-how-much-does- | gp... | | [2]: https://docs.mistral.ai/platform/pricing/ | oceanplexian wrote: | > I think the adage about "a solution needs to be 10x other | solutions to make someone switch" applies here. | | It's already superior to OpenAI because it doesn't require an | API. You can run the model on your own hardware, in your own | datacenter, and your data is guaranteed to remain confidential. | Creating a one-off fine-tune is a different story than | permanently joining your company at the hip to OpenAI. | | I know in our bubble, in the era of Cloud, it's easy to send | confidential company data to some random API on the Internet | and not worry about it, but that's absolutely not the case for | anyone in Healthcare, Government, or even normal companies that | are security conscious. For them, OpenAI was never a valid | consideration in the first place. | moneywoes wrote: | what is the most prominent use case for private LLMs, doctor | notes? | miohtama wrote: | Anything related to the business or medium and large | enterprises, government | noitpmeder wrote: | Definitely healthcare, or for certain industries | (HFT/Finance/...) where for various reasons _everything_ | must be run on prem. | sergiotapia wrote: | You could use it to query against any kind of B2B customer | information and provide insight, citations and context | without any of the data leaving your private server. 
| | When building something similar powered by OpenAI, it was a | real pain in the ass anonymizing the data, then de-anonymizing the answers before showing them to the customer. | | Also, in my example, I'm sure using a string like "Pineapple | Cave Inc." instead of the real business name hurt the model's | ability to contextualize the information -- right? | bbor wrote: | Great answers above, but long term: personal assistants. I | truly think that's a privacy line people won't cross, even | after seeing Alexa and Google Maps enter into our lives; I | think people would rather have nothing than a robot that | knows every detail of their health, schedule, feelings, | plans, etc. in some vaguely defined server somewhere. | tomduncalf wrote: | Doesn't Google already have that information from your | searches, emails, calendar, etc.? Obviously you have to | trust they don't misuse it, but to me it's basically the same | thing as some personal assistant having it. | bbor wrote: | Yeah, but I think this is less of a technical line than | an emotional one. | | For example: I wanted my personal assistant to track | hygiene, which is a natural use case. But then you arrive | at the natural conclusion that either a) the user needs | to enter the data themselves ("I brushed my teeth and | washed my face and took X medications at Y time"), or b) | you need some sort of sensor in the bathroom, ranging | from mics or radio sensors up to a tasteful camera. And a | million subtle versions of (b) are where I see people | going "no, that's weird, it's too much info all together". | mrinterweb wrote: | Proprietary and sensitive information. Personally, I use a | self-hosted LLM because I don't trust how my conversations | with hosted generative AI services will be used. | fo76yo wrote: | Personalized metaspaces, game worlds, content without | paying a rent-seeking copyright holder.
| | Education and research without gatekeepers in academia and | industry complaining about their book sales or prestige | titles being obsoleted. | | A whole lot of use cases that break us out of having to | kowtow to experts who were merely born before us trying to | monopolize exploration of science and technology. | | To that end, I'm working on a GPU-accelerated client backed | by local AI, with NeRFs and Gaussian splatting built in. | | The upside to being an EE with an MSc in math: most of my | money comes from engineering real things. I don't have skin | in the cloud CRUD app/API game and don't see a reason to | spend money propping up middlemen who, given my skills and | abilities, don't add value. | | Programmers can go explore syntax art in their parents' | basement again. Tired of 1970s semantics and everyone with | a DSL thinking that's the best thing to happen to computing | as a field of inquiry ever. | | Like all industries, big tech is monopolized by aging rent | seekers. Disrupting by divesting from it is my play now. | moneywoes wrote: | how are you using it for your project? | kcorbitt wrote: | Hey, I'm the post author. This is a totally fair point! I do | think, though, that depending on your specific requirements, open-source models _can_ be a 10x+ improvement. For example, we | serve Mistral 7B for less than 1/10th the cost of GPT-4-Turbo, | which is the model most of our users are comparing us to. | xrd wrote: | This is the 10x I was looking for. Great post by the way! | jdwyah wrote: | The real thing is the switching costs. Sure, we start with | OpenAI. But at some hackathon in 9 months somebody will try | Mistral, and if that saves real money and still works, it feels | like an easy swap. | Joeri wrote: | Actually, I think Microsoft is going to laugh all the way to | the bank, because most enterprises will probably use the Azure | OpenAI service instead of directly buying OpenAI's offerings.
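The "easy swap" jdwyah describes usually amounts to a configuration change, because common self-hosting servers (vLLM, llama.cpp's server, Ollama) expose OpenAI-compatible chat endpoints. A minimal sketch of the idea; the endpoint URL and model names here are hypothetical placeholders:

```python
# Sketch: when a local server speaks the OpenAI chat API, switching providers
# is configuration, not code. URLs and model names below are placeholders.

def chat_config(provider: str) -> dict:
    """Client settings for an OpenAI-compatible chat endpoint."""
    if provider == "openai":
        return {"base_url": "https://api.openai.com/v1", "model": "gpt-4"}
    if provider == "local":
        # e.g. a self-hosted vLLM or llama.cpp server (address is a placeholder)
        return {"base_url": "http://localhost:8000/v1", "model": "mistral-7b"}
    raise ValueError(f"unknown provider: {provider}")

def build_request(provider: str, prompt: str) -> dict:
    """The request body is identical across providers; only the config differs."""
    cfg = chat_config(provider)
    return {
        "model": cfg["model"],
        "messages": [{"role": "user", "content": prompt}],
    }
```

With the official `openai` Python client the same pattern applies by passing a different `base_url` when constructing the client; the application code that builds messages never changes.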
| ren_engineer wrote: | all they need is an API-compatible client library so there is | no actual switching cost between models other than | configuration. There's a reason OpenAI is adding all sorts of | add-on features like assistants and file upload: they | know models themselves are going to be a commodity, and they | need something to lock developers onto their platform. | mmcwilliams wrote: | I think at this point the "10x other solutions" should be | measured on cost. If I can process, in perpetuity, 100s of | millions of tokens for what OpenAI charges to do 10s of | millions of tokens one time, that is already past the | threshold. | gpjanik wrote: | I want an interactive prompt box with some example prompts and | answers from the model and a comparison with GPT-4. My random | guess is that this finetuned Mistral-7B is better than GPT-4 at | nothing, or almost nothing, and that's why instead of the above we | got a table with a bunch of irrelevant metrics. | boredumb wrote: | Of course mistral7b is worse than GPT-4, but I can run | mistral-7b at home. | gpjanik wrote: | The point is that the article states "averaged across 4 | diverse customer tasks, fine-tunes based on our new model are | slightly stronger than GPT-4, as measured by GPT-4 itself" | and then proves it with nothing tangible, just the 4 selected | metrics where it performs the best. I mean, obviously a | finetuned 7B LLM could perform, let's say, text summarization | well. The question is what happens if that text contains | code, or domain-specific knowledge where some facts are less | relevant than others, etc., and that isn't going to be | answered by any metric alone. Fundamentally, with enough | diverse metrics, each based on a different dataset, the one | with the biggest overlap with the dataset used for finetuning will | perform really well, and the rest, well, not so well.
| | Basically, the statistic means that there's a set of data for | which that particular (finetuned) network performs slightly | better than GPT-4, and everywhere else, pretty badly. It's just | not generalizable to everything, while GPT-4 is. It's just as | good as saying "calculators outperform GPT-4 at counting". | Like, yes, they probably do, but I would like to see - is it | applicable and practical, or did you just train an LLM to | sort all the names in Polish alphabetically really well? And | that's why a qualitative approach to evaluating LLMs is just | better. | avereveard wrote: | not a bad model; becomes incoherent above 8k tokens, and it's | not helped by the fact that it's very verbose, but it seems very | coherent and stays on topic closely until then: | https://chat.openai.com/share/089d1b8c-3467-4c01-af9f-6568c0... | | fails at math of course, even if the problem is very easy, like | all mistrals. good for generation, probably not the best for RAG. there are mistral tunes that stay coherent to 16k tokens, and that | cuts down chunking significantly ___________________________________________________________________ (page generated 2023-12-20 23:00 UTC)