[HN Gopher] Mistral 7B Fine-Tune Optimized
       ___________________________________________________________________
        
       Mistral 7B Fine-Tune Optimized
        
       Author : tosh
       Score  : 119 points
       Date   : 2023-12-20 19:50 UTC (3 hours ago)
        
 (HTM) web link (openpipe.ai)
 (TXT) w3m dump (openpipe.ai)
        
       | nickthegreek wrote:
        | Anytime I see a claim that "our 7B model is better than
        | GPT-4," I basically stop reading. If you are going to make
        | that claim, give me several easily digestible examples of it
        | actually happening.
        
         | achille wrote:
         | They can absolutely outperform gpt4 for specific use cases.
        
           | nickthegreek wrote:
           | I am very open to believing that. I'd love to see some
           | examples.
        
             | GaggiX wrote:
              | Well, it's pretty easy to find examples online; this
              | one uses Llama 2, not even Mistral or any fancy
              | techniques:
             | https://www.anyscale.com/blog/fine-tuning-
             | llama-2-a-comprehe...
        
             | turnsout wrote:
             | I agree, I think they need an example or two on that blog
             | post to back up the claim. I'm ready to believe it, but I
             | need something more than "diverse customer tasks" to
             | understand what we're talking about.
        
             | shiftpgdn wrote:
             | They're quite close in arena format:
             | https://chat.lmsys.org/?arena
        
               | TOMDM wrote:
                | To be clear, Mixtral is very competitive; Mistral,
                | while certainly far better than most 7B models,
                | performs far worse than GPT-3.5 Turbo.
        
               | shiftpgdn wrote:
               | Apologies, that's what I get for skimming through the
               | thread.
        
             | bugglebeetle wrote:
              | You can fine-tune a small model yourself and see.
              | GPT-4 is an amazing general model, but it won't
              | perform best at every task you throw at it out of the
              | box. I have a fine-tuned Mistral 7B model that
              | outperforms GPT-4 on a specific type of structured
              | data extraction. Maybe a fine-tuned GPT-4 could beat
              | it, but that costs a lot of money for what I can now
              | do locally for the cost of electricity.
        
           | TOMDM wrote:
            | Yeah, a 7B foundation model is of course going to be
            | worse when it's expected to perform well on every task.
            | 
            | But fine-tuned on just a few tasks?
            | 
            | Depending on the task, it's totally reasonable to expect
            | that a 7B model might eke out a win against stock GPT-4,
            | especially if there's domain knowledge in the fine-tune
            | and the task doesn't demand much logical reasoning.
        
           | holoduke wrote:
            | Not for translations. I did a lot of experimenting with
            | different local models. None of them comes even close to
            | the capabilities of ChatGPT; most local models just
            | output plain wrong information. I am still hoping it
            | will be possible one day. For our business it would be a
            | huge opportunity.
        
         | gmuslera wrote:
         | "... with my definition of better" should be the default
         | interpretation whenever you see the word better anywhere.
        
           | filterfiber wrote:
           | In their second sentence they have the most honest response
           | I've seen so far at least: " averaged across 4 diverse
           | customer tasks, fine-tunes based on our new model are
           | _slightly_ stronger than GPT-4, as measured by GPT-4 itself."
        
         | hospitalJail wrote:
          | Some things to note about GPT-4:
          | 
          | >Sometimes it will spit out terrible, horrid answers. I
          | believe this might be due to the time of day/too many
          | users; they limit tokens.
          | 
          | >Sometimes it will lie because of its alignment training.
          | 
          | >Sometimes I feel like it tests things on me.
          | 
          | So yes, you are right, GPT-4 is overall better, but I find
          | myself using local models because I stopped trusting
          | GPT-4.
        
           | moffkalast wrote:
            | How are local models better in terms of trust? GPT-4 is
            | the only model I've seen actually tuned to say no when
            | it doesn't have the information being asked for. Though
            | I do agree it used to run better earlier this year.
            | 
            | The best open source has to offer is Mixtral, which will
            | confidently make up a biography of a person it's never
            | heard of or write a script with nonexistent libraries.
        
             | mattkevan wrote:
             | I once asked Llama whether it'd heard of me. It came back
             | with such a startlingly detailed and convincing biography
             | of someone almost but not quite entirely unlike me that I
             | began to wonder if there was some kind of Sliding Doors
             | alternate reality thing going on.
             | 
             | Some of the things it said I'd done were genuinely good
             | ideas, and I might actually go and do them at some point.
             | 
             | ChatGPT just said no.
        
           | crooked-v wrote:
           | Don't forget that ChatGPT 4 also has seasonal depression [1].
           | 
           | [1]:
           | https://twitter.com/RobLynch99/status/1734278713762549970
           | 
           | (Though with that said, the seasonal issue might be common to
           | any LLM with training data annotated by time of year.)
        
         | tomrod wrote:
         | Looks like they utilized the Bradley-Terry model, but that's
         | not one I'm super familiar with.
         | 
         | https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model
        
           | huac wrote:
            | The BTL model is just a way to infer 'true' skill
            | levels given some list of head-to-head comparisons. The
            | head-to-head comparisons/rankings are the most important
            | part, and in this case the rankings come from GPT-4
            | itself, so take any subsequent score with all the grains
            | of salt you can muster.
            | 
            | Their methodology also appears to be 'try 12 different
            | models and hope 1 of them wins out.' Multiple-hypothesis
            | corrections come to mind here :)
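
Since the Bradley-Terry model comes up here, a minimal sketch of how it turns head-to-head win counts into strength scores, using the standard iterative MLE update (sometimes called Zermelo's algorithm). The two-model example data is made up for illustration:

```python
def bradley_terry(wins, n_iters=200):
    """Estimate Bradley-Terry strengths from a win matrix.

    wins[i][j] is the number of times model i beat model j.
    Uses the classic iterative MLE update (Zermelo's algorithm).
    """
    n = len(wins)
    p = [1.0] * n  # start with equal strengths
    for _ in range(n_iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Games against each opponent, weighted by current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom if denom else p[i])
        scale = n / sum(new_p)          # normalize so strengths sum to n
        p = [x * scale for x in new_p]
    return p

# Example: model A beat model B 8 times, B beat A twice;
# A's estimated strength converges to ~4x B's.
strengths = bradley_terry([[0, 8], [2, 0]])
```

Note that the scores only reflect whatever produced the pairwise judgments; if GPT-4 is the judge, the resulting ranking inherits its biases.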
        
         | brucethemoose2 wrote:
          | IDK about GPT-4 specifically, but I have recently
          | witnessed a case where small fine-tuned 7Bs greatly
          | outperformed larger models (Mixtral Instruct, Llama 70B
          | fine-tunes) on a few very specific tasks.
          | 
          | There is nothing unreasonable about this. However, I do
          | dislike it when that information is presented in a fishy
          | way, implying that it "outperforms GPT-4" without any
          | qualification.
        
         | jug wrote:
          | What I think they're claiming is that it's a base model
          | aimed at further fine-tuning, one that when tuned might
          | perform better than GPT-4 on certain tasks.
          | 
          | It's an argument they make at least as much to market
          | fine-tuning as to market their own model.
          | 
          | This is not a generic model that outperforms another
          | generic model (GPT-4).
          | 
          | That can of course have useful applications, because the
          | resource cost is then comparatively minuscule for certain
          | business use cases.
        
         | thorum wrote:
         | Anecdotally, I finetuned Mistral 7B for a specific (and
         | slightly unusual) natural language processing task just a few
         | days ago. GPT-4 can do the task, but it needs a long complex
         | prompt and only gets it right about 80-90% of the time - the
         | finetuned model performs significantly better with fewer
         | tokens. (In fact it does so well that I suspect I could get
         | good results with an even smaller model.)
        
           | oceanplexian wrote:
            | I have a fine-tuned version of Mistral doing a really
            | simple task and spitting out some JSON. I'm getting
            | equivalent performance to GPT-4 on that specialized
            | task, and it's lower latency, outputs more tokens/sec,
            | and is more reliable, private, and completely free.
            | 
            | I don't think we will have an open-source GPT-4 for a
            | long time, so this is sorta clickbait, but for small,
            | specialized tasks tuned on high-quality data, we are
            | already in the "Linux" era of OSS models. They can do
            | real, practical work.
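
For the structured-JSON use case described above, the serving side usually just needs a parse-and-retry loop around the model call. A minimal, hypothetical sketch; `call_model`, the prompt, and the field names are all illustrative assumptions, not from the comment:

```python
import json

# Hypothetical schema for a structured-extraction task.
REQUIRED_KEYS = {"name", "date", "amount"}

def extract_json(call_model, prompt, max_retries=3):
    """Call the model, accept the first reply that parses as JSON
    with the expected keys, and retry on malformed output."""
    last = None
    for _ in range(max_retries):
        last = call_model(prompt)
        try:
            obj = json.loads(last)
        except json.JSONDecodeError:
            continue  # malformed output: ask again
        if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys():
            return obj
    raise ValueError(f"no valid JSON after {max_retries} tries: {last!r}")
```

A fine-tuned model that emits valid JSON nearly every time makes the retry branch rare, which is where the latency and reliability wins come from.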
        
         | mistercheph wrote:
         | https://chat.lmsys.org/?arena
         | 
          | Try a few blind comparisons: mixtral 8x7b-instruct and
          | gpt-4 are 50-50 for me, and it outperforms 3.5 almost
          | every time, and you can run inference on it with a modern
          | CPU and 64 GB of RAM on a personal device, lmfao. And its
          | instruct fine-tuning has had nowhere near the $$$ and
          | RLHF that OpenAI has invested. It's not a done deal, but
          | people will be able to run models better than today's
          | SOTA on <$1000 hardware in <3 months. I hope for their
          | own sake that OpenAI is moving fast.
        
         | kcorbitt wrote:
         | (Post author here). Totally fair concern. I'll find some
         | representative examples on a sample task we've done some fine-
         | tuning on and add them to the post.
         | 
         | EDIT: Ok so the prompt and outputs are long enough that adding
         | them to the post directly would be kind of onerous. But I
         | didn't want to leave you waiting, so I copied an example into a
         | Notion doc you can see here: https://opipe.notion.site/PII-
         | Redaction-Example-ebfd29939d25...
        
       | xrd wrote:
       | We've tried to sell variants of the open source models to our
       | existing enterprise customers.
       | 
       | I think the adage about "a solution needs to be 10x other
       | solutions to make someone switch" applies here.
       | 
       | Saying something performs slightly better than the industry
       | standard offerings (OpenAI) means that OpenAI is going to laugh
       | all the way to the bank. Everyone will just use their APIs over
       | anything else.
       | 
       | I'm excited about the LLM space and I can barely keep up with the
       | model names, much less all the techniques for fine tuning. A
       | customer is going to have an even worse time.
       | 
        | No one will ever get fired for buying OpenAI (now that IBM
        | is dead, and probably sad that Watson never made a dent).
       | 
       | I do use Mistral for all my personal projects but I'm not sure
       | that is going to have the same effect on the industry as open
       | source software did in the past.
        
         | turnsout wrote:
         | There's a lot of truth to this, but I have seen clients get
         | really interested in local models--mostly due to cost and/or
         | confidentiality. For example, some healthcare clients will
         | never upload medical records to OpenAI, regardless of the
         | enterprise agreement.
        
         | esafak wrote:
          | OpenAI is nothing like IBM in its heyday. I bet a very
          | healthy proportion of companies will not share their data
          | with OpenAI. I saw some numbers on this a while back, but
          | I don't have the link handy. Trust has to be earned.
        
         | bugglebeetle wrote:
          | The problem is that the platform offering here is overly
          | complicated to get started with and quite limited. 2,000
          | dataset entries for $50 a month, when I can do 10x that
          | on Colab for free with axolotl or unsloth? Yeah, no
          | thanks.
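
For context on the axolotl route mentioned above: a QLoRA fine-tune is typically driven by a single YAML config. This is an illustrative sketch only; the dataset path and hyperparameters are placeholders, and key names should be double-checked against the example configs shipped with the axolotl version you install:

```yaml
# Illustrative axolotl config for a QLoRA fine-tune of Mistral 7B.
# Values are placeholders, not a tested recipe.
base_model: mistralai/Mistral-7B-v0.1
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
datasets:
  - path: data/train.jsonl   # hypothetical local dataset
    type: alpaca
sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
output_dir: ./outputs/mistral-qlora
```

Training is then roughly a one-liner along the lines of `accelerate launch -m axolotl.cli.train config.yml`.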
        
         | wavemode wrote:
         | > I think the adage about "a solution needs to be 10x other
         | solutions to make someone switch" applies here.
         | 
         | Cheaper and faster is also better. The cheapest version of
         | GPT-4 costs $0.01/$0.03 per 1K input/output tokens [1]. Mistral
         | AI is charging 0.14EUR/0.42EUR per ONE MILLION input/output
         | tokens for their 7B model [2]. It's night and day.
         | 
         | If people can start fine-tuning a 7B model to do the same work
         | they were doing with GPT-4, they will 100% switch.
         | 
         | [1]: https://help.openai.com/en/articles/7127956-how-much-does-
         | gp...
         | 
         | [2]: https://docs.mistral.ai/platform/pricing/
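
Putting the comment's numbers side by side, a quick back-of-envelope calculation (treating EUR roughly as USD for a ballpark; real exchange rates shift the ratio a bit):

```python
# List prices quoted above, normalized to cost per 1M tokens.
gpt4_per_m_in, gpt4_per_m_out = 10.00, 30.00       # $/1M ($0.01/$0.03 per 1K)
mistral_per_m_in, mistral_per_m_out = 0.14, 0.42   # EUR/1M (Mistral 7B API)

# Cost of processing 1M input tokens plus 1M output tokens.
gpt4_cost = gpt4_per_m_in + gpt4_per_m_out           # $40.00
mistral_cost = mistral_per_m_in + mistral_per_m_out  # ~EUR 0.56
ratio = gpt4_cost / mistral_cost                     # ~71x cheaper
```

Even if a fine-tuned 7B only matches GPT-4 on one narrow task, a roughly 70x price gap on that task is exactly the "night and day" the comment describes.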
        
         | oceanplexian wrote:
         | > I think the adage about "a solution needs to be 10x other
         | solutions to make someone switch" applies here.
         | 
         | It's already superior to OpenAI because it doesn't require an
         | API. You can run the model on your own hardware, in your own
         | datacenter, and your data is guaranteed to remain confidential.
         | Creating a one-off fine-tune is a different story than
         | permanently joining your company at the hip to OpenAI.
         | 
         | I know in our bubble, in the era of Cloud, it's easy to send
         | confidential company data to some random API on the Internet
         | and not worry about it, but that's absolutely not the case for
         | anyone in Healthcare, Government, or even normal companies that
         | are security conscious. For them, OpenAI was never a valid
         | consideration in the first place.
        
           | moneywoes wrote:
            | What is the most prominent use case for private LLMs?
            | Doctors' notes?
        
             | miohtama wrote:
              | Anything business-related at medium and large
              | enterprises, plus government.
        
             | noitpmeder wrote:
             | Definitely healthcare, or for certain industries
             | (HFT/Finance/...) where for various reasons _everything_
             | must be run on prem.
        
             | sergiotapia wrote:
              | You could use it to query any kind of B2B customer
              | information and provide insight, citations, and
              | context without any of the data leaving your private
              | server.
              | 
              | When building something similar powered by OpenAI,
              | anonymizing the data and then de-anonymizing the
              | answers before showing them to the customer was a
              | real pain in the ass.
              | 
              | Also, in my example, I'm sure using a string like
              | "Pineapple Cave Inc." instead of the real business
              | name hurt the AI's ability to contextualize the
              | information -- right?
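
The anonymize/de-anonymize round trip described above can be sketched as simple placeholder substitution. A toy version; a real system would need NER to find the entities and more care with overlapping or repeated strings:

```python
def pseudonymize(text, entities):
    """Replace each sensitive string with a stable placeholder;
    return the masked text and the mapping needed to reverse it."""
    mapping = {}
    for i, entity in enumerate(entities):
        placeholder = f"ENTITY_{i}"
        mapping[placeholder] = entity
        text = text.replace(entity, placeholder)
    return text, mapping

def deanonymize(text, mapping):
    """Substitute the real strings back into the model's answer."""
    for placeholder, entity in mapping.items():
        text = text.replace(placeholder, entity)
    return text
```

The comment's worry is real, though: opaque placeholders like `ENTITY_0` strip context the model could have used, which is one argument for running the model where the raw data can stay.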
        
             | bbor wrote:
              | Great answers above, but long term: personal
              | assistants. I truly think that's a privacy line
              | people won't cross, even after seeing Alexa and
              | Google Maps enter our lives; I think people would
              | rather have nothing than a robot that knows every
              | detail of their health, schedule, feelings, plans,
              | etc. on some vaguely defined server somewhere.
        
               | tomduncalf wrote:
                | Doesn't Google already have that information from
                | your searches, emails, calendar, etc.? Obviously
                | you have to trust that they don't misuse it, but to
                | me it's basically the same thing as a personal
                | assistant having it.
        
               | bbor wrote:
                | Yeah, but I think this is less of a technical line
                | than an emotional one.
                | 
                | For example: I wanted my personal assistant to
                | track hygiene, which is a natural use case. But
                | then you arrive at the natural conclusion that
                | either a) the user needs to enter the data
                | themselves ("I brushed my teeth and washed my face
                | and took X medications at Y time"), or b) you need
                | some sort of sensor in the bathroom, ranging from
                | mics or radio sensors up to a tasteful camera. And
                | a million subtle versions of (b) are where I see
                | people going "no, that's weird, it's too much info
                | all together."
        
             | mrinterweb wrote:
             | Proprietary and sensitive information. Personally, I use a
             | self-hosted LLM because I don't trust how my conversations
             | with hosted generative AI services will be used.
        
             | fo76yo wrote:
              | Personalized metaspaces, game worlds, and content
              | without paying a rent-seeking copyright holder.
              | 
              | Education and research without gatekeepers in
              | academia and industry complaining about their book
              | sales or prestige titles being obsoleted.
              | 
              | A whole lot of use cases that break us out of having
              | to kowtow to experts who were merely born before us
              | and try to monopolize the exploration of science and
              | technology.
              | 
              | To that end I'm working on a GPU-accelerated client
              | backed by local AI, with NeRFs and Gaussian
              | splatting built in.
              | 
              | The upside to being an EE with an MSc in math: most
              | of my money comes from engineering real things. I
              | don't have skin in the cloud CRUD app/API game and
              | don't see a reason to spend money propping up
              | middlemen who, given my skills and abilities, don't
              | add value.
              | 
              | Programmers can go explore syntax art in their
              | parents' basement again. I'm tired of 1970s semantics
              | and of everyone with a DSL thinking theirs is the
              | best thing to happen to computing as a field of
              | inquiry, ever.
              | 
              | Like all industries, big tech is monopolized by aging
              | rent seekers. Disrupting by divesting from it is my
              | play now.
        
         | moneywoes wrote:
         | how are you using it for your project?
        
         | kcorbitt wrote:
          | Hey, I'm the post author. This is a totally fair point! I
          | do think, though, that depending on your specific
          | requirements, open-source models _can_ be a 10x+
          | improvement. For example, we serve Mistral 7B for less
          | than 1/10th the cost of GPT-4-Turbo, which is the model
          | most of our users are comparing us to.
        
           | xrd wrote:
           | This is the 10x I was looking for. Great post by the way!
        
         | jdwyah wrote:
          | The real issue is switching costs. Sure, we start with
          | OpenAI. But at some hackathon in 9 months somebody will
          | try Mistral, and if that saves real money and still
          | works, it feels like an easy swap.
        
         | Joeri wrote:
          | Actually, I think Microsoft is going to laugh all the way
          | to the bank, because most enterprises will probably use
          | the Azure OpenAI service instead of buying OpenAI's
          | offerings directly.
        
         | ren_engineer wrote:
          | All they need is an API-compatible client library so
          | there is no actual switching cost between models other
          | than configuration. There's a reason OpenAI is adding all
          | sorts of add-on features like assistants and file upload:
          | they know the models themselves are going to be a
          | commodity, and they need something to lock developers
          | into their platform.
        
         | mmcwilliams wrote:
          | I think at this point the "10x other solutions" bar
          | should be measured by cost. If I can process, in
          | perpetuity, hundreds of millions of tokens for what it
          | costs OpenAI to process tens of millions of tokens once,
          | that is already past the threshold.
        
       | gpjanik wrote:
        | I want an interactive prompt box with some example prompts
        | and answers from the model, and a comparison with GPT-4. My
        | random guess is that this fine-tuned Mistral-7B is better
        | than GPT-4 at nothing, or almost nothing, and that's why
        | instead of the above we got a table with a bunch of
        | irrelevant metrics.
        
         | boredumb wrote:
         | Of course mistral7b is worse than GPT-4, but I can run
         | mistral-7b at home.
        
           | gpjanik wrote:
            | The point is that the article states "averaged across 4
            | diverse customer tasks, fine-tunes based on our new
            | model are slightly stronger than GPT-4, as measured by
            | GPT-4 itself" and then backs it up with nothing
            | tangible, just the 4 selected metrics where it performs
            | best. I mean, obviously a fine-tuned 7B LLM could
            | perform, let's say, text summarization well. The
            | question is what happens if that text contains code,
            | or domain-specific knowledge where some facts are less
            | relevant than others, etc., and that isn't going to be
            | answered by any metric alone. Fundamentally, with
            | enough diverse metrics, each based on a different
            | dataset, the one with the biggest overlap with the
            | fine-tuning dataset will perform really well, and the
            | rest, well, not so well.
            | 
            | Basically, the statistic means that there's a set of
            | data for which that particular (fine-tuned) network
            | performs slightly better than GPT-4, and everywhere
            | else, pretty badly. It's just not generalizable to
            | everything, while GPT-4 is. It's about as informative
            | as saying "calculators outperform GPT-4 at counting."
            | Like, yes, they probably do, but I would like to see:
            | is it applicable and practical, or did you just train
            | an LLM to sort Polish names alphabetically really well?
            | That's why a qualitative approach to evaluating LLMs is
            | just better.
        
       | avereveard wrote:
        | Not a bad model. It becomes incoherent above 8k tokens, and
        | it's not helped by the fact that it's very verbose, but it
        | seems very coherent and stays closely on topic until then:
        | https://chat.openai.com/share/089d1b8c-3467-4c01-af9f-6568c0...
        | 
        | It fails at math of course, even if the problem is very
        | easy, like all Mistrals. Good for generation, probably not
        | the best for RAG; there are Mistral tunes that stay
        | coherent to 16k tokens, and that cuts down chunking
        | significantly.
        
       ___________________________________________________________________
       (page generated 2023-12-20 23:00 UTC)