[HN Gopher] Fine-tune your own Llama 2 to replace GPT-3.5/4
       ___________________________________________________________________
        
       Fine-tune your own Llama 2 to replace GPT-3.5/4
        
        There has been a lot of interest on HN in fine-tuning open-source
        LLMs recently (e.g. Anyscale's post at
        https://news.ycombinator.com/item?id=37090632). I've been playing
        around with fine-tuning models for a couple of years, and wanted to
        share some insights and practical code. I've condensed what I've
        learned into a small set of notebooks at
        https://github.com/OpenPipe/OpenPipe/tree/main/examples/clas...,
        covering labeling data, fine-tuning, running efficient inference,
        and evaluating costs/performance. The 7B model we train here
        matches GPT-4's labels 95% of the time on the test set, and for the
        5% of cases where they disagree it's often because the correct
        answer is genuinely ambiguous.

        What is fine-tuning? You can think of it as a more-powerful form of
        prompting, where instead of writing your instructions in text you
        actually encode them in the weights of the model itself. You do
        this by training an existing model on example input/output pairs
        that demonstrate the task you want your fine-tuned model to learn.
        Fine-tuning can work with as few as 50 examples, but I usually try
        to get 1000+ if possible.
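
        To make that concrete, here's a minimal sketch (not the exact code
        or data from the notebooks; the recipe text and field names are
        hypothetical) of what a dataset of input/output pairs might look
        like on disk:

        ```python
        # Hypothetical recipe-classification pairs written out as JSONL,
        # the format most fine-tuning frameworks expect.
        import json

        pairs = [
            {"input": "Recipe: Garlic Butter Shrimp\n1 lb shrimp, 4 tbsp butter...",
             "output": "seafood"},
            {"input": "Recipe: Vegan Lentil Curry\n1 cup red lentils, 1 can coconut milk...",
             "output": "vegetarian"},
        ]

        with open("train.jsonl", "w") as f:
            for example in pairs:
                f.write(json.dumps(example) + "\n")
        ```
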
        Prompting still has some big advantages over fine-tuning. It's way
        easier/faster to iterate on your instructions than to label data
        and re-train a model. And operationally it's easier to deploy one
        big model and just adjust its behavior as necessary vs deploying
        many small fine-tuned models that will likely each get lower
        utilization.

        Fine-tuning has one huge advantage though: it is far more effective
        at guiding a model's behavior than prompting, so you can often get
        away with a _much_ smaller model. That gets you faster responses
        and lower inference costs. A fine-tuned Llama 7B model is 50x
        cheaper than GPT-3.5 on a per-token basis, and for many use cases
        can produce results that are as good or better!

        For example, classifying the 2M recipes at
        https://huggingface.co/datasets/corbt/all-recipes with GPT-4 would
        cost $23k. Even with GPT-3.5 it would cost over $1k. The model we
        fine-tuned performs similarly to GPT-4 and costs just $19 to run
        over the entire dataset.

        Disclaimer: My brother David and I are working on an open-source
        product called OpenPipe (https://github.com/openpipe/openpipe) to
        help engineers adopt fine-tuning as simply as possible. But none of
        the information above depends on our startup. This post is just
        about sharing what we've learned about fine-tuning. I hope it's
        useful!
        
       Author : kcorbitt
       Score  : 463 points
       Date   : 2023-09-12 16:53 UTC (6 hours ago)
        
       | OhNoNotAgain_99 wrote:
        | Just curious: would it be possible to add a small network,
        | trained on a body of study material like programming books?
        | Freeze the weights of the existing large network, and together
        | with the new network try to predict the book. The existing
        | network knows language but not the content; the combined network
        | would be trained on the content, and eventually together they
        | score better. These "small" added networks might be specific to
        | a certain topic (e.g. learn Python or so). Then these small
        | networks could become modular, essentially creating some kind of
        | LoRA networks for LLMs.
        | 
        | Maybe start this way from the ground up, so you can get modular
        | units for health, finance, programming, education, writing
        | assistance, philosophy, ethics, etc. If the modules can be
        | swapped, one might be able to reduce their size. E.g. pick 2 or
        | 3, chain them, and you have an LLM for a specific area of
        | interest (reducing running cost).
        
         | sandkoan wrote:
         | This is part of what we're doing at Automorphic. Building
         | shareable, stackable adapters that you can compose like lego
         | bricks.
        
       | 3abiton wrote:
        | Do you think this will end up facilitating the diffusion of
        | fine-tuned LLM checkpoint models, just like Stable Diffusion? Is
        | a web UI what's missing?
        
         | brucethemoose2 wrote:
          | There are already many hundreds of finetunes on Huggingface,
          | and many excellent UIs to run them in, like KoboldCPP and
          | Text-gen-ui:
          | https://huggingface.co/models?sort=modified&search=13B
          | 
          | There is even a crowdsourced version of the UI, like Artbot:
          | https://lite.koboldai.net/#
          | 
          | And there are some excellent extant finetuning frameworks, like
          | Axolotl, that run on consumer GPUs:
          | https://github.com/OpenAccess-AI-Collective/axolotl
          | 
          | IIRC Text-gen-ui had a QLoRA finetuning UI too.
          | 
          | What I am saying is that it's _already_ like Stable Diffusion,
          | but the community is just somewhat under the radar, and
          | finetuning will never be quite as turnkey as Dreambooth / SD
          | 1.5 LoRA due to the nature of the training data.
        
       | indeyets wrote:
        | What are the hardware requirements for larger models? What can I
        | fine-tune on an Nvidia A100? Will it be possible to work with
        | 70B, for example?
        
         | kcorbitt wrote:
         | Depending on what you're trying to accomplish, I'd highly
         | recommend trying the 7B and 13B models first before jumping to
         | the 70B. They're quite capable and I think lots of folks assume
         | they need to jump to a 70B model when really a smaller one
         | would work fine.
         | 
         | That said, you should be able to fine-tune a 70B model on an
         | A100 using QLoRA. However, depending on the specifics of your
         | dataset it might actually be cheaper to run on an 8xA100
         | machine since that way you don't have to swap any weights out
         | to the machine's non-GPU memory, and you might get enough time
         | savings from that that the more expensive machine pays for
         | itself.
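          | 
          | For reference, a rough sketch of what loading a model in 4-bit
          | for QLoRA fine-tuning looks like with HuggingFace transformers
          | + peft (illustrative only; the model id and hyperparameters are
          | placeholders, not the exact settings from the notebooks):
          | 
          | ```python
          | import torch
          | from transformers import AutoModelForCausalLM, BitsAndBytesConfig
          | from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
          | 
          | # Load the base model with 4-bit NF4 quantization (QLoRA-style).
          | bnb_config = BitsAndBytesConfig(
          |     load_in_4bit=True,
          |     bnb_4bit_quant_type="nf4",
          |     bnb_4bit_compute_dtype=torch.bfloat16,
          | )
          | model = AutoModelForCausalLM.from_pretrained(
          |     "meta-llama/Llama-2-70b-hf",  # placeholder model id
          |     quantization_config=bnb_config,
          |     device_map="auto",
          | )
          | 
          | # Attach small trainable LoRA adapters; only these get updated.
          | model = prepare_model_for_kbit_training(model)
          | lora_config = LoraConfig(
          |     r=16, lora_alpha=32, lora_dropout=0.05,
          |     target_modules=["q_proj", "v_proj"],
          |     task_type="CAUSAL_LM",
          | )
          | model = get_peft_model(model, lora_config)
          | model.print_trainable_parameters()
          | ```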
        
           | indeyets wrote:
           | The plan was to do it in-house. And buying 8xA100 is a bit
           | too much ;)
        
       | varelse wrote:
       | [dead]
        
       | Maschinesky wrote:
        | What makes sense to fine-tune and what doesn't?
        | 
        | You said 50-1000 examples.
        | 
        | Do I fine-tune when I have specific Q/A sets, like from real
        | customers, and want to add the right answers to the model?
        | 
        | Do I fine-tune facts, or should I use some kind of lookup?
        | 
        | Does it make sense to add code and API docs for the current
        | version of something I want better support for? Like ChatGPT
        | knows Quarkus 2 but not Quarkus 3.
        
         | kcorbitt wrote:
         | > What makes sense to fine-tune and what not?
         | 
         | In general, fine-tuning helps a model figure out how to do the
         | exact task that is being done in the examples it's given. So
         | fine-tuning it on 1000 examples of an API being used in the
         | wild is likely to teach it to use that API really effectively,
         | but fine-tuning it on just the API docs probably won't.
         | 
         | That said, there are a lot of interesting ideas floating around
         | on how to most effectively teach a model purely from
         | instructions like API docs. Powerful models like GPT-4 can
         | figure it out from in-context learning (ie. if you paste in a
         | page of API docs and ask GPT-4 to write something with the API
         | it can usually do a decent job). I suspect the community will
         | figure out techniques either through new training objectives or
         | synthetic training data to do it for smaller fine-tuned models
         | as well.
        
         | Arctic_fly wrote:
         | Generally speaking, fine-tuning a small model makes sense when
         | the task that you want it to carry out is well-defined and
         | doesn't vary too much from one prompt to another. Fine-tuning
         | facts into a model doesn't seem to scale super well, but
         | general textual style, output format, and evaluation criteria
         | for example can all be instilled through the fine-tuning
         | process. I would use lookup if you need your answers to include
         | a wide array of information that the model you're basing off of
         | wasn't initially trained on.
        
       | divbzero wrote:
       | Is Llama 2 currently the way to go for fine-tuning your own
       | models? Are there other open-source LLMs worth considering?
        
         | daemonologist wrote:
         | We've found Flan-T5 to be useful for text-to-text (mostly
         | document QA). Haven't done a lot of testing on fine-tuning yet
         | though.
        
         | loudmax wrote:
         | The Huggingface Leaderboard is mostly dominated by Llama 2
         | variants:
         | https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
         | 
          | It depends a lot on what you're trying to do. If you have a
          | focused use case for the type of fine-tuning you want, you can
          | probably get away with one of the smaller models.
          | 
          | Another thing to look out for is Retrieval Augmented Generation
          | (RAG). I don't see it in wide use yet, but it may turn out to
          | be more useful than fine-tuning for a lot of situations.
        
         | jw903 wrote:
          | It's one of the most widely fine-tuned models right now. Take a
          | look at this Colab for fine-tuning on your own dataset:
          | https://github.com/mlabonne/llm-course/blob/main/Fine_tune_L...
        
         | kcorbitt wrote:
         | Depends on your use case. If you're doing pure classification
         | then there are smaller encoder-only models like DeBERTa that
         | might get you better performance with a much smaller model size
         | (so cheaper inference).
         | 
         | But if you need text generation and are ok with a 7B+ parameter
         | model, Llama 2 or one of its derivatives is what I'd strongly
         | recommend. The community around it is much larger than any of
         | the alternatives so the tooling is better, and it's either
         | state of the art or close to it on all evals when compared to
         | other similarly-sized open models.
         | 
         | If you're comfortable sharing more details of the task you're
         | trying to do I might be able to give more specific advice.
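          | 
          | (As a rough illustration of the encoder-only route, not code
          | from the notebooks -- the model id and label count below are
          | placeholders:)
          | 
          | ```python
          | from transformers import (AutoModelForSequenceClassification,
          |                           AutoTokenizer)
          | 
          | # A small encoder-only model fine-tuned for classification can be
          | # much cheaper to serve than a 7B+ generative model.
          | tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
          | model = AutoModelForSequenceClassification.from_pretrained(
          |     "microsoft/deberta-v3-base", num_labels=3)  # e.g. 3 classes
          | 
          | inputs = tokenizer("Garlic butter shrimp with lemon", return_tensors="pt")
          | logits = model(**inputs).logits
          | print(logits.argmax(dim=-1))  # predicted class id (untrained here)
          | ```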
        
       | accrual wrote:
       | This looks very helpful! I'm just starting out in the ML/LLM
       | space and have an opportunity to work on this at $dayjob,
       | bookmarking as this looks like an excellent resource. Thank you!
        
       | msp26 wrote:
       | Do you still use few-shot prompting with a fine-tune? Or does it
       | make little difference?
        
         | kcorbitt wrote:
         | Nope, no need for few-shot prompting in most cases once you've
         | fine-tuned on your dataset, so you can save those tokens and
         | get cheaper/faster responses!
        
           | selfhoster11 wrote:
           | Not only that, but in a lot of cases you won't have to fine-
           | tune at all if an existing instruct model does a good enough
           | job with unambiguous enough instructions.
        
         | selfhoster11 wrote:
         | In my experience, there is little need to do that. With
         | completely unambiguous instructions that describe the exact
         | output format, you can often get away with no examples
          | whatsoever. Single examples might be helpful, but multi-shot
          | prompting will definitely be unneeded (and may even harm the
          | model's output quality).
        
       | halyconWays wrote:
       | Someone needs to make an LLM purpose-built for creating high-
       | quality datasets for fine-tuning other LLMs.
        
         | fabmilo wrote:
          | This. The best use of the current LLMs is to create better
          | datasets.
        
       | he11ow wrote:
       | Thanks! When it comes to choosing where to work with these
       | models, which compute platform do you recommend (assuming locally
       | doesn't really make sense with my resources)? Colab? AWS
       | StudioLab?
       | 
       | Which is your go to?
        
       | brianjking wrote:
       | Very nice, thanks!
       | 
       | Check out what Matt Shumer put together as well:
       | https://github.com/mshumer/gpt-llm-trainer.
       | 
       | I have used his trainer for auto distillation of GPT-4 into
       | GPT3.5 fine tunes, but plan to do the same for Llama as well.
       | 
       | Cheers!
        
       | minimaxir wrote:
       | > Fine-tuning has one huge advantage though: it is far more
       | effective at guiding a model's behavior than prompting, so you
       | can often get away with a much smaller model. That gets you
       | faster responses and lower inference costs. A fine-tuned Llama 7B
       | model is 50x cheaper than GPT-3.5 on a per-token basis, and for
       | many use cases can produce results that are as good or better!
       | 
       | These comparisons are reductive to the point of being misleading.
       | Even with all the optimizations in the ecosystem, it's not
       | trivial to get a finetuned 7B param model running at an
       | acceptable inference latency. Even if you use a GPU such as an
       | A100 for maximum speed, then you have scalability issues since
       | A100s are scarce. Also, the "50% cheaper" assumes 100%
       | utilization of a GPU which will never happen in production use
       | cases.
       | 
        | Quality-wise, a finetuned Llama 2 is not necessarily better than
        | ChatGPT. Finetuning requires a high-quality dataset, which is not
        | easy to construct. And in my own experience with finetuning Llama
        | 2, qualitatively it caused more frustration to get outputs on par
        | with just using ChatGPT.
        | 
        | The value of the ChatGPT API is more dependable scaling and not
        | having to pay for infra.
        
         | moonchrome wrote:
          | We are talking about 7B models? Those can run on consumer GPUs
          | with lower latency than A100s AFAIK (because gaming GPUs are
          | clocked differently).
          | 
          | Not to mention OpenAI has shit latency and terrible reliability
          | - you should be using Azure models if you care about that - but
          | pricing is also higher.
          | 
          | I would say fixed costs and development time are on OpenAI's
          | side, but I've seen people post great practical comparisons for
          | latency and cost using hosted fine-tuned small models.
        
           | [deleted]
        
           | minimaxir wrote:
           | "Running" and "acceptable inference speed and quality" are
           | two different constraints, particularly at scale/production.
        
             | moonchrome wrote:
              | I don't understand what you're trying to say.
              | 
              | From what I've read, a 4090 should blow an A100 away if you
              | can fit within 22GB VRAM, which a 7B model comfortably
              | should.
              | 
              | And the latency (along with variability and availability)
              | on the OpenAI API is terrible because of the load they are
              | getting.
        
           | 7speter wrote:
           | When you say it can run on consumer gpus, do you mean pretty
           | much just the 4090/3090 or can it run on lesser cards?
        
             | [deleted]
        
             | gsuuon wrote:
              | Quantized 7Bs can run comfortably with 8GB of VRAM.
        
             | halflings wrote:
             | I was able to run the 4bit quantized LLAMA2 7B on a 2070
             | Super, though latency was so-so.
             | 
             | I was surprised by how fast it runs on an M2 MBP +
             | llama.cpp; Way way faster than ChatGPT, and that's not even
             | using the Apple neural engine.
        
               | hereonout2 wrote:
               | It runs fantastically well on M2 Mac + llama.cpp, such a
               | variety of factors in the Apple hardware making it
               | possible. The ARM fp16 vector intrinsics, the Macbook's
               | AMX co-processor, the unified memory architecture, etc.
               | 
               | It's more than fast enough for my experiments and the
               | laptop doesn't seem to break a sweat.
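                | 
                | For anyone wanting to try it, a minimal sketch using the
                | llama-cpp-python bindings (the model filename is a
                | placeholder for whatever quantized weights you have):
                | 
                | ```python
                | from llama_cpp import Llama
                | 
                | # Loads a quantized Llama 2 7B; fits comfortably in ~8GB of RAM/VRAM.
                | llm = Llama(model_path="llama-2-7b.Q4_K_M.gguf", n_ctx=2048)
                | out = llm("Q: What is the capital of Wisconsin? A:", max_tokens=32)
                | print(out["choices"][0]["text"])
                | ```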
        
         | hereonout2 wrote:
         | Doesn't this depend a lot on your application though? Not every
         | workload needs low latency and massive horizontal scalability.
         | 
          | Take their example of running the LLM over the 2 million
          | recipes and saving $23k over GPT-4. That could easily be 2
          | million documents in some back-end system running in a batch.
          | Many people would wait a few days or weeks for a job like that
          | to finish if it offered significant savings.
        
           | minimaxir wrote:
           | That's more of a fair use case.
           | 
            | It does demonstrate, though, why the economics are
            | complicated and there's no one-size-fits-all.
        
         | kcorbitt wrote:
         | We're finding that when running Llama-2-7B with vLLM
         | (https://github.com/vllm-project/vllm) on an A40 GPU we're
         | getting consistently lower time-to-first-token and lower
         | average token generation time than GPT-3.5, even when
         | processing multiple requests in parallel. A40s are pretty easy
          | to get your hands on these days (much easier than A100s anyway).
         | 
         | The 50x cheaper (that's 2% of the cost, not 50% of the cost)
         | number does assume 100% GPU utilization, which may or may not
         | be realistic for your use case. If you're doing batch
         | processing as part of a data pipeline, which is not an unusual
         | use case, you can run your GPU at 100% utilization and turn it
         | off when the batch finishes.
         | 
         | If you've got a highly variable workload then you're right,
         | you'll have much lower utilization numbers. But if you work
         | with an aggregator that can quickly hot swap LoRA fine-tunes
         | (as a disclaimer, my company OpenPipe works in this space) you
         | can get back a lot of that lost efficiency since we can
         | increase/decrease GPU capacity only when our aggregate usage
         | changes, which smooths things out.
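          | 
          | A minimal sketch of the kind of vLLM batch setup described
          | above (model id, prompts, and sampling parameters are
          | illustrative, not our production config):
          | 
          | ```python
          | from vllm import LLM, SamplingParams
          | 
          | # vLLM batches requests with continuous batching + PagedAttention,
          | # which is where most of the throughput win comes from.
          | llm = LLM(model="meta-llama/Llama-2-7b-hf")  # or your fine-tuned weights
          | params = SamplingParams(temperature=0.0, max_tokens=64)
          | 
          | prompts = [
          |     "Classify this recipe: Garlic Butter Shrimp ...",
          |     "Classify this recipe: Vegan Lentil Curry ...",
          | ]
          | for output in llm.generate(prompts, params):
          |     print(output.outputs[0].text)
          | ```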
        
         | [deleted]
        
       | ronyfadel wrote:
       | For translation jobs, I've experimented with Llama 2 70B (running
       | on Replicate) v/s GPT-3.5;
       | 
       | For about 1000 input tokens (and resulting 1000 output tokens),
       | to my surprise, GPT-3.5 turbo was _100x cheaper_ than Llama 2.
       | 
        | Llama 7B wasn't up to the task, FYI, producing very poor
        | translations.
        | 
        | I believe that OpenAI priced GPT-3.5 aggressively cheap in order
        | to make it a no-brainer to rely on them rather than relying on
        | other vendors (even open source models).
       | 
       | I'm curious to see if others have gotten different results?
        
         | ramesh31 wrote:
         | >For about 1000 input tokens (and resulting 1000 output
         | tokens), to my surprise, GPT-3.5 turbo was 100x cheaper than
         | Llama 2.
         | 
         | You'll never get actual economics out of switching to open
         | models without running your own hardware. That's the whole
         | point. There's orders of magnitude difference in price, where a
          | single V100/3090 instance can run llama2-70b inference for
          | ~$0.50/hr.
        
           | YetAnotherNick wrote:
            | No, they can't run it. Llama 70B with 4-bit quantization
            | takes ~50 GB VRAM for a decent enough context size. You need
            | an A100, 2-3 V100s, or 4 3090s, which all cost roughly
            | $3-5/hr.
        
             | ramesh31 wrote:
             | Wrong. I am running 8bit GGML with 24GB VRAM on a single
             | 4090 with 2048 context right now
        
               | YetAnotherNick wrote:
                | Which model? I am talking about 70B, as mentioned
                | clearly. 70B at 8-bit is 70GB just for the model itself.
                | How many tokens/second are you getting with a single
                | 4090?
        
               | ramesh31 wrote:
               | Offloading 40% of layers to CPU, about 50t/s with 16
               | threads.
        
               | pocketarc wrote:
               | That is more than an order of magnitude better than my
               | experience; I get around 2 t/s with similar hardware. I
               | had also seen others reporting similar figures to mine so
               | I assumed it was normal. Is there a secret to what you're
               | doing?
        
               | ramesh31 wrote:
               | >Is there a secret to what you're doing?
               | 
               | Core speed and memory bandwidth matter a lot. This is on
               | a Ryzen 7950 with DDR5.
        
         | avereveard wrote:
          | Together AI has new aggressive pricing where 70B is on par with
          | GPT-3.5 and everything smaller is considerably cheaper. The
          | catch is that the only 32K-context model as of today is their
          | Llama 7B, which is fairly limited.
        
         | computerex wrote:
         | Replicate has terrible pricing. Have you tried deepinfra?
        
         | halflings wrote:
         | I don't think translation is a great use case for ChatGPT and
         | LLAMA. These models are overwhelmingly trained on English, and
         | LLAMA2 which should have more data from other languages is
         | still focused on languages w/ Latin/Cyrillic characters (so
         | won't work well for Arabic, Hebrew, or CJK languages).
         | 
         | You're better off using models specialized in translation;
         | General purpose LLMs are more useful when fine-tuning on
         | specific tasks (some form of extraction, summarization,
         | generative tasks, etc.), or for general chatbot-like uses.
        
           | achileas wrote:
           | There are plenty of examples in the literature of using LLMs
           | for translation beating the metrics of non-LLM models, even
           | for languages for which there isn't a lot of data.
           | Transliterating non-Latin characters helps a lot with
           | accuracy as well.
        
           | daniels11 wrote:
           | what models would you use for translation? I am working on a
           | language learning tutor (trytutor.app, very early) and
           | GPT-3.5 turbo has been working fine, for the most part.
           | 
           | For foreign language corrections ("correct this German
           | sentence and give a reason for the correction"), GPT-3.5
           | doesn't quite have the horsepower so I use GPT-4
        
           | og_kalu wrote:
           | >You're better off using models specialized in translation
           | 
            | For a couple dozen languages, GPT-4 is by far the best
            | translator you can get your hands on, so basically no.
        
             | daniels11 wrote:
             | I will say that GPT-4 is just incredibly expensive. For my
             | app I only use it for advanced translations/corrections,
             | and usually a combination of GPT-3.5+Wiktionary is able to
             | get the more simple stuff done
        
               | all2 wrote:
               | > GPT-3.5+Wiktionary
               | 
               | Can you share more about your app and what you're doing?
        
               | daniels11 wrote:
               | Sure! I'm building a personalized AI language learning
               | tutor using Open AI's API and ElevenLabs (for Text to
               | Speech).
               | 
               | Right now it's basically a chat bot that you can use to
               | practice conversing with. It provides corrections for the
               | things you type. Eventually I'd like to try adding
               | Whisper as well to allow users to speak out loud.
               | 
               | When you hover over a word, you get a translation.
               | Initially I thought using Open AI for every word
               | translation would be too much, but I've been able to get
               | it down to ~36-40 tokens/request. (3-4 cents/1000
               | requests). I also began parsing and uploading some of
               | this [Wiktionary
               | data](https://kaikki.org/dictionary/rawdata.html) and am
               | working on a feature that integrates the GPT-3.5
               | translation with this Wiktionary data.
               | 
               | A lot of these features are still in the works but you
               | can feel free to try it if you like
               | (https://trytutor.app).
        
         | refulgentis wrote:
         | For use cases well within the capabilities of an LLM from last
         | year, fine-tuned LLaMa 2 13B should/will blow ChatGPT out of
         | the water: think "rate the sentiment of this text from 0-10".
         | 
         | I believe this because LLaMa-2 13B is more than good enough to
         | handle what I call "quick search", i.e.
         | 
          | ```
          | User: "What's the weather in Milwaukee?"
          | 
          | System: Here's some docs, answer concisely in one sentence.
          | 
          | AI: It's 73 degrees Fahrenheit.
          | ```
         | 
         | YMMV on cost still, depends on cloud vendor, and my intuition
         | agrees with yours: GPT-3.5 is priced low enough that there
         | isn't a case where it makes sense to use another model. It
          | strikes me now that there's a good reason for that intuition:
         | OpenAI's $/GPU hour is likely <= any other vendor's and
         | inference time of LLaMa 2 ~= GPT.
         | 
         | I do think this will change with local LLMs. They've been way
         | over-hyped for months, but after LLaMa 2, the challenges
         | remaining are more sociological than technical.
         | 
         | For months now it's been one-off $LATEST_BUZZY_MODEL.c stunts
         | that run on desktop.
         | 
         | The vast majority of the _actual_ usage and progress is coming
         | from porn-y stuff, and the investment occurs in one-off stunts.
         | 
         | That split of effort, and lack of engineering rigor, is
         | stunting progress overall.
         | 
         | Microsoft has LLaMa-2 ONNX available on GitHub[1]. There's
         | budding but very small projects in different languages to wrap
         | ONNX. Once there's a genuine cross-platform[2] ONNX wrapper
         | that makes running LLaMa-2 easy, there will be a step change.
         | It'll be "free"[3] to run your fine-tuned model that does as
         | well as GPT-4.
         | 
         | It's not clear to me exactly when this will occur. It's
         | "difficult" now, but only because the _actual usage_ in the
         | local LLM community doesn't have a reason to invest in ONNX,
         | and it's extremely intimidating to figure out how exactly to
         | get LLaMa-2 running in ONNX. Microsoft kinda threw it up on
         | GitHub and moved on, the sample code even still needs a PyTorch
         | model. I see at least one very small company on HuggingFace
         | that _may_ have figured out full ONNX.
         | 
         | Funnily enough, ONNX is getting a spike in mindshare over the
         | last month in the _Stable Diffusion_ community. There's decent
         | cross-pollination between local art and local LLMs, ex. LoRA's
         | were first a thing for Stable Diffusion. So I'm hoping we see
         | this sooner rather than later.
         | 
         | [1] https://github.com/microsoft/Llama-2-Onnx
         | 
         | [2] Definition of cross-platform matters a ton here, what I
         | mean is "I can import $ONNX_WRAPPER_LIB on iOS / Android / Mac
         | / Windows and call Llama2.reply(String prompt, ...)"
         | 
         | [3] Runs on somebody else's computer, where "somebody else" is
         | the user, instead of a cloud vendor.
        
           | homarp wrote:
           | you already have TVM for the cross platform stuff
           | 
           | see https://tvm.apache.org/docs/how_to/deploy/android.html
           | 
           | or https://octoml.ai/blog/using-swift-and-apache-tvm-to-
           | develop...
           | 
           | or https://github.com/mlc-ai/mlc-llm
        
         | kcorbitt wrote:
         | Yes, if you're just using Llama 2 off the shelf (without fine-
         | tuning) I don't think there are a lot of workloads where it
         | makes sense as a replacement for GPT-3.5. The one exception
         | being for organizations where data security is non-negotiable
         | and they really need to host on-prem. The calculus changes
         | drastically though when you bring fine-tuning in, which lets a
         | much smaller model outperform a larger one on many classes of
         | task.
         | 
         | Also, it's worth noting that Replicate started out with a focus
         | on image generation, and their current inference stack for LLMs
         | is extremely inefficient. A significant fraction of the 100x
         | cost difference you mentioned can be made up by using an
         | optimized inference server like vLLM. Replicate knows about
         | this and is working hard on improving their stack, it's just
         | really early for all of us. :)
        
           | bfirsh wrote:
           | Founder of Replicate here. It's early indeed.
           | 
           | OpenAI aren't doing anything magic. We're optimizing Llama
           | inference at the moment and it looks like we'll be able to
           | roughly match GPT 3.5's price for Llama 2 70B.
           | 
           | Running a fine-tuned GPT-3.5 is surprisingly expensive.
           | That's where using Llama makes a ton of sense. Once we've
           | optimized inference, it'll be much cheaper to run a fine-
           | tuned Llama.
        
         | MuffinFlavored wrote:
         | I thought Llama was opensource/free and you could run it
         | yourself?
        
           | loudmax wrote:
           | You can run the smaller Llama variants on consumer grade
           | hardware, but people typically rent GPUs from the cloud to
           | run the larger variants. It is possible to run even larger
           | variants on a beefy workstation or gaming rig, but the
           | performance on consumer hardware usually makes this
           | impractical.
           | 
           | So the comparison would be the cost of renting a cloud GPU to
           | run Llama vs querying ChatGPT.
        
             | ramesh31 wrote:
             | >So the comparison would be the cost of renting a cloud GPU
             | to run Llama vs querying ChatGPT.
             | 
             | Yes, and it doesn't even come close. Llama2-70b can run
             | inference at 300+tokens/s on a single V100 instance at
             | ~$0.50/hr. Anyone who can should be switching away from
             | OpenAI right now.
        
               | thewataccount wrote:
               | What's the best way to use LLama2-70b without existing
               | infrastructure for orchestrating it?
        
               | ramesh31 wrote:
               | >What's the best way to use LLama2-70b without existing
               | infrastructure for orchestrating it?
               | 
               | That's an exercise left to the reader for now, and is
               | where your value/moat lies.
        
               | thewataccount wrote:
               | > That's an exercise left to the reader for now, and is
               | where your value/moat lies.
               | 
               | Hopefully more on-demand services enter the space.
               | Currently where I am we don't have the resources for any
               | type of self orchestration and our use case is so
               | low/sporadic that we can't simply have a dedicated
               | instance.
               | 
               | Last I saw the current services were rather expensive but
               | I should recheck.
        
               | mjirv wrote:
               | I stumbled upon OpenRouter[0] a few days ago. Easiest
               | I've seen by far (if you want SaaS, not hosting it
               | yourself).
               | 
               | [0] https://openrouter.ai
        
           | axpy906 wrote:
            | Unfortunately, Llama 2 is not released under a fully open
            | source license.
        
           | thewataccount wrote:
            | You (currently) need a GPU to run any of the useful models. I
            | haven't really seen a business use case that runs it on the
            | user's computer, and given the hardware requirements it
            | wouldn't be very feasible to expect that.
           | 
           | So you'll have to figure out how to run/scale the model
           | inference. Cloud GPU instances are generally very expensive,
           | and once you start needing to horizontally scale it'll get
           | messy fast.
           | 
           | At least at the moment it's expensive, especially if it's
           | either very light usage or very intensive usage - you either
           | need just a few seconds of compute occasionally, or lots of
           | compute all the time requiring scaling.
           | 
           | The "lucky" ones in this scenario are small-medium businesses
           | that can use one or a few cards on-site for their traffic.
           | Even then when you take the cost of an A100 + maintaining it,
           | etc. OpenAI's offering still looks attractive.
           | 
           | I know there's a few services that try to provide an api
           | similar to what openai has, and some software to self
           | orchestrate it, I'm curious how those compare...
        
             | hereonout2 wrote:
             | > _once you start needing to horizontally scale it 'll get
             | messy fast._
             | 
             | It gets expensive fast, but not messy, these things scale
             | horizontally really well. All the state is encapsulated in
             | the request, no replication, synchronisation, user data to
             | worry about. I'd rather have the job of horizontally
             | scaling llama2 than a relational database.
        
               | thewataccount wrote:
               | For sure, and yeah it wouldn't be terrible you're right.
               | You'd just need the api servers + a load balancer.
               | 
               | My thing is that dynamically doing that is still a lot
               | compared to just calling a single endpoint and all of
               | that is handled for you.
               | 
               | But for sure this is a very decent horizontal use-case.
        
           | kuchenbecker wrote:
           | Compute costs money.
        
         | mrybczyn wrote:
          | Yes, OpenAI is dumping the market with ChatGPT 3.5. Vulture
          | capital behaviour at its finest, and I'm sure government
          | regulations will definitely catch on to this in 20 or 30
          | years...
          | 
          | It's cheaper than the ELECTRICITY cost of running a Llama 70B
          | on your own M1 Max (a very energy efficient chip), assuming
          | free hardware.
         | 
         | I guess they are also getting a pretty good cache hit rate -
         | there are only so many questions people ask at scale. But
         | still, it's dumping.
        
           | haxton wrote:
            | gpt3.5 turbo is (most likely) Curie, which is (most likely)
            | 6.7b params. So, yeah, it makes perfect sense that it can't
            | compete with a 70b model on cost.
        
             | why_only_15 wrote:
             | gpt3.5 turbo is a new model, not Curie. As others have
             | stated, it probably uses Mixture of Experts which lowers
             | inference cost.
        
             | jiggawatts wrote:
             | I thought it was fairly well established that GPT 3.5 has
             | something like 130B parameters and that GPT 4 is on the
             | order of 600-1,000
        
             | ronyfadel wrote:
             | It still does a much better job at translation than llama 2
             | 70b even, at 6.7b params
        
               | two_in_one wrote:
               | If it's MOE that may explain why it's faster and
               | better...
        
               | yumraj wrote:
               | MOE?
        
               | sarthaksrinivas wrote:
               | Mixture of Experts Model -
               | https://en.wikipedia.org/wiki/Mixture_of_experts
        
             | csjh wrote:
             | Is there a source on that? I've never seen anyone think
             | it's below even 70B
        
           | PUSH_AX wrote:
           | You think they are caching? Even though one of the parameters
           | is temperature? Can of worms, and should be reflected in the
           | pricing if true, don't get me started if they are charging
           | per token for cached responses.
           | 
           | I just don't see it.
        
             | why_only_15 wrote:
             | You can keep around the KV cache from previous generations
             | which lowers the cost of prompts significantly.
        
           | read_if_gay_ wrote:
           | turbo is likely nowhere near 70b.
        
           | sacred_numbers wrote:
           | Based on my research, GPT-3.5 is likely significantly smaller
           | than 70B parameters, so it would make sense that it's cheaper
           | to run. My guess is that OpenAI significantly overtrained
           | GPT-3.5 to get as small a model as possible to optimize for
           | inference. Also, Nvidia chips are way more efficient at
           | inference than M1 Max. OpenAI also has the advantage of
           | batching API calls which leads to better hardware
           | utilization. I don't have definitive proof that they're not
           | dumping, but economies of scale and optimization seem like
           | better explanations to me.
        
             | csjh wrote:
             | What makes you think 3.5 is significantly smaller than 70B?
        
             | hutzlibu wrote:
             | I also do not have proof of anything here, but can't it be
             | both?
             | 
             | They have lots of money now and the market lead. They want
             | to keep the lead and some extra electricity and hardware
             | costs are surely worth it for them, if it keeps the
             | competition from getting traction.
        
         | AnonymousPlanet wrote:
         | Cost isn't the only incentive not to use an LLM service that
         | resides in a foreign country. Around here, there are industries
         | for which it's pretty much a no-brainer to _avoid_ anything
         | that sends data across the atlantic.
        
           | unoti wrote:
           | Although it wouldn't surprise me if today's Azure OpenAI
           | offerings route to certain US-centric regions, I'd be very
           | surprised if Azure isn't working day and night to try to
           | provision OpenAI capacity everywhere they can in the world.
           | 
           | (Disclaimer: I work in the cloud organization at Microsoft,
           | and these are totally my own thoughts and opinions and don't
           | reflect any kind of inside knowledge I have. I think I can
           | say that provisioning LLM capacity and GPU's is something we
           | basically all have a tremendous amount of passion about.)
        
         | ttt3ts wrote:
         | You can run 70B LLAMA on dual 4090s/3090s with quantization.
         | Going with dual 3090s you can get a system that can run LLAMA 2
         | 70B with 12K context for < $2K.
         | 
          | I built two such systems after burning that much in a week on
          | ChatGPT.
        
           | zakki wrote:
            | Would you mind sharing all your PC hardware (mobo, case,
            | cooling, etc.) for this dual GPU configuration? Thanks.
        
         | octacat wrote:
          | Google Maps was also cheap. Initially. So it is aggressively
          | cheap now, but will aggressively change later.
        
         | Arctic_fly wrote:
         | > Llama 7B wasn't up to the task fyi, producing very poor
         | translations.
         | 
         | From what I've read and personally experimented with, none of
         | the Llama 2 models are well-suited to translation in particular
         | (they were mainly trained on English data). Still, there are a
         | number of tasks that they're really good at if fine-tuned
         | correctly, such as classification and data extraction.
         | 
         | > I believe that OpenAI priced GPT-3.5 aggressively cheap in
         | order to make it a non-brainer to rely on them rather than
         | relying on other vendors (even open source models).
         | 
         | I think you're definitely right about that, and in most cases
         | just using GPT 3.5 for one-off tasks makes the most sense. I
         | think when you get into production workflows that scale, that's
         | when using a small fine-tuned models starts making more sense.
         | You can drop the system prompt and get data in the format you'd
         | expect it in, and train on GPT-4's output to sometimes get
         | better accuracy than 3.5 would give you right off the bat. And
         | keep in mind, while you can do the same thing with a fine-tuned
         | 3.5 model, it's going to cost 8x the base 3.5 price per token.
        
           | kelseyfrog wrote:
           | Is that because translation is typically an encoder-decoder
           | task and llama is decoder only or is there something else
            | about it that makes the task difficult for Llama?
        
             | FeepingCreature wrote:
             | If you don't make it learn other-language texts, it won't
             | be able to speak that language.
        
         | brucethemoose2 wrote:
         | TBH, Replicate is not a great way to run 7B beyond
         | experimentation. You want a host with cheap consumer GPUs (like
         | vast.ai) since the 4-bit requirements are so modest.
         | 
         | You either need a backend with good batching support (vLLM), or
         | if you don't need much throughput, an extremely low end GPU or
         | no GPU at all for exLlama/llama.cpp.
         | 
         | OpenAI benefits from quantization/batching, optimized kernels
         | and very high utilization on their end, so the huge price gap
         | vs a default HF Transformers instance is understandable. But
         | even then, you are probably right about their aggressive
         | pricing.
         | 
          | As for quality, you need a Llama model fine-tuned on the target
          | language (many already exist on Huggingface) and possibly a
          | custom grammar if your backend supports it.
        
         | nborwankar wrote:
         | Llama and GPT are auto-regressive decoder only architectures
         | which for pure translation jobs are not the optimal
         | architectures. Training seq2seq models or encoder/decoder
         | models on datasets of sentence pairs designed for translation
         | will likely allow you to use much smaller models. You will not
         | be wasting parameters on general "language understanding"
         | capability that Llama and GPT have if pure translation is all
         | you need. T5 or Flan-T5 might be good starting points.
        
       | rookie123 wrote:
       | To all those who are on this panel, which is the most
       | comprehensive way a newbie can learn fine-tuning these models
       | with or without the GPUs?
       | 
       | Are there any well directed courses available?
        
         | kcorbitt wrote:
         | I wrote the notebooks in the post with the intention of them
         | being a gentle introduction to fine-tuning. Would love any
         | feedback on open questions you have as you go through them!
        
       | idosh wrote:
       | Can you elaborate on your plans for OpenPipe? Sounds like a very
       | interesting project
        
         | Arctic_fly wrote:
         | Currently OpenPipe allows you to capture input/output from a
         | powerful model and use it to fine-tune a much smaller one, then
         | offers you the option to host through OpenPipe or download it
         | and host it elsewhere. Models hosted on OpenPipe enjoy a few
         | benefits, like data drift detection and automatic reformatting
         | of output to match the original model you trained against
         | (think extraction "function call" responses from a purely
         | textual Llama 2 response) through the sdk.
         | 
         | Longer-term, we'd love to expand the selection of base models
         | to include specialized LLMs that are particularly good at a
         | certain task, e.g. language translation, and let you train off
         | of those as well. Providing a ton of specialized starting
         | models will decrease the amount of training data you need, and
         | increase the number of tasks at which fine-tuned models can
         | excel.
        
           | idosh wrote:
           | Thanks! I need to dive into the project and learn more.
           | Sounds exciting
        
       | binarymax wrote:
       | This looks awesome! Tangential question - do you find GPT
       | function calling to work consistently and without error, or do
       | you get errors when using it? By errors I mostly mean incorrect
       | function signatures/types or missing values...but if you see
       | other unpredictable behavior that would help too.
        
         | llwj wrote:
         | I see wrong responses about 1% of the time, but I love it,
         | considering parsing raw text output without function calling
         | had a much higher error rate.
        
         | Arctic_fly wrote:
         | I haven't had much trouble with GPT 3.5 or 4 function calls
         | returning in an undesirable format recently. I did get a few
         | bad syntax responses when OpenAI first rolled it out, but not
         | for the past few months.
         | 
         | Llama 2 can also pick the function call format up, given
         | sufficient training data that contains function call responses,
         | though you'll then have to parse the returned object out of the
         | text-based response.
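          | 
          | One rough way to do that parsing (illustrative only; it assumes
          | the fine-tuned model was trained to emit a JSON object
          | somewhere in its text response):
          | 
          | ```python
          | import json
          | import re
          | 
          | def extract_function_call(text: str):
          |     """Pull the first {...} block out of a model response and parse it."""
          |     match = re.search(r"\{.*\}", text, re.DOTALL)
          |     if match is None:
          |         return None
          |     try:
          |         return json.loads(match.group(0))
          |     except json.JSONDecodeError:
          |         return None
          | 
          | reply = 'Sure! {"name": "get_weather", "arguments": {"city": "Milwaukee"}}'
          | print(extract_function_call(reply))
          | ```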
        
       | facu17y wrote:
       | "to replace GPT-3.5/4"
       | 
       | Very inflated statement when it comes to GPT4 since it is a MoE
       | model with 8 separate models each an expert in one area, and you
       | can't replace all 8 models with one model trained for $19.
       | 
       | I call BS on this claim. Maybe it matches GPT4 in the narrow
       | domain you fine-tune it for, and if that can be done for $19 then
       | for $19*8 you can take OpenAI out of business. That doesn't add
       | up.
        
       | derekpankaew wrote:
       | Can you clarify the 50x cheaper number? Is this for self-hosting,
       | or if you're hosting on OpenPipe?
       | 
       | The pricing on OpenPipe says it's 0.0012 to 0.0016 per 1K tokens
       | for Llama 7b. GPT-3.5 pricing is 0.0015 to 0.002, so not that
       | different.
       | 
       | I'm assuming the 50x cost reductions are primarily from self-
       | hosting?
        
         | kcorbitt wrote:
          | Yep, the 50x cost reduction is if you self-host a fine-tuned
          | model using the setup demonstrated in the linked notebooks.
        
       | robot wrote:
       | for startups I guess this means nail your use case with gpt-4,
       | and when scaling cost becomes an issue consider fine tuning.
        
       | jesusofnazarath wrote:
        | Can't we have something for the command line that takes the form
        | of
        | 
        |     cat new_data.txt | finetune model.file > new_model.file
        
         | lgas wrote:
         | Sure, it would be trivial to turn the second notebook into a
         | script that behaves this way.
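          | 
          | Roughly, the shell of such a script might look like this (the
          | finetune_model function is a hypothetical placeholder standing
          | in for the notebook's actual training code):
          | 
          | ```python
          | #!/usr/bin/env python
          | import argparse
          | import sys
          | 
          | def finetune_model(base_model_path: str, training_text: str) -> bytes:
          |     """Placeholder: run fine-tuning and return the new model's bytes."""
          |     raise NotImplementedError
          | 
          | if __name__ == "__main__":
          |     parser = argparse.ArgumentParser(description="finetune model.file < data.txt")
          |     parser.add_argument("model_file")
          |     args = parser.parse_args()
          |     new_model = finetune_model(args.model_file, sys.stdin.read())
          |     sys.stdout.buffer.write(new_model)  # redirect to new_model.file
          | ```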
        
       | notShabu wrote:
       | This post made me think of human hierarchies. Line level ICs are
       | cheap because they are specialized and fine tuned. Leet code is a
       | way to roughly measure degree of fine-tuning even though it
       | doesn't accurately measure how well the fine tuning is for the
       | job.
       | 
       | As you go up the hierarchy what you want is higher quality
       | answers to more and more abstract and general questions.
       | 
       | AGI, God, CEOs, and figures like Paul Graham, Elon Musk etc.. all
       | answer to various degrees the ultimate abstract question of "What
       | is the meaning of _gestures wildly at everything_ "
       | 
       | Cost efficiency and commoditization basically increases "how"
       | capacity at the cost of "why" capacity
        
         | ftxbro wrote:
         | > AGI, God, CEOs, and figures like Paul Graham, Elon Musk
         | 
         | hacker news pantheon just dropped
        
       | jxf wrote:
       | Q: How did you arrive at the $23k figure for classifying 2M
       | examples using GPT-4?
        
         | kcorbitt wrote:
         | We ran 5K randomly selected recipes through GPT-4 and
         | extrapolated based on the average cost per query.
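          | 
          | The arithmetic behind that is simple (the sample spend below is
          | a placeholder chosen to match the stated $23k, not the actual
          | number from our run):
          | 
          | ```python
          | sample_size = 5_000
          | sample_spend_usd = 57.50       # placeholder: what the 5K GPT-4 queries cost
          | dataset_size = 2_000_000
          | 
          | cost_per_query = sample_spend_usd / sample_size
          | estimated_total = cost_per_query * dataset_size
          | print(f"~${estimated_total:,.0f} to label the full dataset")  # ~$23,000
          | ```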
        
           | jxf wrote:
           | Makes sense. Thank you!
        
       | [deleted]
        
       | avereveard wrote:
        | A 7B model will work for very specific cases, but it will have a
        | hard time drawing parallels between synonyms, so you'll need to
        | be extremely careful in building your fine-tuning samples.
        
       | tikkun wrote:
       | Looks really well executed, nice! I'd shared this idea with a few
       | people. GPT and other LLMs don't allow you to use their output to
       | train competing models, but the implication is that it's fine to
       | use their output to train your own internal alternative models.
       | So you can't sell access to the output as an API, but you can use
       | it to replace your GPT API calls.
       | 
       | My other thoughts to extend this are that you could make it
       | seamless. To start, it'll simply pipe the user's requests to
       | OpenAI or their existing model. So it'd be a drop in replacement.
       | Then, it'll every so often offer to the user - "hey we think at
       | this point there's enough data that a fine tune might save you
       | approx $x/month based on your current calls, click the button to
       | start the fine tune and we'll email you once we have the results"
       | - and then the user gets the email "here are the results, based
       | on that we recommend switching, click here to switch to calling
       | your fine-tuned model" - Helicone and the other monitoring
       | platforms could also offer something similar. (Side note I'm
       | working on an "ai infra handbook" aimed at technical people in
       | software orgs looking to deploy unspecified "AI" features and
       | trying to figure out what to do and what resources they'll need -
       | it's a 20+ page google doc, if anyone can help me review what I
       | have so far please let me know and I'll add you.)
       | 
        | _If_ it's latency/error/speed competitive, and cheaper, and
       | equivalently accurate, then for anyone doing production scale LLM
       | API usage it'd make sense to use something like this - either the
       | fine-tune is worse so you keep using the regular API, or the fine
       | tune has parity plus cost and/or speed advantage, so you switch.
       | (It wouldn't make sense for prototyping scale, because the
       | additional complexity of the switch wouldn't be worth it unless
       | it could save you 4/5 or more figures a year in API costs I'd
       | think.)
        
         | bambax wrote:
          | > _Side note I'm working on an "ai infra handbook" aimed at
         | technical people in software orgs looking to deploy unspecified
         | "AI" features and trying to figure out what to do and what
         | resources they'll need - it's a 20+ page google doc, if anyone
         | can help me review what I have so far please let me know and
         | I'll add you._
         | 
         | Interested in helping out.
        
         | kcorbitt wrote:
         | > My other thoughts to extend this are that you could make it
         | seamless. To start, it'll simply pipe the user's requests to
         | OpenAI or their existing model. So it'd be a drop in
         | replacement. Then, it'll every so often offer to the user -
         | "hey we think at this point there's enough data that a fine
         | tune might save you approx $x/month based on your current
         | calls, click the button to start the fine tune and we'll email
         | you once we have the results" - and then the user gets the
         | email "here are the results, based on that we recommend
         | switching, click here to switch to calling your fine-tuned
         | model"
         | 
         | You just described our short-term roadmap. :) Currently an
         | OpenPipe user has to explicitly kick off a fine-tuning job, but
         | they're so cheap to run we're planning on letting users opt in
         | to running them proactively once they have enough data so we
         | can provide exactly that experience.
        
         | jlm521 wrote:
          | I would like to help in reviewing your handbook.
        
         | NavinF wrote:
         | >GPT and other LLMs don't allow you to use their output to
         | train competing models
         | 
         | ToS is unenforceable and irrelevant to anyone that's in this
         | space
        
         | bongobingo1 wrote:
         | > GPT and other LLMs don't allow you to use their output to
         | train competing models
         | 
         | I didn't allow them to use my output to train theirs either,
         | _so fuck 'em_.
        
           | [deleted]
        
       | rrherr wrote:
       | "You do this by training an existing model on example
       | input/output pairs that demonstrate the task you want your fine-
       | tuned model to learn."
       | 
       | Are fine-tuning datasets required to be input/output pairs? Or
       | instead, can the fine-tuning be autoregressive (predict the next
       | token throughout this corpus of unlabeled documents)?
        
         | kcorbitt wrote:
         | There's no rule that your fine-tuning dataset needs to be split
         | into input/output pairs -- you can of course fine-tune a model
         | to just continue a sequence.
         | 
         | As a practical matter though, most of the fine-tuning
         | frameworks, including Axolotl (which this guide uses) and
         | HuggingFace's SFTTrainer (the actual fine-tuning trainer most
         | frameworks use under the hood) assume your data comes in
         | input/output pairs, and automatically inserts a separator token
         | to let the model know that the input has finished and it should
         | start generating the output. In general most tasks can be
         | formulated this way, including autocomplete tasks, so I'd
         | probably recommend going that way unless you have a very strong
         | reason not to.
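            | 
            | For example, one common way of flattening an input/output
            | pair into a single training string (the template below is
            | just an illustration; Axolotl and SFTTrainer have their own
            | configurable formats):
            | 
            | ```python
            | TEMPLATE = "### Input:\n{input}\n\n### Output:\n{output}"
            | 
            | def format_example(example: dict) -> str:
            |     # Everything before "### Output:" is the prompt; the model
            |     # learns to generate what comes after it.
            |     return TEMPLATE.format(input=example["input"],
            |                            output=example["output"])
            | 
            | print(format_example({"input": "Recipe: Vegan Lentil Curry ...",
            |                       "output": "vegetarian"}))
            | ```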
        
           | rrherr wrote:
           | "most tasks can be formulated this way, including
           | autocomplete tasks"
           | 
           | For autocomplete tasks, with a corpus of unlabeled documents,
           | would you insert a separator token at an arbitrary space in
           | each document, in order to form input/output pairs?
        
       | ingridpan wrote:
       | I found this tutorial helpful for getting started with fine-
       | tuning https://www.youtube.com/watch?v=74NSDMvYZ9Y
       | 
        | This guy used gradient.ai and he has a Google Colab to try it.
        
       ___________________________________________________________________
       (page generated 2023-09-12 23:00 UTC)