[HN Gopher] Beating GPT-4 on HumanEval with a fine-tuned CodeLla...
___________________________________________________________________

Beating GPT-4 on HumanEval with a fine-tuned CodeLlama-34B

Hi HN,

We have fine-tuned CodeLlama-34B and CodeLlama-34B-Python on an
internal Phind dataset, achieving 67.6% and 69.5% pass@1 on
HumanEval, respectively. GPT-4 achieved 67%. To ensure result
validity, we applied OpenAI's decontamination methodology to our
dataset.

The CodeLlama models released yesterday demonstrate impressive
performance on HumanEval:

- CodeLlama-34B achieved 48.8% pass@1 on HumanEval
- CodeLlama-34B-Python achieved 53.7% pass@1 on HumanEval

We have fine-tuned both models on a proprietary dataset of ~80k
high-quality programming problems and solutions. Instead of code
completion examples, this dataset features instruction-answer
pairs, setting it apart structurally from HumanEval. We trained
the Phind models over two epochs, for a total of ~160k examples.
LoRA was not used -- both models underwent full native
fine-tuning. We employed DeepSpeed ZeRO 3 and Flash Attention 2
to train these models in three hours on 32 A100-80GB GPUs, with a
sequence length of 4096 tokens.

Furthermore, we applied OpenAI's decontamination methodology to
our dataset to ensure valid results, and found no contaminated
examples. The methodology is:

- For each evaluation example, we randomly sampled three
  substrings of 50 characters, or used the entire example if it
  was fewer than 50 characters.
- A match was identified if any sampled substring was a substring
  of the processed training example.

For further insights on the decontamination methodology, please
refer to Appendix C of OpenAI's technical report.
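The two-bullet rule above can be sketched in a few lines of
Python. This is only an illustration of the described check, not
Phind's or OpenAI's actual tooling; the function name, seed
handling, and parameters are assumptions:

```python
import random

def is_contaminated(eval_example: str, train_example: str,
                    n_samples: int = 3, substr_len: int = 50,
                    seed: int = 0) -> bool:
    """Substring-matching decontamination rule as described above:
    sample three 50-character substrings of the evaluation example
    (or use the whole example if it is shorter than 50 characters),
    and flag a match if any sample occurs verbatim in the training
    example."""
    rng = random.Random(seed)
    if len(eval_example) < substr_len:
        samples = [eval_example]
    else:
        samples = [
            eval_example[start:start + substr_len]
            for start in (
                rng.randrange(len(eval_example) - substr_len + 1)
                for _ in range(n_samples)
            )
        ]
    return any(s in train_example for s in samples)
```

A training example that matches any evaluation example under this
rule would be dropped from the fine-tuning set.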
Presented below are the pass@1 scores we achieved with our
fine-tuned models:

- Phind-CodeLlama-34B-v1 achieved 67.6% pass@1 on HumanEval
- Phind-CodeLlama-34B-Python-v1 achieved 69.5% pass@1 on HumanEval

Note on GPT-4

In the official technical report in March, OpenAI reported a
pass@1 score of 67% for GPT-4's performance on HumanEval. Since
then, there have been claims of higher scores, but there hasn't
been any concrete evidence of an improvement in the model's
coding abilities. These elevated figures also lack the rigorous
contamination analysis that the official statistic underwent,
making them a less reliable comparison. As a result, we consider
67% the pass@1 score for GPT-4.

Download

We are releasing both models on Huggingface for verifiability and
to bolster the open-source community. We welcome independent
verification of results.

Phind-CodeLlama-34B-v1:
https://huggingface.co/Phind/Phind-CodeLlama-34B-v1

Phind-CodeLlama-34B-Python-v1:
https://huggingface.co/Phind/Phind-CodeLlama-34B-Python-v1

We'd love to hear your thoughts!

Best,
The Phind Team

Author : rushingcreek
Score  : 97 points
Date   : 2023-08-25 22:08 UTC (51 minutes ago)

(HTM) web link (www.phind.com)
(TXT) w3m dump (www.phind.com)

| bfogelman wrote:
| Glad this work is happening! That said, HumanEval as the current
| gold standard for benchmarking models is a crime. The dataset
| itself is tiny (around 150 examples) and the problems themselves
| aren't really indicative of actual software engineering
| problems. Also, we've been able to get around 85% pass@1 on
| GPT-4 internally as of a couple of weeks ago. It's hard to say
| if they've contaminated the models with RLHF though. It's still
| exciting how close we're getting with open-source models, but
| we've still got a decent amount of work to go!
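For readers unfamiliar with the metric being compared throughout
this thread: when n samples are drawn per problem (as with
temperature sampling), pass@k is usually computed with the
unbiased estimator from OpenAI's Codex paper rather than the raw
pass fraction. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from OpenAI's Codex paper:
    n = samples generated per problem, c = samples that pass the
    unit tests, k = evaluation budget.
    pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer failures than the budget: some sample must pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is this quantity averaged over all problems;
with a single greedy sample per problem (n = k = 1) it reduces to
the plain fraction of problems solved.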
| bfogelman wrote:
| One thing I'd be curious to see is how well this translates to
| things outside of HumanEval! How does it compare to using
| ChatGPT, for example?
| rushingcreek wrote:
| Yes -- we're being careful with our claims here. This model is
| not yet necessarily a better coding model overall, but it's
| strong on Python.
| |
| We're working hard to use these advances to make models that
| are production-ready. One such idea is to run a mixture of
| experts on various fine-tuned CodeLlamas.
| DigitalNoumena wrote:
| I think the issue of test-set contamination is important, but
| it's academic - when a model contains a good enough distilled
| representation of arguably all the code out there, does it
| really matter whether it can generalise OOD?
| |
| Realistically, how many of the practical use cases where it'll
| be applied will be OOD? If you can take GPT-4 there, then you
| are either a genius or working on something extremely novel, so
| why use GPT-4 in the first place?
| |
| I understand the goal is for LLMs to get there, but the
| majority of practical applications just don't need that.
| vikp wrote:
| Did you use the same pass@1 generation method as in the Code
| Llama paper (greedy decoding)? I couldn't find this in the blog
| post.
| rushingcreek wrote:
| We used sampling with temperature=0.1. Reproduction details can
| be found on the Huggingface model card:
| https://huggingface.co/Phind/Phind-CodeLlama-34B-v1
| vikp wrote:
| Got it, thanks - and thanks for the model! I'd be interested
| in the results if anyone benchmarks without sampling.
| |
| Edit: it could also be misleading to directly compare HumanEval
| pass@1 against CodeLlama without the same generation
| methodology (possibly against GPT-4, also, but I don't know
| their methodology).
| ocolegro wrote:
| nice result!
| I've been looking into benchmarking models recently; it would
| be interesting to run your model through the same battery of
| tests
| [https://github.com/emrgnt-cmplxty/zero-shot-replication/blob...]
| tyhjnntyny wrote:
| This stuff is accelerating at an alarming pace. I know of three
| or four off-the-shelf solutions to self-host and run models
| now. I have learned so much, and continue to learn, about how
| this stuff works.
| Jagerbizzle wrote:
| Any recommendations for learning material?
| lemonlym wrote:
| This is impressive, so congratulations. I also think this is a
| great example of how open source can simply lead to faster and
| better acceleration.
| rikafurude21 wrote:
| I've used GPT-4 for pretty much all of my programming needs,
| and the convenience of a 20-dollar subscription taking care of
| everything and letting me use an LLM without having to set up
| any models or servers has been just so simple. Is the 2 percent
| gain worth looking into running a local model again? I tried
| running a local model a couple of months ago, but the
| performance was bad. I know Code Llama came out very recently,
| but does anyone have any thoughts on performance regarding
| programming tasks compared to GPT-4?
| cube2222 wrote:
| Congrats!
| |
| I've played with Phind a few times, and it's definitely one of
| the cooler products to come out of the LLM boom.
| Fischgericht wrote:
| Are you planning to switch to programming-language-optimized
| models inside Phind? So, if a user is asking for something
| related to Python, the Python-optimized model gets used?
| |
| If so:
| |
| The Object Pascal language is completely out of fashion, and
| the most non-hyped language there is. However, there are
| hundreds of thousands of active users of Delphi, FreePascal and
| Lazarus. And because the language has been stable for over 20
| years, there is also a gigantic amount of highest-quality code
| available.
| As most of it is neither on GitHub nor Stack Overflow, Pascal
| code is dramatically underrepresented in GPT-3.5 and GPT-4 -
| and therefore also in Phind.
| |
| I'd like to finally be able to use AI-assisted programming with
| Pascal.
| |
| In case you are interested in that, I would be willing to
| internally pay for the work to prepare a good dataset of
| high-quality code with comments/context/prompts.
| |
| If you are not interested, is there any chance that you are
| going to release the code and toolchain used to fine-tune
| CodeLlama, so I could do it myself?
| rushingcreek wrote:
| Yes, this is the direction we're heading! We're building a
| mixture of experts of different coding models that we will
| deploy for precisely this use case.
| Fischgericht wrote:
| Nice!
| |
| I suppose that Pascal is not on your planned list of supported
| languages, right?
| behnamoh wrote:
| Why would it be? Do you know how much it costs to fine-tune one
| of these models for such a niche language? I'm not just talking
| about the cost of training, but also the cost of acquiring
| data, because there's much less data about niche languages.
| jacquesm wrote:
| > I would be willing to internally pay for the work
| |
| What kind of budget do you think this will require?
| Fischgericht wrote:
| Not much, I guess. It's basically writing some scripts that
| will take the code base of some of the available high-quality
| Pascal projects and then, depending on what is available,
| extract/merge documentation available as PDF, PasDoc, RTF,
| .HLP, or method/function source-code comments.
| |
| I would assume that one of my devs could write the needed
| scripts in three weeks or so. So, basically a budget of <$5000.
| |
| For me - due to missing competence - the actual challenge would
| be to get a sample of how training data should optimally look
| (for example, the Python training set), and someone doing the
| actual training.
| For a newbie, getting up to the required level of competence
| will surely take more than three weeks.
| gojomo wrote:
| Does this page have a bunch of blue-text runs that aren't
| links?
| sp332 wrote:
| Yes. I think it's just a highlight. The HTML doesn't look like
| it's trying to be a link or anything.
| rushingcreek wrote:
| It's not supposed to be a link, but I see how it'd be
| confusing. Will fix.
| tomr75 wrote:
| What hardware could one run this on locally? Probably need a
| 3090 etc., right?
| |
| Finally a reason to upgrade from my M1 Max.
| 1024core wrote:
| > proprietary dataset
| |
| This reads more like advertising copy than an HN article.
___________________________________________________________________
(page generated 2023-08-25 23:00 UTC)