[HN Gopher] Beating GPT-4 on HumanEval with a fine-tuned CodeLla...
       ___________________________________________________________________
        
       Beating GPT-4 on HumanEval with a fine-tuned CodeLlama-34B
        
        Hi HN,

        We have fine-tuned CodeLlama-34B and CodeLlama-34B-Python on an
        internal Phind dataset, achieving 67.6% and 69.5% pass@1 on
        HumanEval, respectively. GPT-4 achieved 67%. To ensure result
        validity, we applied OpenAI's decontamination methodology to our
        dataset.

        The CodeLlama models released yesterday demonstrate impressive
        performance on HumanEval:

        - CodeLlama-34B achieved 48.8% pass@1 on HumanEval
        - CodeLlama-34B-Python achieved 53.7% pass@1 on HumanEval

        We fine-tuned both models on a proprietary dataset of ~80k
        high-quality programming problems and solutions. Instead of
        code-completion examples, this dataset features
        instruction-answer pairs, setting it apart structurally from
        HumanEval. We trained the Phind models over two epochs, for a
        total of ~160k examples. LoRA was not used -- both models
        underwent a native fine-tune. We employed DeepSpeed ZeRO 3 and
        Flash Attention 2 to train these models in three hours on 32
        A100-80GB GPUs, with a sequence length of 4096 tokens.

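        As a rough illustration of this kind of setup (not our exact
        configuration -- the dataset path, field names, batch sizes,
        and hyperparameters below are placeholders), a ZeRO 3 full
        fine-tune with Hugging Face Transformers and DeepSpeed looks
        roughly like this:

          # Sketch of a ZeRO 3 full fine-tune on instruction-answer
          # pairs. Placeholder values throughout, not our actual config.
          # Launch with: deepspeed --num_gpus <N> train.py
          import torch
          from datasets import load_dataset
          from transformers import (AutoModelForCausalLM, AutoTokenizer,
                                     DataCollatorForLanguageModeling,
                                     Trainer, TrainingArguments)

          MODEL = "codellama/CodeLlama-34b-hf"  # or the -Python variant
          tokenizer = AutoTokenizer.from_pretrained(MODEL)
          tokenizer.pad_token = tokenizer.eos_token

          model = AutoModelForCausalLM.from_pretrained(
              MODEL,
              torch_dtype=torch.bfloat16,
              # Needs a recent transformers plus the flash-attn package:
              attn_implementation="flash_attention_2",
          )

          def tokenize(example):
              # Hypothetical field names for instruction-answer pairs.
              text = example["instruction"] + "\n" + example["answer"]
              return tokenizer(text, truncation=True, max_length=4096)

          ds = load_dataset("json", data_files="train.jsonl")["train"]
          ds = ds.map(tokenize, remove_columns=ds.column_names)

          args = TrainingArguments(
              output_dir="finetuned-codellama",
              num_train_epochs=2,
              per_device_train_batch_size=1,   # placeholder
              gradient_accumulation_steps=8,   # placeholder
              learning_rate=2e-5,              # placeholder
              bf16=True,
              deepspeed="ds_zero3.json",       # your ZeRO stage 3 config
              logging_steps=10,
              save_strategy="epoch",
          )

          Trainer(model=model, args=args, train_dataset=ds,
                  data_collator=DataCollatorForLanguageModeling(
                      tokenizer, mlm=False)).train()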
        Furthermore, we applied OpenAI's decontamination methodology to
        our dataset to ensure valid results, and found no contaminated
        examples.

        The methodology is:

        - For each evaluation example, we randomly sampled three
          substrings of 50 characters, or used the entire example if it
          was fewer than 50 characters.
        - A match was identified if any sampled substring was a
          substring of the processed training example.

        For further insights on the decontamination methodology, please
        refer to Appendix C of OpenAI's GPT-4 technical report.
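
        As a rough illustration (a simplified stand-in, not our exact
        implementation), the substring-based check above can be
        sketched as:

          # Simplified sketch of the substring-based decontamination
          # check described above; not our exact implementation.
          import random

          def is_contaminated(eval_example, train_examples,
                              n_samples=3, length=50):
              """Flag an evaluation example if any sampled substring
              of it appears verbatim in any training example."""
              if len(eval_example) < length:
                  probes = [eval_example]
              else:
                  probes = []
                  for _ in range(n_samples):
                      start = random.randrange(
                          len(eval_example) - length + 1)
                      probes.append(eval_example[start:start + length])
              return any(probe in train
                         for train in train_examples
                         for probe in probes)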

        Presented below are the pass@1 scores we achieved with our
        fine-tuned models:

        - Phind-CodeLlama-34B-v1 achieved 67.6% pass@1 on HumanEval
        - Phind-CodeLlama-34B-Python-v1 achieved 69.5% pass@1 on
          HumanEval

        Note on GPT-4

        In its official technical report in March, OpenAI reported a
        pass@1 score of 67% for GPT-4 on HumanEval. Since then, there
        have been claims of higher scores. However, there is no
        concrete evidence of an improvement in the model's coding
        abilities since then, and those higher figures lack the
        rigorous contamination analysis that the official statistic
        underwent, making them a less reliable comparison. As a
        result, we consider 67% the pass@1 score for GPT-4.

        Download

        We are releasing both models on Hugging Face for verifiability
        and to bolster the open-source community. We welcome
        independent verification of the results.

        Phind-CodeLlama-34B-v1:
        https://huggingface.co/Phind/Phind-CodeLlama-34B-v1

        Phind-CodeLlama-34B-Python-v1:
        https://huggingface.co/Phind/Phind-CodeLlama-34B-Python-v1

        We'd love to hear your thoughts!

        Best,
        The Phind Team
        
       Author : rushingcreek
       Score  : 97 points
       Date   : 2023-08-25 22:08 UTC (51 minutes ago)
        
 (HTM) web link (www.phind.com)
 (TXT) w3m dump (www.phind.com)
        
       | bfogelman wrote:
       | Glad this work is happening! That said, HumanEval as the current
       | gold standard for benchmarking models is a crime. The dataset
        | itself is tiny (around 150 examples) and all the problems
       | themselves aren't really indicative of actual software
       | engineering problems. Also, we've been able to get around 85%
       | pass@1 on GPT-4 internally as of a couple weeks ago. It's hard to
       | say if they've contaminated the models with RLHF though. It still
       | is exciting how close we're getting with open source models but
       | we've still got a decent amount of work to go!
        
         | bfogelman wrote:
         | One thing I'd be curious to see is how well this translates to
         | things outside of HumanEval! How does it compare to using
          | ChatGPT, for example?
        
         | rushingcreek wrote:
         | Yes -- we're being careful with our claims here. This model is
         | not yet necessarily a better coding model overall, but it's
         | strong on Python.
         | 
         | We're working hard to use these advances to make models that
         | are production ready. One such idea is to run a mixture of
         | experts on various fine-tuned CodeLlamas.
        
         | DigitalNoumena wrote:
         | I think the issue of test set contamination is important, but
         | it's academic - when a model contains a good enough distilled
         | representation of arguably all the code out there, does it
         | really matter whether it can generalise OOD?
         | 
          | Realistically, how many of the practical use cases where
          | it'll be applied will be OOD? If you can take GPT-4 there,
          | then you are either a genius or working on something
          | extremely novel, so why use GPT-4 in the first place?
         | 
         | I understand the goal is for LLMs to get there, but the
         | majority of practical applications just don't need that.
        
       | vikp wrote:
       | Did you use the same pass@1 generation method as in the code
       | llama paper (greedy decoding)? I couldn't find this in the blog
       | post.
        
         | rushingcreek wrote:
         | We used sampling with temperature=0.1. Reproduction details can
         | be found on the Huggingface model card:
         | https://huggingface.co/Phind/Phind-CodeLlama-34B-v1
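          | 
          | Roughly, a pass@1 run with this kind of sampling looks like
          | the sketch below (simplified and illustrative -- the exact
          | prompt format and decoding parameters are on the model
          | card):
          | 
          |   # Simplified, illustrative sketch of a HumanEval pass@1
          |   # run with sampling at temperature=0.1; see the model
          |   # card for the exact prompt format and parameters.
          |   import torch
          |   from human_eval.data import read_problems, write_jsonl
          |   from transformers import (AutoModelForCausalLM,
          |                              AutoTokenizer)
          | 
          |   MODEL = "Phind/Phind-CodeLlama-34B-v1"
          |   tokenizer = AutoTokenizer.from_pretrained(MODEL)
          |   model = AutoModelForCausalLM.from_pretrained(
          |       MODEL, torch_dtype=torch.bfloat16, device_map="auto")
          | 
          |   samples = []
          |   for task_id, problem in read_problems().items():
          |       inputs = tokenizer(problem["prompt"],
          |                          return_tensors="pt").to(model.device)
          |       out = model.generate(**inputs, do_sample=True,
          |                            temperature=0.1,
          |                            max_new_tokens=384)
          |       new_tokens = out[0][inputs["input_ids"].shape[1]:]
          |       completion = tokenizer.decode(new_tokens,
          |                                     skip_special_tokens=True)
          |       # In practice you'd also truncate at stop sequences.
          |       samples.append({"task_id": task_id,
          |                       "completion": completion})
          | 
          |   write_jsonl("samples.jsonl", samples)
          |   # Score with OpenAI's harness:
          |   #   evaluate_functional_correctness samples.jsonl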
        
           | vikp wrote:
           | Got it, thanks - and thanks for the model! I'd be interested
           | in the results if anyone benchmarks without sampling.
           | 
            | Edit: it could also be misleading to directly compare
            | HumanEval pass@1 against CodeLlama without the same
            | generation methodology (and possibly against GPT-4 too,
            | but I don't know their methodology).
        
       | ocolegro wrote:
       | nice result! I've been looking into benchmarking models recently,
        | it would be interesting to run your model through the same
       | battery of tests [https://github.com/emrgnt-cmplxty/zero-shot-
       | replication/blob...]
        
       | tyhjnntyny wrote:
        | This stuff is accelerating at an alarming pace. I know of
        | three or four off-the-shelf solutions to self-host and run
        | models now. I have learned so much, and continue to learn,
        | about how this stuff works.
        
         | Jagerbizzle wrote:
         | Any recommendations for learning material?
        
       | lemonlym wrote:
       | This is impressive, so congratulations. I also think that this is
       | a great example of how open source can simply lead to faster and
       | better acceleration.
        
       | rikafurude21 wrote:
        | I've used GPT-4 for pretty much all of my programming needs,
        | and the convenience of a 20 dollar subscription taking care
        | of everything and letting me use an LLM without having to
        | set up any models or servers has been just so simple. Is the
        | 2 percent gain worth looking into running a local model
        | again? I tried running a local model a couple months ago but
        | the performance was bad. I know Code Llama came out very
        | recently, but does anyone have any thoughts on its
        | performance on programming tasks compared to GPT-4?
        
       | cube2222 wrote:
       | Congrats!
       | 
       | I've played with Phind a few times, and it's definitely one of
       | the cooler products to come out of the LLM boom.
        
       | Fischgericht wrote:
        | Are you planning to switch to programming-language-optimized
        | models inside Phind? So that if a user is asking for something
        | related to Python, the Python-optimized model gets used?
       | 
       | If so:
       | 
       | The Object Pascal language is completely out of fashion, and the
       | most non-hyped language there is. However, there are hundreds of
       | thousands of active users of Delphi, FreePascal and Lazarus. And
       | due to the language being stable for over 20 years, there also is
       | a gigantic amount of highest-quality code available. As most of
       | it is neither on Github nor StackOverflow, Pascal code is
        | dramatically underrepresented in GPT-3.5 and GPT-4 - and therefore
       | also in Phind.
       | 
       | I'd like to finally be able to use AI-assisted programming with
       | Pascal.
       | 
       | In case you are interested in that, I would be willing to
       | internally pay for the work to prepare a good dataset of high
       | quality code with comments/context/prompts.
       | 
       | If you are not interested, is there any chance that you are going
       | to release the code and toolchain used to fine-tune CodeLlama, so
       | I could do it myself?
        
         | rushingcreek wrote:
         | Yes, this is the direction we're heading towards! We're
         | building a mixture of experts of different coding models that
         | we will deploy for precisely this use case.
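          | 
          | As a toy illustration (hypothetical routing logic and model
          | lineup, not our actual system), per-language dispatch could
          | look something like:
          | 
          |   # Toy per-language routing across fine-tuned coding
          |   # models. Names and detection logic are hypothetical.
          |   EXPERTS = {
          |       "python": "Phind/Phind-CodeLlama-34B-Python-v1",
          |       "default": "Phind/Phind-CodeLlama-34B-v1",
          |   }
          | 
          |   def detect_language(query):
          |       # Placeholder: a real router might use a classifier.
          |       q = query.lower()
          |       if "python" in q or "pandas" in q:
          |           return "python"
          |       return "default"
          | 
          |   def pick_expert(query):
          |       return EXPERTS.get(detect_language(query),
          |                          EXPERTS["default"])
          | 
          |   print(pick_expert("Write a Python function to parse JSON"))
          |   # -> Phind/Phind-CodeLlama-34B-Python-v1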
        
           | Fischgericht wrote:
           | Nice!
           | 
           | I suppose that Pascal is not on your planned list of
           | supported languages, right?
        
             | behnamoh wrote:
              | Why would it be? Do you know how much it costs to fine-tune one
             | of these models for such a niche language? I'm not just
             | talking about the cost of training, but also the cost of
             | acquiring data because there's much less data about niche
             | languages.
        
         | jacquesm wrote:
         | > I would be willing to internally pay for the work
         | 
         | What kind of budget do you think this will require?
        
           | Fischgericht wrote:
            | Not much, I guess. It's basically writing some scripts
            | that take the code bases of some of the available
            | high-quality Pascal projects and then, depending on what
            | is available, extract/merge documentation from PDF,
            | PasDoc, RTF, .HLP, or method/function source code
            | comments.
           | 
           | I would assume that one of my devs could write the needed
           | scripts in three weeks or so.
           | 
           | So, basically a budget of <$5000.
           | 
            | For me - due to missing competence - the actual challenge
            | would be getting a sample of how the training data should
            | optimally look (for example, the Python training set),
            | and finding someone to do the actual training. For a
            | newbie, getting up to the required level of competence
            | will surely take more than three weeks.
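            | 
            | To make it concrete, a first pass over source comments
            | could be something like the sketch below (hypothetical;
            | the file layout and regex would need tuning per project):
            | 
            |   # Hypothetical sketch: pair documented Pascal
            |   # declarations with their leading comment blocks as
            |   # instruction/answer-style examples.
            |   import json
            |   import re
            |   from pathlib import Path
            | 
            |   # Naive match: a { ... } or (* ... *) comment directly
            |   # followed by a procedure/function declaration.
            |   PATTERN = re.compile(
            |       r"(\{[^}]*\}|\(\*.*?\*\))\s*"
            |       r"((?:procedure|function)\s+\w+.*?;)",
            |       re.IGNORECASE | re.DOTALL)
            | 
            |   def extract_pairs(root):
            |       for path in Path(root).rglob("*.pas"):
            |           source = path.read_text(errors="ignore")
            |           for comment, decl in PATTERN.findall(source):
            |               yield {"instruction": comment.strip("{}(*) \n"),
            |                      "answer": decl.strip()}
            | 
            |   with open("pascal_pairs.jsonl", "w") as out:
            |       for pair in extract_pairs("path/to/pascal/projects"):
            |           out.write(json.dumps(pair) + "\n")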
        
       | gojomo wrote:
       | Does this page have a bunch of blue-text runs that aren't links?
        
         | sp332 wrote:
         | Yes. I think it's just a highlight. The HTML doesn't look like
         | it's trying to be a link or anything.
        
           | rushingcreek wrote:
           | it's not supposed to be a link, but I see how it'd be
           | confusing. will fix
        
       | tomr75 wrote:
       | what hardware could one run this locally on? probably need 3090
       | etc right?
       | 
       | finally a reason to upgrade from my m1 max
        
       | 1024core wrote:
       | > proprietary dataset
       | 
       | This reads more like advertising copy than an HN article.
        
       ___________________________________________________________________
       (page generated 2023-08-25 23:00 UTC)