[HN Gopher] Refact LLM: New 1.6B code model reaches 32% HumanEva...
       ___________________________________________________________________
        
       Refact LLM: New 1.6B code model reaches 32% HumanEval and is SOTA
       for the size
        
       Author : kateklink
       Score  : 154 points
       Date   : 2023-09-04 16:13 UTC (6 hours ago)
        
 (HTM) web link (refact.ai)
 (TXT) w3m dump (refact.ai)
        
       | iFire wrote:
       | LICENSE
       | 
       | bigscience-openrail-m
       | 
       | https://huggingface.co/smallcloudai/Refact-1_6B-fim/blob/mai...
        
         | [deleted]
        
       | palmer_fox wrote:
       | All these LLMs are pretty general if I understand correctly. Are
       | there any efforts to create specialized models (other than for
       | coding)? Or, what would be even better, "extract" certain areas
       | from existing LLMs as a way to specialize them? With the goal to
       | drastically reduce model size to be able to run on less powerful
       | devices.
       | 
        | E.g. a model specializing in chemistry doesn't need to include
        | data on the world's history or to be able to write poetry.
        
         | hnhg wrote:
         | I am not an expert but it still has to learn human
         | language/grammar/whathaveyou, and that is where scale seems to
         | matter. Fine-tuning on a subset of knowledge after that is
         | typically how the domain-specialisation is achieved, by my
         | understanding.
        
           | charcircuit wrote:
           | Domain specialization is done by continuing the full training
           | process. Fine tuning is more for changing the style of the
           | output than adding new knowledge.
        
             | palmer_fox wrote:
             | What if the initial training already contains all necessary
             | data for a particular specialization? What would be the
             | benefit of continuing the training process?
        
               | viraptor wrote:
               | Imagine someone tells you about how someone committed a
               | crime and asks you to summarise. Now imagine the same
               | question is asked to a lawyer. Even if you both knew the
               | same facts, the response would be very different in
               | style, highlighted points, mentioned references, etc. The
               | domain specific fine tuning does exactly that. Sure,
               | sometimes you can get very close by changing the prompt
               | to include "respond like a lawyer in situation X with
               | following extra rules", but not always and the fine-
               | tuning gives better results and shorter prompt.
        
           | palmer_fox wrote:
           | I was wondering about that too. Would it be possible in the
           | future to have a more modular approach to LLMs? Have a module
           | that is responsible for basic knowledge/language/grammar and
           | then other more specialized modules that are added
           | selectively.
           | 
            | I don't know enough about fine-tuning; I'm not sure if the
            | process is capable of removing "unused" parts of the model
            | (I guess that's not possible, similar to un-learning).
        
             | lucubratory wrote:
             | There are various methods for removing unused parts of the
             | model, like distillation. The idea is generally that you
             | always lose performance, but hopefully you lose more
             | size/runcost than you do performance, proportionately.
        
         | swyx wrote:
         | so, so many. there are RAG specific models (contextual ai),
         | finance specific models (bloomberg gpt, brightwave), contact
         | center models (cresta), even telco models (anthropic).
        
           | palmer_fox wrote:
           | Very interesting. Thanks for replying!
        
       | notsahil wrote:
        | Model Stats:
        | 
        | - Architecture: LLAMA-like model with multi-query attention
        | - Objectives: Fill-in-the-Middle, Chat
        | - Tokens context: 4096
        | - Pretraining tokens: 1.2T
        | - Finetuning tokens: 40B
        | - Precision: bfloat16
        | - GPUs: 64 NVidia A5000
        | - Training time: 28 days
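        | 
        | A minimal sketch of loading the weights for fill-in-the-middle
        | completion with Hugging Face transformers; the FIM special
        | tokens below are an assumption (StarCoder-style), so check the
        | model card for the exact format:
        | 
        |   import torch
        |   from transformers import AutoModelForCausalLM, AutoTokenizer
        |   
        |   ckpt = "smallcloudai/Refact-1_6B-fim"
        |   tok = AutoTokenizer.from_pretrained(ckpt)
        |   model = AutoModelForCausalLM.from_pretrained(
        |       ckpt, trust_remote_code=True,
        |       torch_dtype=torch.bfloat16)
        |   
        |   # FIM prompt: give the model a prefix and a suffix,
        |   # it generates the middle
        |   prompt = ("<fim_prefix>def fib(n):\n"
        |             "<fim_suffix>\n    return b<fim_middle>")
        |   ids = tok(prompt, return_tensors="pt")
        |   out = model.generate(**ids, max_new_tokens=48)
        |   print(tok.decode(out[0]))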
        
         | [deleted]
        
       | brucethemoose2 wrote:
       | One misleading thing is the notion that you need a 1-2B model to
       | run on commodity hardware.
       | 
        | This is not really true. Llama 7B runs with Vulkan/llama.cpp on
        | ~8GB smartphones and ~12GB laptops, and it's only going to get
        | easier over time as lower-RAM hardware drops out of the market
        | and Vulkan implementations become more widespread.
        | 
        | For users trying to run LLMs on machines with 8GB or less, the
        | AI Horde approach of distributed models seems much more
        | practical anyway.
        
         | jmorgan wrote:
         | This is true! Although I'm also really excited at the potential
         | speed (both for loading the model and token generation) of a 1B
         | model for things like code completion.
        
         | naillo wrote:
         | 7b runs on my 4gb vram machine (8gb memory). I.e. quantization
         | helps a lot too
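          | 
          | As an illustration (not necessarily this exact setup), a
          | 4-bit quantized 7B GGUF can be loaded with llama-cpp-python
          | and partially offloaded to a small GPU; the model path and
          | layer count below are assumptions:
          | 
          |   from llama_cpp import Llama
          |   
          |   # hypothetical 4-bit 7B model, roughly 4GB on disk
          |   llm = Llama(model_path="llama-2-7b.Q4_K_M.gguf",
          |               n_ctx=2048,
          |               n_gpu_layers=20)  # offload what fits in VRAM
          |   
          |   out = llm("def quicksort(arr):", max_tokens=64)
          |   print(out["choices"][0]["text"])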
        
         | smcleod wrote:
         | Yeah but I remember thinking to myself every few years that
         | surely next year will be the year that base model machines
         | start at 32/64/... GB - but alas, it's nearly the end of 2023
         | and your average computer still seems stuck on a measly 16GB! I
         | don't think average RAM size on consumer machines has increased
          | at all in the last ~8 years or so.
        
           | Retric wrote:
           | It actually kind of makes sense.
           | 
            | RAM is only about 6x the speed of SSDs for sequential
            | access. Most people don't actually need truly random access
            | to all that much data; they're mostly streaming video or
            | loading video game assets onto their GPU. So they shift
            | spending to other components, like the video card, monitor,
            | etc., that actually provide significant value.
           | 
           | Which is how you get people with 16 GB of system RAM using
           | graphics cards that also have 16GB of RAM.
        
         | btown wrote:
         | Ah, but have no fear - as lower RAM hardware starts dropping
         | out of the market, the RAM usage of Microsoft Teams will
         | increase to compensate!
         | 
         | (Not even /s - while the developers of LLM applications may
         | have 64GB RAM in their laptops or desktops, the less-technical
         | early adopters of LLMs running locally are likely to be power
         | users with lower-powered laptops, much more stringent RAM
         | limits, and numerous line-of-business applications and browser
         | tabs contending for that RAM. Causing those applications to be
         | swapped onto disk will almost certainly result in a degraded
         | overall experience that could easily be blamed on the LLM
         | application itself.)
        
         | nacs wrote:
         | Yes, 7B is perfectly usable on low-end hardware if you're using
         | it for instruction tuning/chat.
         | 
         | But for code completion in an IDE where it has to react as you
         | type, every 100 millisecond delay in response time is
          | noticeable.
         | 
         | Even with a 24GB GPU, a 7B model doesn't feel snappy enough for
         | code-completion in an IDE.
        
           | brucethemoose2 wrote:
           | This can be addressed with token streaming and input caching.
           | 
           | Would that be enough? _shrug_
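            | 
            | For the streaming half, a rough sketch with transformers'
            | TextIteratorStreamer (assuming a `model` and `tok` already
            | loaded with transformers); tokens show up as they're
            | generated instead of after the whole completion:
            | 
            |   from threading import Thread
            |   from transformers import TextIteratorStreamer
            |   
            |   streamer = TextIteratorStreamer(tok, skip_prompt=True)
            |   ids = tok("def parse_args():", return_tensors="pt")
            |   Thread(target=model.generate,
            |          kwargs=dict(**ids, streamer=streamer,
            |                      max_new_tokens=64)).start()
            |   for piece in streamer:
            |       print(piece, end="", flush=True)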
        
         | swyx wrote:
         | > the AI Horde approach of distributed models seems much more
         | practical anyway.
         | 
         | i wasnt aware this was a term of art. is there a definitive
         | blogpost or product explaining this approach?
        
           | ukuina wrote:
            | This is a reference to Kobold Horde, a distributed
            | volunteer network of GPUs that you can run inference on.
        
             | brucethemoose2 wrote:
             | ^
             | 
             | I didn't mean to imply splitting llama up between machines
             | (though that is a thing with llama.cpp), but a pool of
             | clients and servers who make requests and process them:
             | 
             | https://lite.koboldai.net/
             | 
             | A few users with half decent PCs can serve a much larger
             | group of people, and the "lesser" hosts can host smaller
             | models to "earn" access to larger ones.
        
         | palmer_fox wrote:
         | Perhaps the wrong thread to ask this question... Is it not
         | possible to load a model on something like an NVMe M.2 drive
         | instead of RAM? It's slower of course, but only 5-10x if I
         | understand correctly.
        
           | kirill5pol wrote:
           | Yes but they're slow enough on normal hardware for that 5-10x
           | to be painful...
        
       | igammarays wrote:
       | For the sake of not giving Microsoft and a few other tech giants
       | immense power over the world, I really do hope the cost and
       | efficiency of LLMs improve dramatically, until we can get
       | GPT-4-equivalent models trained on a few graphics cards and
       | running offline on an iPhone. Really rooting for these kinds of
       | projects until someone makes the breakthrough.
        
         | taywrobel wrote:
         | You may be interested in what we're working on at Symbolica AI.
         | 
         | We're using formal logic in the form of abstract rewrite
         | systems over a causal graph to perform geometric deep learning.
         | In theory it should be able to learn the same topological
         | structure of data that neural networks do, but using entirely
         | discrete operations and without the random walk inherent to
         | stochastic gradient descent.
         | 
         | Current experiments are really promising, and assuming the
         | growth curve continues as we scale up you should be able to
         | train a GPT-4 scale LLM in a few weeks on commodity hardware
         | (we are using a desktop with 4 4090's currently), and be able
         | to do both inference and continual fine tuning/online learning
         | on device.
        
           | paulsutter wrote:
           | Especially interested in learning directly on geometries,
           | please keep us updated and share results
        
             | taywrobel wrote:
              | Would definitely recommend Bronstein et al.'s work on
             | geometric deep learning! https://geometricdeeplearning.com
             | 
             | That's effectively the right hand side of the bridge that
             | we're building between formal logic and deep learning. So
             | far their work has been viewed mainly as descriptive,
             | helping to understand neural networks better, but as their
             | abstract calls out: "it gives a constructive procedure to
             | incorporate prior physical knowledge into neural
             | architectures and provide principled way to build future
             | architectures yet to be invented". That's us (we hope)!
        
           | arthurcolle wrote:
           | I would like to subscribe to your newsletter, we'd be super
           | interested in this at Brainchain AI.
           | 
            | Drop me a line at (my first name) @ brainchain dot AI if
           | you'd like to chat, I'd love to hear more about what you're
           | working on!
        
           | dmarchand90 wrote:
           | Really cool stuff! Do you have any recommendations of where
           | we could learn more?
        
           | krak12 wrote:
           | [dead]
        
           | pawelduda wrote:
           | Sounds cool, but what are the drawbacks?
        
             | k__ wrote:
             | It doesn't exist at scale yet.
        
             | taywrobel wrote:
             | Biggest drawback is that since the structure is all
             | discrete, it is inherently weak at modeling statistical
             | distributions. For example, it'll likely never best a
             | neural network at stock market prediction or medical data
             | extrapolation.
             | 
             | However, for things that are discrete and/or causal in
             | nature, we expect it to outperform deep learning by a wide
             | margin. We're focused on language to start, but want to
             | eventually target planning and controls problems as well,
             | such as self-driving and robotics.
             | 
              | Another drawback is that the algorithm as it stands today
              | is based on a subgraph isomorphism search, which is hard.
              | Not hard as in tricky to get right, like Paxos or other
              | complex algorithms, but NP-hard, so very difficult to
              | scale. We have some fantastic Ph.D.s working with us who
              | focus on optimizing subgraph isomorphism search, and
              | category theorists working to formalize which constraints
              | we can relax without affecting the learning mechanism of
              | the rewrite system, so we're confident it's achievable,
              | but the time horizon is currently unknown.
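              | 
              | For a sense of what that search looks like in the
              | abstract (a generic illustration with networkx, nothing
              | to do with Symbolica's actual code):
              | 
              |   import networkx as nx
              |   from networkx.algorithms import isomorphism
              |   
              |   # does a 3-cycle appear as a subgraph of a 5-clique?
              |   haystack = nx.complete_graph(5)
              |   needle = nx.cycle_graph(3)
              |   gm = isomorphism.GraphMatcher(haystack, needle)
              |   print(gm.subgraph_is_isomorphic())  # True
              |   # in the worst case this search is exponential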
        
           | KRAKRISMOTT wrote:
           | > _We're using formal logic in the form of abstract rewrite
           | systems over a causal graph to perform geometric deep
           | learning. In theory it should be able to learn the same
           | topological structure of data that neural networks do, but
           | using entirely discrete operations and without the random
           | walk inherent to stochastic gradient descent._
           | 
            | Abstract rewriting like a computer algebra system's (e.g.
            | Wolfram) term-rewriting equation simplification method?
        
             | taywrobel wrote:
             | Heavily influenced by Wolfram's work on metamathematics and
             | the physics project, in so far as using a rewrite system to
             | uncover an emergent topology; we're just using it to
             | uncover the topology of certain data (assuming that the
             | manifold hypothesis is correct), rather than the topology
             | of fundamental physics as he did.
        
         | fnordpiglet wrote:
         | I think with or without algorithmic advantages hardware will
         | improve for local model running. There's an immense amount of
         | capital being invested in hardware improvement and that will
         | absolutely trickle down.
         | 
          | My sincere belief is that local models are the way of the
          | future, with flexible base models adapted via LoRA and context
         | to specific use cases. I think open source models and
         | techniques are inexorable at this point barring some sort of
         | regulatory moat and will rival commercial models in all but
         | extreme cases.
        
         | adrenvi wrote:
         | That could also help tech giants build even larger/more capable
         | models cheaply. Ideally there would be a hard ceiling of LLM
         | capability that even massive amounts of hardware couldn't
         | exceed, allowing inexpensive hardware to catch up.
        
           | a_wild_dandan wrote:
           | I personally hope that LLMs have no such limits. The good
           | these tools can do is immeasurable.
           | 
           | I can already run Llama 2 @70b on my laptop, and that'll look
           | like a quaint old AI artifact in 5-7 years. I think the
           | consumer market will keep pace yet stay well below SotA, just
           | as it always has. That still leaves plenty of room for
           | incredible open-source stuff!
        
         | axpy906 wrote:
          | The key word in that is "models". Per the leaked GPT-4
          | details, it's not a single model but a mixture of experts
          | (MoE) with 16 experts. There's probably quite a lot of
          | complexity on the backend in sourcing the right model for the
          | right query. In short, it's probably better for the OSS
          | community to focus on single models for specific tasks, as
          | evidenced by Code Llama. Having a system like GPT-4 is still
          | difficult to replicate. Getting something to run on consumer
          | hardware for specific tasks like code gen at almost GPT-4
          | level is doable.
        
           | og_kalu wrote:
           | >There's probably quite a lot of complexity on the backend in
           | sourcing the right model for the right query.
           | 
            | This isn't how sparse MoE models work. There isn't really
            | any complexity like that, and a different set of experts
            | can be picked for each token.
            | 
            | Sparse models aren't an ensemble of models.
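            | 
            | A toy sketch of what the routing looks like (generic top-k
            | token routing, not OpenAI's actual architecture): the
            | router picks experts per token inside the layer, so there
            | is no backend dispatch between separate models.
            | 
            |   import torch
            |   import torch.nn as nn
            |   import torch.nn.functional as F
            |   
            |   class SparseMoE(nn.Module):
            |       def __init__(self, d=256, n_exp=8, k=2):
            |           super().__init__()
            |           self.k = k
            |           self.router = nn.Linear(d, n_exp)
            |           self.experts = nn.ModuleList(
            |               nn.Sequential(nn.Linear(d, 4 * d),
            |                             nn.GELU(),
            |                             nn.Linear(4 * d, d))
            |               for _ in range(n_exp))
            |   
            |       def forward(self, x):  # x: (tokens, d)
            |           w, idx = self.router(x).topk(self.k, dim=-1)
            |           w = F.softmax(w, dim=-1)  # per-token weights
            |           out = torch.zeros_like(x)
            |           for s in range(self.k):
            |               for e, exp in enumerate(self.experts):
            |                   hit = idx[:, s] == e
            |                   if not hit.any():
            |                       continue
            |                   out[hit] += w[hit, s, None] * exp(x[hit])
            |           return out
            |   
            |   moe = SparseMoE()
            |   print(moe(torch.randn(10, 256)).shape)  # (10, 256)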
        
           | [deleted]
        
           | ttul wrote:
           | There are many MoE architectures and I suppose we don't know
           | for sure which OpenAI is using. The "selection" of the right
           | mix of models is something that a network learns and it's not
           | a complex process. Certainly no more complex than training an
           | LLM.
        
             | axpy906 wrote:
             | When I wrote "backend" was a poor choice of a word. "Meta-
             | model" is probably a better choice of wording.
             | 
             | I hope it did not detract too much from the point of
             | focusing on subtasks and modalities for FOSS as GPT 4 was
             | built on a $163 million budget.
             | 
             | Finally, good point. We've got no idea of what OpenAI's MoE
              | approach is and how it works. I went back to Meta's 2022
             | NLLB-200 system paper and they didn't even publish the
             | exact details of the router (gate).
        
         | smoldesu wrote:
         | > For the sake of not giving Microsoft and a few other tech
         | giants immense power over the world
         | 
         | I agree with and appreciate the sentiment, but it feels _way_
         | too late for that. These people do have and exert direct
          | control over pretty much all of our digital devices. It's
          | funny (or sad) that we only seem to care about this when shiny
          | doodads like AI come around every so often.
        
         | stainablesteel wrote:
         | to be fair, if that is achieved then the massive models that
         | tech giants produce will probably be phenomenal
        
         | [deleted]
        
         | flangola7 wrote:
          | I don't. How do you maintain control and prevent mass harm in
          | that case? I don't see any way out other than gatekeeping
          | similar to what we apply to the ownership and use of high
          | explosives and radiological weapon tooling.
         | 
         | At all other times I support tech freedom. I use libre
         | software, I use Tor, I donate to privacy and FOSS organizations
         | constantly. I only write my software projects under an AGPL
         | license. AI is qualitatively different. A world run amok with
         | intelligent infinite Sybils is not good for anyone. I hope
         | massive compute continues to be necessary, it may be the only
         | hard chokepoint we have to keep a handle on the beast.
        
       | Manjuuu wrote:
        | Another model that we'll soon forget ever existed.
        
       | holoduke wrote:
        | What's the difference between 1% and 99% on HumanEval? What
        | does it really tell you?
        
         | kateklink wrote:
          | For pass@1, HumanEval tells you how well the model solves a
          | task from a set, given only one chance to solve it. It's not a
          | perfect metric; there are others like DS-1000 and MBPP (we
          | have included them on the Hugging Face model card). HumanEval
          | is good for benchmarking against other models, as it gives a
          | quick idea of how powerful the model is.
        
           | swyx wrote:
           | > given only one chance to solve it
           | 
           | my understanding is that there are 2 usages of the
           | pass@{number} syntax. the HumanEval/Codex paper interprets
           | the {number} as number of attempts[0]. however language
           | modelers seem to use it to denote the number of few shot
           | example demonstrations given in the context. these are
           | starkly different and i wish the syntax wasnt overloaded
           | 
           | ---
           | 
           | [0] https://arxiv.org/pdf/2107.03374.pdf
           | 
           | > Kulal et al. (2019) evaluate functional correctness using
           | the pass@k metric, where k code samples are generated per
           | problem, a problem is considered solved if any sample passes
           | the unit tests, and the total fraction of problems solved is
           | reported.
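            | 
            | for reference, the unbiased estimator from that paper:
            | generate n samples per problem, count the c that pass the
            | unit tests, and average this over problems (a minimal
            | sketch):
            | 
            |   import numpy as np
            |   
            |   def pass_at_k(n, c, k):
            |       """unbiased pass@k: n samples per problem,
            |       c of which pass the unit tests."""
            |       if n - c < k:
            |           return 1.0
            |       return 1.0 - np.prod(
            |           1.0 - k / np.arange(n - c + 1, n + 1))
            |   
            |   print(pass_at_k(n=200, c=20, k=1))  # 0.10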
        
       | mholubowski wrote:
       | Hey, I have a genuine question:
       | 
       | What is the point of a new model that isn't better than the best
       | possible model (example: OpenAI GPT-4)?
       | 
       | What's the point in having a smaller model? Who cares?
       | 
       | ---
       | 
       | This is a real, genuine question that I don't have a clear answer
       | to. Excuse my ignorance, plz enlighten your boi.
        
         | notsylver wrote:
         | IMO, the main reasons are (but are definitely not limited to):
         | 
         | - You can fine tune these models for very specific tasks, which
         | GPT-4 might not be as good at.
         | 
         | - Open source models are free. You can use them as much as you
         | want without worrying about a $xx,xxx bill at the end of the
         | month which makes tinkering with them easier.
         | 
         | - Smaller models like this can run on consumer hardware, even
         | phones, and can run offline.
         | 
          | - Privacy and not having to abide by a third party's terms.
          | You don't have to deal with "As a large language model...",
          | especially with uncensored models.
          | 
          | - Tools like jsonformer https://github.com/1rgs/jsonformer are
          | not possible with OpenAI's API (see the sketch after this
          | list).
         | 
         | - It's also just really cool, let's be honest.
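          | 
          | On the jsonformer point, a rough sketch (assumes `model` and
          | `tokenizer` are a locally loaded Hugging Face causal LM): the
          | library constrains generation token by token so the output
          | always matches a JSON schema, which needs a level of control
          | that hosted APIs don't expose.
          | 
          |   from jsonformer import Jsonformer
          |   
          |   schema = {
          |       "type": "object",
          |       "properties": {
          |           "name": {"type": "string"},
          |           "stars": {"type": "number"},
          |       },
          |   }
          |   # `model` and `tokenizer`: any local HF causal LM
          |   former = Jsonformer(model, tokenizer, schema,
          |                       "Describe this repo as JSON:")
          |   print(former())  # always valid per the schema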
        
         | tiborsaas wrote:
          | Your question sounds like: why do we need Alpine Linux when we
         | have Ubuntu? Why do we have SQLite when we have Postgres?
         | 
         | I think the point is to reach a baseline of something being
          | super lightweight yet still useful that could be production-
          | ready for a number of use cases.
        
         | seydor wrote:
         | Imagine being on Mars and running on a small PV panel and
         | needing to code a bugfix in your oxygen supply system through
         | the wire with Microsoft Earth(tm) or something
        
         | TuringNYC wrote:
         | The other answers are great, but to add more
         | 
         | - You can run it behind an air-gap, where your systems are
         | disconnected from the world.
         | 
         | - You can run it on the edge with low or no internet
         | connectivity
         | 
         | - You do not need to worry about breaching geographic data
         | restrictions, e.g.: medical data from Country X cannot leave
         | Country X
        
         | [deleted]
        
         | yieldcrv wrote:
         | 1) people can run a 1.6B model for free on consumer hardware
         | 
          | 2) any model that's run on computational resources you own or
          | lease will have more privacy than an explicit cloud
         | offering. running completely on your own local hardware will be
         | private. this means you don't have to think twice about asking
         | the LLM about the proprietary code or information you are
         | working on.
         | 
         | 3) smaller models gain the performance improvements from all
         | the other improvements in interpreters and quantizing, allowing
         | for even more consumer friendly offline use
         | 
          | 4) oh yeah, offline use. could expand use cases to having LLMs
          | baked into operating systems directly, including leading phones
         | 
         | 5) showing what's possible, pushing towards the benchmarks of
         | the best possible model while using less computational
         | resources. this also makes the hosts of the best possible model
         | realize that they could either A) be using less computational
          | resources and increasing the bandwidth for their users, or B)
         | further improve their own model because of competition.
         | Basically if ChatGPT 4 was using similar improvements in
         | technology across all areas of reasoning/whatever, there never
         | would have been a rate limit on ChatGPT 4.
         | 
         | 6) more demand for other computational resources. Nvidia is
         | backordered till maybe Q2 2024 right now. If people realize AMD
          | or even their ARM chips can offer the same performance with
          | the right combination of hardware and software, it alleviates
          | pressure on other ventures that want computation power.
        
         | SparkyMcUnicorn wrote:
         | You can use it 100% locally, and it doesn't cost anything.
        
         | [deleted]
        
         | yunwal wrote:
         | GPT4 is expensive to run, even more expensive to finetune, and
         | for all practical purposes can't be run offline (because the
         | model is too big to run outside of a huge data center).
          | Evaluation latency is also an issue for many use cases, and
          | you have to share your query with OpenAI, so you can't run
         | sensitive queries. The output is also controlled/censored by
         | OpenAI.
         | 
          | Here are a few use cases I wouldn't want to use OpenAI/GPT
          | for:
         | 
         | - Advanced autocomplete for texting and private communications
         | 
         | - Querying sensitive document databases like emails
         | 
         | - Traveling in low connectivity areas
         | 
         | - Politically incorrect usecases (generating erotic content for
         | example)
         | 
         | List kinda goes on and on
        
           | qeternity wrote:
           | > GPT4 is expensive to run, even more expensive to finetune
           | 
           | GPT4 can't even be finetuned at the moment (though I expect
           | that to change).
        
             | MichaelBurge wrote:
             | It can be finetuned. Bing is a finetuned GPT-4.
        
       | acheong08 wrote:
       | Say I want to fine tune a Golang specific model. How much $ and
       | effort would I have to put in? Would using this as a base help in
       | any way compared to starting from llama?
        
         | OlegKlimov1337 wrote:
          | Maybe it makes sense to start from llama-code, not llama :D I
          | think a Golang-specific model will not be that different from
          | a multi-language model. But it will definitely work better
          | after fine-tuning on your code. Check out the Refact self-
          | hosting docker in a couple of days; finetuning will be there
          | soon. It will take you 1 GPU and almost no money )
        
       | howon92 wrote:
       | Congrats on your achievement! I'm curious about your end goal. Do
       | you aim to beat GitHub Copilot's performance and convince devs to
       | use Refact for code completion instead of GitHub Copilot? I want
       | to understand the motivation behind these different code-
       | completion models that are not solely for academic research.
        
         | kateklink wrote:
          | We want to help developers who need either an on-premise or a
          | permissively licensed code assistant; Copilot offers neither.
          | We also wanted to lower the barrier to self-hosting, so the
          | model runs on most GPUs with just 3GB of RAM. Plus we're
          | making code completions fast and efficient (understanding the
          | entire context, not just the previous tokens).
        
         | OlegKlimov1337 wrote:
         | You can use it in practice, that was the goal of that
         | particular model! It's fast, runs on your own hardware if you
         | want it to.
        
       | glutamate wrote:
       | License text:
       | https://drive.google.com/file/d/16NqKiAkzyZ55NClubCIFup8pT2j...
       | [PDF]
       | 
       | See last page for restrictions
        
         | Havoc wrote:
         | Thanks. That looks pretty relaxed on terms
        
         | lordofgibbons wrote:
         | > In any way that violates any applicable national, federal,
         | state, local or international law or regulation;
         | 
         | Darn! Foiled again! I was planning on breaking some federal
         | laws, but the license says that I can't ;( \s
         | 
          | The Open-RAIL license has to be the worst license in existence
          | claiming to be "open".
         | 
         | > You shall undertake reasonable efforts to use the latest
         | version of the Model.
         | 
         | Plea to folks releasing models: Please stop using this user-
         | hostile and deranged license
        
       | Havoc wrote:
       | That's an impressive result
       | 
        | The OpenRAIL license seems to reference some sort of limitations
        | on safety and unethical use, but I can't see where in the repo
        | it's spelled out precisely what the authors have in mind.
        
         | [deleted]
        
       | vikp wrote:
        | This post is misleading, in a way that is hard to do
        | accidentally.
        | 
        | - They compare the performance of this model to the worst 7B
        | code llama model. The base code llama 7B python model scores
        | 38.4% on humaneval, versus the non-python model, which only
        | scores 33%.
        | 
        | - They compare their instruct tuned model to non-instruct-
        | tuned models. Instruction tuning can add 20% or more to
        | humaneval performance. For example, WizardLM 7B scores 55% on
        | humaneval [1], and I've trained a 7B model that scores 62%
        | [2].
        | 
        | - For another example of instruction tuning, Stablecode
        | instruct tuned benchmarks at 26%, not the 20% they cite for
        | the base model [3].
        | 
        | - Starcoder, when prompted properly, scores 40% on humaneval
        | [4].
        | 
        | - They do not report their base model performance (as far as
        | I can tell).
       | 
       | This is interesting work, and a good contribution, but it's
       | important to compare similar models.
       | 
       | [1] https://github.com/nlpxucan/WizardLM
       | 
       | [2] https://huggingface.co/vikp/llama_coder
       | 
       | [3] https://stability.ai/blog/stablecode-llm-generative-ai-
       | codin...
       | 
       | [4] https://github.com/huggingface/blog/blob/main/starcoder.md
        
       | umutisik wrote:
        | The title is misleading. This model is not "SOTA for the size";
        | there are smaller models that do 10-18% better in absolute
        | score. The text says it's SOTA "among similar models", where
        | they probably compare only with other models with permissive
        | licensing.
        
         | mrob wrote:
         | "Permissive" usually refers to Free Software or Open Source
         | licenses without copyleft requirements. OpenRAIL is a
         | proprietary license because it imposes usage restrictions,
         | contrary to both the Free Software and Open Source definitions.
        
         | OlegKlimov1337 wrote:
          | AFAIK there is only one model that does better, phi-1, and
          | it's Python-only and does not support fill-in-the-middle, so
          | you can't really use it.
        
           | umutisik wrote:
           | Phi-1-small also scores higher with 350M parameters. It helps
           | to be specific about what the comparison is against when
           | claiming SOTA.
        
       | ldjkfkdsjnv wrote:
        | I don't trust any benchmarks for any LLM that's not coming from
        | FB, Google, OpenAI, Anthropic, or Microsoft. These models are so
        | dynamic that simple benchmark numbers never tell the whole story
        | of the quality of the model. Take, for instance, a recent
        | posting by Anyscale claiming their fine-tuning of Llama 2 was
        | competitive with OpenAI's model. The reality is that their fine-
        | tuned model is basically worthless, and was competitive along a
        | single metric/very narrow commoditized task. It's a great way to
        | get clicks by posting these metrics though.
        
         | breadsniffer01 wrote:
         | They could have easily benchmarked with the Spider SQL test set
         | but they didn't.
         | 
         | I have a feeling that the more robust models might be the ones
         | that don't perform best on benchmarks right away.
        
         | SparkyMcUnicorn wrote:
         | The community has fine-tuned some really good llama models
         | (much better than llama-chat), but I get what you're saying.
         | 
         | I've been testing the best performing models on the huggingface
         | leaderboard lately. Some of them are really impressive, and
         | others are so bad that I second guess the prompt format or if
         | the benchmarked model is actually the same one I'm testing.
        
           | breadsniffer01 wrote:
           | Which models were really bad?
        
             | SparkyMcUnicorn wrote:
             | I was keeping track of the good ones, and don't have many
             | notes on the bad ones.
             | 
             | I do remember testing "LoKuS" last week and it was quite
             | terrible (sometimes gave completely off-topic answers). It
             | scored as one of the highest 13B models on the leaderboard
             | (~65 average), but appears to be removed now.
        
         | nomel wrote:
         | This is the goal of humaneval, correct?
        
       | zcesur wrote:
       | tangentially related: refact recently shared 4 bounties worth
       | $9,000 to help improve their tech!
       | 
       | https://algora.io/org/smallcloudai/bounties
       | 
       | disclaimer: i'm a cofounder of algora, the platform enabling
       | these bounties
        
         | [deleted]
        
       | kateklink wrote:
        | We've finished training a new code model, Refact LLM, which
        | took us about a month. The main use case is blazing-fast code
        | completion with fill-in-the-middle; additionally, the model can
        | reply to chat prompts.
        | 
        | It has much better performance than all of the code models of
        | similar size, and almost reaches the same HumanEval score as
        | StarCoder while being 10x smaller.
        | 
        | Thanks to the small size, it works on most modern GPUs,
        | requiring just 3GB of RAM.
       | 
       | You can try self-hosting it in Refact
       | https://github.com/smallcloudai/refact/ and get a local fast
       | copilot alternative with decent suggestions.
       | 
       | Weights and model card
       | https://huggingface.co/smallcloudai/Refact-1_6B-fim.
       | 
       | We would love to hear your feedback!
        
         | drcongo wrote:
         | Is it possible to run it as an LSP so that it can be used in
         | editors other than VSCode and JetBrains? (sorry if this
         | question is completely mad, my understanding of how these
         | things work is extremely limited)
        
           | OlegKlimov1337 wrote:
           | Yes, it's coming up in a couple of weeks.
        
             | drcongo wrote:
             | Great, thanks. I'll keep an eye out.
        
         | [deleted]
        
         | diminish wrote:
          | Does ctransformers
          | (https://github.com/marella/ctransformers#supported-models)
          | support running Refact?
         | 
         | I see that model type "gpt_refact" in
         | https://huggingface.co/smallcloudai/Refact-1_6B-fim/blob/mai...
        
         | ALittleLight wrote:
         | How does it compare to Copilot? A metric I'd like to see is %
          | of proposed completions accepted by a human user. If you had
          | an extension that 50% of the time proposed a Copilot
          | completion and 50% of the time a Refact completion (blind to
          | the user), then you could come up with a metric like this.
        
         | riku_iki wrote:
         | > almost reaches the same HumanEval
         | 
          | how can you tell that HumanEval hasn't leaked into your
          | training data in some form?
        
           | mityamitya wrote:
            | Hi! We ran LSH filtering over our datasets to remove all
            | code that could be similar to HumanEval samples.
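            | 
            | A generic sketch of that kind of near-duplicate filtering
            | with MinHash LSH (using the datasketch library; the
            | threshold, shingling and the `humaneval_solutions` /
            | `training_samples` lists are illustrative, not necessarily
            | our exact pipeline):
            | 
            |   from datasketch import MinHash, MinHashLSH
            |   
            |   def minhash(code, num_perm=128):
            |       m = MinHash(num_perm=num_perm)
            |       for tok in code.split():
            |           m.update(tok.encode("utf8"))
            |       return m
            |   
            |   # index HumanEval solutions, then drop any training
            |   # sample that collides above the similarity threshold
            |   lsh = MinHashLSH(threshold=0.7, num_perm=128)
            |   for i, sol in enumerate(humaneval_solutions):
            |       lsh.insert(f"he_{i}", minhash(sol))
            |   
            |   clean = [s for s in training_samples
            |            if not lsh.query(minhash(s))]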
        
             | riku_iki wrote:
             | so, we have to trust your procedure..
        
       ___________________________________________________________________
       (page generated 2023-09-04 23:00 UTC)