[HN Gopher] CodeCompose: A large-scale industrial deployment of ...
___________________________________________________________________
CodeCompose: A large-scale industrial deployment of AI-assisted
code authoring
Author : azhenley
Score  : 112 points
Date   : 2023-06-03 13:38 UTC (9 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| m3kw9 wrote:
| Who else is tired of these trillion-dollar companies that talk
| and talk and have no products?
| neuronexmachina wrote:
| It's a paper about a (for now) internal product:
|
| "In this paper we present CodeCompose, an AI-assisted code
| authoring tool developed and deployed at Meta internally.
| CodeCompose is based on the InCoder LLM that merges generative
| capabilities with bi-directionality. We have scaled up
| CodeCompose to serve tens of thousands of developers at Meta,
| across 10+ programming languages and several coding surfaces."
|
| Looks like the InCoder model it's based on can be downloaded
| here: https://huggingface.co/facebook/incoder-6B
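For anyone wanting to try the linked checkpoint, here is a minimal
sketch, assuming the HuggingFace transformers and torch packages. It
uses the smaller facebook/incoder-1B sibling so it runs on modest
hardware, and it does plain left-to-right completion rather than the
bidirectional infilling the paper builds on.

    # A minimal sketch: sample a code completion from the public
    # InCoder checkpoint mentioned above. Assumes
    # `pip install torch transformers`; the 1B model fits on a laptop.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "facebook/incoder-1B"  # incoder-6B needs far more RAM
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    prompt = "def count_lines(filename):\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=48,  # keep the suggestion short, like an IDE would
        do_sample=True,
        temperature=0.2,    # low temperature: conservative completions
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))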
| Veuxdo wrote:
| Does publishing papers about internal tools serve any purpose
| other than PR?
| chromatin wrote:
| There are many scholars working in industry (from software
| engineering to biotech) who believe in the ideals of
| information sharing and publication of research.
| [deleted]
| IshKebab wrote:
| Of course it does. They give lots of details about the tools
| and their use, which is obviously helpful to anyone wanting to
| do something similar.
|
| A great example is
| https://www.uber.com/blog/research/keeping-master-green-at-s...
|
| I think maybe Gitlab Merge Trains predate it, but it was
| definitely influential.
| lmeyerov wrote:
| It was a useful read for me, esp seeing numbers on fine-tuning
| & use. We are piloting a DB analyst tool where users can use
| natural language to do DB queries, generate AI analyses, make
| interactive GPU charts, etc, so there are many nearby questions
| we think about a lot. Previously, as a PhD publishing on
| program synthesis, most of our writeups were much smaller scale
| wrt live user evals, so all combined... super cool to see.
|
| For FB... It probably helps keep the team valued internally +
| helps with retention & recruiting. For PhD-trained types, this
| kind of paper is almost table stakes.
|
| Less obvious... FB has been laying off teams like this despite
| productivity ROI intuitions, so if I was there, I'd be careful
| to quantify current + future ROI - I'm sure there are key #'s
| not being shared.
| williamstein wrote:
| This paper says: "Customized for the organization: CodeCompose
| is fine-tuned on Meta's internal code repository, and is thus
| able to handle Meta-specific languages such as Hack and Flow."
| If you work at an org that might want to build their own LLM
| trained on their own internal code base, then the lessons of
| this paper would be of value to you.
| Veuxdo wrote:
| Makes sense. For Meta, though, by publishing papers like this,
| are they hoping for something other than PR? My only other
| guess would be attracting talent.
| bee_rider wrote:
| Lots of companies have research departments and release papers;
| the people working in these departments have some academic
| roots, at least. The incentives for releasing papers are:
|
| * To raise your profile and reputation generally
|
| * The specific publish-or-perish incentives in academia
|
| * Because you really think you've done something interesting
| and novel, and want to share it with the world
|
| Only the middle one is removed when going to industry.
| alan-stark wrote:
| The abstract says _...we present metrics from our large-scale
| deployment of CodeCompose that shows its impact on Meta's
| internal code authoring experience over a 15-day time window,
| where 4.5 million suggestions were made by CodeCompose.
| Quantitative metrics reveal that (i) CodeCompose has an
| acceptance rate of 22% across several languages, and (ii) 8% of
| the code typed by users of CodeCompose is through accepting
| code suggestions from CodeCompose. Qualitative feedback
| indicates an overwhelming 91.5% positive reception for
| CodeCompose._
|
| In other words, out of 4.5 million suggestions about 80% were
| off, yet there is 91% positive reception. That's 3.6 million
| rejected suggestions that potentially distracted programmers
| from doing their work. Yet users are happy. Is there a
| contradiction in these figures?
| YetAnotherNick wrote:
| If you take a random question from Stack Overflow, my guess is
| that 80% of them don't have a correct answer, yet I am very
| happy Stack Overflow exists.
| Mountain_Skies wrote:
| I've had Bing provide me with code from SO that came from the
| question itself - code that was explicitly stated not to work,
| where the poster wanted to know what was wrong with it. Bing's
| AI didn't understand this and claimed it was a solution.
| afro88 wrote:
| Think of it like traditional code completion. It's mostly wrong
| but still useful. You either type through it, or tab/arrow to
| select the correct completion.
|
| AI code completion (like GitHub Copilot) is like this. Still a
| time saver overall, even with a low acceptance rate.
| alan-stark wrote:
| Reading these answers reminded me why I love HN - actually
| thoughtful perspectives :) Guess a lot boils down to two
| variables: (a) suggestion UX quality and (b) the definition of
| a 'rejection' event. I skimmed through the paper and it turns
| out that the 91% figure is based on feedback from 70 people,
| and anonymous feedback wasn't allowed. So 'overwhelming 91%
| favorable' can be paraphrased as '64 people out of the total
| 16k user base said they liked it'. Would be interesting to see
| indirect metrics like retention on day 15.
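A quick back-of-the-envelope check of the figures being debated
here; the inputs are from the abstract and the thread, the derived
numbers are ours:

    # Sanity-check the numbers quoted above.
    suggestions = 4_500_000   # shown over the 15-day window
    acceptance_rate = 0.22    # from the abstract

    accepted = suggestions * acceptance_rate
    rejected = suggestions - accepted
    print(f"accepted: {accepted:,.0f}, rejected: {rejected:,.0f}")
    # -> accepted: 990,000, rejected: 3,510,000 (~the 3.6M cited above)

    respondents = 70          # non-anonymous survey respondents
    positive = round(respondents * 0.915)
    print(f"positive: {positive} of {respondents}")
    # -> 64 of 70, i.e. ~0.4% of the ~16k users surveyed positive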
| idiotsecant wrote:
| Quite an insightful comment. In an institution that large, it's
| surprising there were only 64 brown-nosers. I expect that out
| of a captive audience of 16k employees you could probably get
| 64 people to give a positive opinion of replacing paychecks
| with Meta store scrip.
| seanmcdirmid wrote:
| A lot of the time, suggestions are provided but not used
| because you already knew the answer and typed fast enough not
| to take it.
| moonchrome wrote:
| It's easy to:
|
| - anticipate when the suggestions are likely to be useless and
| not even bother
|
| - scan the proposals to see if they are what you want in cases
| where it's useful
|
| It's a boilerplate generator and you're happy when it saves you
| tedious mental effort.
| rychco wrote:
| I treat it the same way I do pre-LLM LSP suggestions, which is
| basically inline documentation lookup. 'Oh, what was that
| function name for inserting something at the end? PushB- no,
| InsertAft- no, App- end! Yeah, that's it.'
|
| In this case it gave me 3 suggestions but I only accepted 1. I
| could see this taking 5-10 suggestions from an LLM when it's
| not something as straightforward as a function name. It's still
| very useful despite this low acceptance rate.
| fnordpiglet wrote:
| I'd say it's hard to argue with the positive impression of the
| engineer using it. If they find its suggestions helpful, it's
| not a distraction - it's helpful.
|
| Using GitHub Copilot daily, I find its suggestions often
| nonsense but interesting to see regardless. Often for
| boilerplate it's spot on and it saves me dozens of lines of
| typing. But it also suggests stuff on every keystroke, much of
| which I just type through, similar to intellisense. Assuming
| Meta's code thingy is better, I would find myself in that 91%,
| as I'm already there with what's available to the general
| public.
|
| My only gripe, fwiw, with Copilot in vscode is that it
| interferes with intellisense. Often I want to see the code
| completion from both, but Copilot jumps in before intellisense
| and the intellisense never renders, and I use it as an inline
| api reference. Sometimes it's so frustrating I have to turn off
| Copilot. But Copilot is generally useful enough that I reenable
| it once I've understood the api stuff I'm unsure of. There's
| some escape-backspace-period dance I can do that sometimes lets
| intellisense win. I've not dug deeply enough into vscode
| configuration to know if there's some parameter to tweak the
| race conditions. I'd note that when intellisense renders first,
| Copilot still renders its suggestions, but the other way around
| doesn't work.
| pavlov wrote:
| I think the 8% number better explains why users were so
| overwhelmingly happy. Assuming the suggestions in general are
| not distractingly wrong, then 8% of code automatically written
| is a decent amount of time saved researching solutions.
| visarga wrote:
| Interesting that 91% find it useful but only 8% of the code is
| generated by the LLM. This is even with an LLM tuned on the
| internal codebase. This will give a mild boost but not replace
| anyone.
| layer8 wrote:
| But only 22% of suggestions are accepted to produce those 8%,
| which means that the 78% of suggestions that are not accepted
| correspond to an equivalent of over 28% of all code written.
| Not sure that having to spend the time evaluating an additional
| 28% of code in vain amounts to an overall win.
|
| Though I guess the success rates when using Stack Overflow
| aren't too dissimilar.
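layer8's arithmetic, spelled out as a sketch; it assumes accepted
and rejected suggestions are of roughly similar length, which the
paper does not state:

    # If 8% of all code typed comes from the 22% of suggestions
    # that are accepted, the rejected 78% - assuming similar
    # suggestion lengths - correspond to this share of all code:
    accepted_share_of_code = 0.08
    acceptance_rate = 0.22

    rejected_equivalent = (
        accepted_share_of_code * (1 - acceptance_rate) / acceptance_rate
    )
    print(f"{rejected_equivalent:.1%}")  # -> 28.4%, layer8's "over 28%"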
| cloudking wrote:
| Have you tried GitHub Copilot? You don't have to accept the
| code suggestions, so they don't really distract you or get in
| the way once you get used to the UX.
| tablatom wrote:
| I find them extremely distracting. Evaluating a suggestion is,
| for me, an entirely different mental process from the creative
| process I'm in the middle of. The tagline that Copilot helps
| you stay in the flow is very much not my experience.
|
| I am well aware that others are having a different experience
| with it.
| bredren wrote:
| The Industrial Challenges section of the paper addresses
| specific areas of flow disruption they focused on.
|
| Some folks may never accept AI code completion / suggestions
| (like some prefer vim over modern IDEs), but at least people
| working on this stuff can describe the known points to focus
| on.
| cloudking wrote:
| I've found I am naturally ignoring the large complex
| suggestions because they usually have mistakes, and accepting
| the small easy suggestions. I respect your experience though;
| to each their own.
| irthomasthomas wrote:
| Mine doesn't even make complex suggestions. I can't get it to
| suggest more than one line at a time. Wonder what's different?
| I'm on the beta.
| baq wrote:
| The thing can generate whole unit tests if you leave it a
| one-line description in a comment next to the function you want
| tested. It's actually amazing.
| cloudking wrote:
| For example, sometimes I'll start out with a code comment for a
| function, hit enter, and the next line suggestion will be the
| entire function.
| wpride wrote:
| Everyone in this space seems to be building on the LSP, and on
| classic auto-complete in particular, as their UI. But I've
| found this to be non-ideal.
|
| - As mentioned in this paper, I definitely do not want the AI
| suggestion crowding out a suggestion generated directly from
| the type bindings
|
| - I often do want the AI to write an entirely new block of
| boilerplate. To do this you have to write a comment string
| targeted at the AI, then delete it afterwards
|
| - Sometimes I'd just like the AI to explain to me what some
| code does without writing anything
|
| - This isn't something I always want on; I find myself turning
| the plugin on and off depending on the context
|
| Overall I think we need a novel UX to really unlock the AI's
| helpfulness.
| anotherpaulg wrote:
| I have been enjoying a chat-based AI coding modality. I built
| some tooling that gets rid of the need to cut & paste code
| between the chat and your files. This makes chatting about code
| changes much more ergonomic. My tool also integrates directly
| with git, which provides a safety net. It's easy to undo
| changes if the AI does something silly.
|
| Here are some chat transcripts that give a flavor of what it's
| like to code with AI this way:
|
| https://aider.chat/examples/
|
| My tool is open source, and currently only works if you have a
| gpt-4 api key.
| fnordpiglet wrote:
| In vscode the Genie extension does these things, and you can
| provide your own contextual hooks with custom prompts. It's
| particularly good at explaining syntax and semantic errors.
| florbo wrote:
| This echoes my sentiment exactly. My biggest gripe is when type
| suggestions are replaced with AI suggestions, as I more often
| just want to auto-complete a method/attribute. I frequently
| find myself toggling AI suggestions via hotkey.
|
| As for getting a suggestion by writing comments, an "insert
| from prompt" action perhaps, or just a separate prompt
| pane/popup/whatever-you-prefer combined with good ol'
| copy+paste, would suffice.
| stepanhruda wrote:
| Does it need to be that novel of a UX?
|
| If you want to know what some code does, just select it & hit a
| keyboard shortcut (or right-click and choose "explain" from the
| menu).
|
| If you want AI to write code for you, write a comment starting
| with a specific word; it suggests the implementation and you
| can choose to accept it & replace the comment with it.
| rytill wrote:
| What kind of novel UX are you imagining?
| Animats wrote:
| Hm. It seems to be like automated Stack Overflow. Only 8% of
| the code comes from the AI system, but it's useful for getting
| examples of how to do something.
|
| Hallucination about API calls was reported as a problem. I've
| seen that one. There's an amusing, and seriously annoying,
| tendency for these systems to make up some plausible API call
| that does what you need, but doesn't exist. Maybe something
| should collect up such suggestions as proposals for new API
| calls.
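Animats' collect-the-hallucinations idea is mechanically checkable:
for Python suggestions you can try to resolve each dotted name
against the installed libraries and log the misses. A minimal
sketch; the names `api_exists`, `record`, and `proposals` are our
own hypothetical ones, not from any tool in the thread:

    # Sketch: check whether a suggested dotted API name resolves,
    # and log misses as candidate "wish list" APIs.
    import importlib

    proposals: list[str] = []

    def api_exists(dotted_name: str) -> bool:
        """Resolve e.g. 'os.path.exists': longest importable module
        prefix first, then walk the remaining attributes."""
        parts = dotted_name.split(".")
        obj = None
        for i in range(len(parts), 0, -1):
            try:
                obj = importlib.import_module(".".join(parts[:i]))
                break
            except ImportError:
                continue
        if obj is None:
            return False  # no importable prefix at all
        for attr in parts[i:]:
            if not hasattr(obj, attr):
                return False
            obj = getattr(obj, attr)
        return True

    def record(dotted_name: str) -> None:
        if not api_exists(dotted_name):
            proposals.append(dotted_name)  # plausible but nonexistent

    record("os.path.exists")    # real: not recorded
    record("os.path.realglob")  # hallucinated: recorded
    print(proposals)            # -> ['os.path.realglob']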
| freeone3000 wrote:
| The future of API design -- "yes, it would make SENSE if that
| existed, but it doesn't" => now it does.
| ChatGTP wrote:
| It's a funny game because they all need their own clones of
| each model / product.
|
| Feels like tech is making billions but is a little lost?
| zeedude wrote:
| Limit training to stackoverflow input and wham! we have
| automated modern programming ;)
| regularfry wrote:
| I would very much like a local code assist tool. Assuming
| integration with editors is my problem, what's best in class
| this week if a) I have a respectable GPU; b) I don't, and need
| CPU-only?
| wsxiaoys wrote:
| Check out https://github.com/TabbyML/tabby, which is fully
| self-hostable and comes with niche features.
|
| On M1/M2, it offers a convenient single-binary deployment,
| thanks to Rust. You can find the latest release at
| https://github.com/TabbyML/tabby/releases/tag/latest
|
| (Disclaimer: I am the author)
| MisterAV wrote:
| On Visual Studio there's an extension (by Microsoft) called
| IntelliCode, which is a small AI assistant that runs locally on
| the CPU. It doesn't come close to these new large GPU models,
| but it's quite handy. It looks at what you're typing on the
| current line and your previous activity, along with the current
| project, and tries to predict the full line - or even the same
| change on multiple lines if that makes sense.
| azhenley wrote:
| We recently published a paper on IntelliCode and shared some of
| the usage numbers.
|
| https://austinhenley.com/pubs/Vaithilingam2023ICSE_IntelliCo...
|
| Disclaimer: I'm one of the co-authors.
| mormegil wrote:
| I have just installed Fauxpilot
| <https://github.com/fauxpilot/fauxpilot> (nVidia GPU-only) and
| it works... OK. Still evaluating, and I'm basically sceptical
| about the whole concept, but... let's see.
| Filligree wrote:
| Nothing even comes close to Copilot. I realise you said
| "local", but if you insist on that you're going to be
| disappointed.
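For the CPU-only case regularfry asks about, one low-effort option
is to run a small open code model directly, trading quality for
locality, per Filligree's caveat. A minimal sketch, assuming the
HuggingFace transformers package and the small Salesforce/
codegen-350M-mono checkpoint; editor integration is left out:

    # A local, CPU-only completion loop using a small open code
    # model. This sketches the mechanics only; suggestion quality
    # is far below Copilot-class models.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "Salesforce/codegen-350M-mono"  # small enough for CPU
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    def complete(prefix: str, max_new_tokens: int = 32) -> str:
        inputs = tokenizer(prefix, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy: deterministic suggestions
            pad_token_id=tokenizer.eos_token_id,
        )
        return tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(complete("def fibonacci(n):\n"))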
| synthiq wrote:
| For anyone interested in related research, I used
| https://mirrorthink.ai to find some background on the state of
| the art.
|
| (Disclaimer: this is AI-generated, but grounded in the contents
| of papers, with real references, so I'd say it is still
| constructive.)
|
| The state of the art in code generation has seen significant
| advancements with the deployment of large language models
| (LLMs) in various code authoring tools. One such example is the
| study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT [1],
| which evaluates the code quality of these AI-assisted code
| generation tools. The study reveals that ChatGPT generates
| correct code 65.2% of the time, while GitHub Copilot and Amazon
| CodeWhisperer achieve 46.3% and 31.1% correctness,
| respectively. These results indicate that LLMs have made
| substantial progress in generating high-quality code, but there
| is still room for improvement.
|
| Other research in the field has explored various techniques to
| enhance code generation and assistance. For instance, RepoCoder
| [2] focuses on repository-level code completion by integrating
| code generation and retrieval models in an iterative paradigm.
| This approach considers the repository-level context, including
| customized information such as API definitions and identifier
| names, to improve code completion suggestions. Serenity [3]
| leverages library-based Python code analysis for code
| completion and automated machine learning. The authors explore
| the potential of the data flow analysis produced by Serenity to
| improve code completion when combined with neural models.
|
| In addition to these advancements, the field has seen progress
| in incorporating contextual information into code completion
| models. The paper on enriching source code with contextual data
| [4] investigates the impact of incorporating contextual
| information on the performance of code completion models. The
| authors conduct an empirical study to analyze the effectiveness
| of this approach. These achievements, along with the
| advancements in LLMs, contribute to the ongoing progress in
| code generation and assistance. As the field continues to
| evolve, it is expected that AI-assisted tools will become
| increasingly sophisticated and effective in assisting
| developers with various aspects of the software development
| process.
|
| [1] Evaluating the Code Quality of AI-Assisted Code Generation
| Tools: An Empirical Study on GitHub Copilot, Amazon
| CodeWhisperer, and ChatGPT - 2023:
| https://arxiv.org/abs/2304.10778
|
| [2] RepoCoder: Repository-Level Code Completion Through
| Iterative Retrieval and Generation - 2023:
| https://arxiv.org/abs/2303.12570
|
| [3] Serenity: Library Based Python Code Analysis for Code
| Completion and Automated Machine Learning - 2023:
| https://arxiv.org/abs/2301.05108
|
| [4] Enriching Source Code with Contextual Data for Code
| Completion Models: An Empirical Study - 2023:
| https://arxiv.org/abs/2304.12269
| fabmilo wrote:
| I would like to work in this code copilot space; I think it
| will be one of the fastest-growing applications of LLMs in the
| near future. I have been working on a tool to auto-generate
| docstrings for Python methods in Google format.
| bolinfest wrote:
| If you want to skip the paper and watch the video:
| https://youtu.be/ANDJ0TKjyWw
|
| Disclaimer: I am the person in the video.
| muglug wrote:
| It was a great video, and a great paper.
|
| As someone who writes quite a lot of Hack, I'm selfishly
| interested in whether you plan to open-source this work (not
| the weights, obviously, but everything else).
___________________________________________________________________
(page generated 2023-06-03 23:00 UTC)