[HN Gopher] An empirical cybersecurity evaluation of GitHub Copi...
___________________________________________________________________

An empirical cybersecurity evaluation of GitHub Copilot's code
contributions

Author : pramodbiligiri
Score  : 75 points
Date   : 2021-08-23 17:41 UTC (5 hours ago)

(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)

| djrogers wrote:
| Soooo, the big question is - is 40% higher or lower than what an
| average developer cranks out? ;-)

| MauranKilom wrote:
| My favorite part of the paper, in the section discussing how
| small prompt variations affect results:
|
| > M-2: We set the Python author flag [in the prompt] to the lead
| author of this paper. Sadly, it increases the number of
| vulnerabilities.
|
| > M-3: We changed the indentation style from spaces to tabs and
| the number of vulnerable suggestions increased somewhat, as did
| the confidence of the vulnerable answers. The top-scoring option
| remained non-vulnerable.
|
| @authors: I think something is wrong in the phrasing for M-4 (or
| some text got jumbled). Was the top-scoring option vulnerable or
| not? The second half might belong to D-3 instead (where no
| assessment is given)?

| agomez314 wrote:
| > Breaking down by language, 25 scenarios were in C, generating
| 516 programs. 258 (50.00 %) were vulnerable. Of the scenarios,
| 13 (52.00 %) had a top-scoring program vulnerable. 29 scenarios
| were in Python, generating 571 programs total. 219 (38.4 %) were
| vulnerable. Of the scenarios, 11 (37.93 %) had a vulnerable
| top-scoring program.
|
| I'd bet a good chunk of those were buffer-overflow related.

| kiwih wrote:
| Hi - I am actually the lead author of this paper! I'd be happy
| to answer any questions about the work.

  | ericpauley wrote:
  | Interesting work. My main question when reading is how the
  | results compare to manually-written code. Naturally this is a
  | much harder question to answer, but it would be really
  | interesting to see the results. Could be that Copilot is doing
  | no better (or worse) than developers copying from Stack
  | Overflow.
  |
  | Perhaps you could take a similar approach as [1] and leverage
  | MOOC participants?
  |
  | [1] https://dl.acm.org/doi/pdf/10.1145/3383773

    | kiwih wrote:
    | Thanks for your feedback!
    |
    | > My main question when reading is how the results compare
    | to manually-written code.
    |
    | Ah, this is exactly the question. But as you say, much
    | harder to answer. Even if you run a competition, unless you
    | can encourage a wide range of developers to enter, you won't
    | be getting the real picture. Instead you might be getting
    | incidence rates of code written by students/interns. Perhaps
    | if you could get a few FAANGs on board to either share
    | internal data (unlikely) or send a random sample of
    | employees (also very unlikely) to make teams and then
    | evaluate their code... It seems like a difficult question to
    | answer.
    |
    | We think a more doable way would be to take snapshots of
    | large open source codebases (e.g. off GitHub) and measure
    | the incidence rate of CWEs, but this also presents its own
    | challenges with analyzing the data. Also, what's the
    | relationship between open source code and all code?
    |
    | Lots of avenues to consider.
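(As a rough illustration of the open-source baseline kiwih
describes: one could scan repo snapshots with the CodeQL CLI, which
the paper itself uses, and count findings per CWE. A minimal
sketch only; the query-suite name and the autobuild assumption are
illustrative and may differ across CodeQL versions.)

    # Sketch: estimate CWE incidence in an open-source snapshot
    # using the CodeQL CLI (assumes `codeql` is on PATH).
    import json
    import subprocess

    def count_alerts(src_dir: str, db_dir: str, sarif_out: str) -> int:
        # Build a CodeQL database for a C/C++ snapshot. Compiled
        # languages may need an explicit --command=<build cmd>;
        # without one, CodeQL attempts an autobuild.
        subprocess.run(
            ["codeql", "database", "create", db_dir,
             "--language=cpp", f"--source-root={src_dir}"],
            check=True,
        )
        # Run a security query suite (suite name is illustrative).
        subprocess.run(
            ["codeql", "database", "analyze", db_dir,
             "codeql/cpp-queries:codeql-suites/cpp-security-and-quality.qls",
             "--format=sarif-latest", f"--output={sarif_out}"],
            check=True,
        )
        with open(sarif_out) as f:
            sarif = json.load(f)
        # Each SARIF "result" is one finding; grouping results by
        # rule id would give a rough per-CWE incidence rate.
        return sum(len(run.get("results", [])) for run in sarif["runs"])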
  | jepler wrote:
  | It seems like many of the code examples are incorrect in the
  | pdf; for example, figure 6b vs the file in the actual dataset
  | experiments_dow/cwe-787/codeql-eg-PotentialBufferOverflow/Copilot
  | -- lines are truncated at the first "%" char or something
  | along those lines.

    | moyix wrote:
    | Yes, unfortunately the code example got somewhat mangled as
    | it passed through the arXiv sanitization script [1]. The
    | original is:
    |
    |     char str_a[20], str_b[20], str_c[20];
    |     sprintf(str_a, "%f", a);
    |     sprintf(str_b, "%f", b);
    |     sprintf(str_c, "%f", c);
    |
    | [1] https://github.com/google-research/arxiv-latex-cleaner

  | bhuga wrote:
  | Is there equivalent empirical data from real programmers?
  |
  | That is to say, you have code prompts here, let Copilot fill
  | in the gaps, and rate that code. Is there a study that uses
  | the same prompts with a selection of programmers to see if
  | they do better or worse?
  |
  | I'm curious because in my testing of Copilot, it often writes
  | garbage. But if I'm being honest, often, so do I.
  |
  | I feel like Twitter's full of cheap shots against Copilot's
  | bad outputs, but many of them don't seem to be any worse than
  | common errors. I would really like to see how Copilot stands
  | up to the existing human competition, especially on axes of
  | security, which are a bit more objectively measurable than
  | general "quality".

    | kiwih wrote:
    | Yes, the work definitely lends itself towards the question
    | "is this better or worse than an equivalent human
    | developer?" This is quite a difficult question to answer,
    | although I agree that simply giving a large number of humans
    | the same prompts could be insightful. However, then you
    | would be rating against an aggregate of humans, rather than
    | an individual (i.e. this is "the" Copilot). Also, knowing
    | research, you would really be comparing against a random
    | corpus of student answers, as it is usually students who
    | participate in a study such as this.
    |
    | Nonetheless, we think that simply having a quantification of
    | Copilot's outputs is useful, as it can definitely provide an
    | indicator of how risky it might be to provide the tool to an
    | inexperienced developer who might be tempted to accept every
    | suggestion.

      | laumars wrote:
      | Rather than comparing against students in lab conditions,
      | I'd be more interested to see students with access to
      | Stack Overflow et al. compared to students with access to
      | just Copilot. I.e., is a junior developer more likely to
      | trust bad suggestions found online or bad suggestions made
      | by Copilot?

        | sdevonoes wrote:
        | Junior engineers will trust whatever information is
        | provided to them as long as it is easily accessible. The
        | reason juniors consult Stack Overflow is that it is one
        | Google search and one click away, whereas consulting the
        | official documentation/reference takes more effort
        | (because it usually doesn't appear on Google when one
        | searches for errors/bugs/how-tos). If Copilot (or
        | another similar tool) is very well integrated into
        | whatever IDE a junior is using, you can be sure it will
        | be used and trusted, because it will be faster than
        | Google+SO.

  | spywaregorilla wrote:
  | Suppose a team is building a product without a rigorous
  | security focus or experience. Do you have any reason to
  | believe a Copilot-enabled team would produce more or less
  | secure products?

    | kiwih wrote:
    | This is a difficult question to answer, as one team might be
    | very different from another team.
    |
    | However (my opinion only follows), I think our paper shows
    | that there is a danger of Copilot suggesting insecure code -
    | and inexperienced / security non-aware developers may accept
    | these suggestions without understanding the implications,
    | whereas if they had to write the code from scratch then they
    | might (?) not make the mistakes (as they need to put in more
    | effort, meaning there might be a higher chance they stumble
    | upon the right approach - e.g. if they ask an experienced
    | developer for help).
      | verdverm wrote:
      | For non-Copilot sources, the words around code found on
      | Stack Overflow or a blog post may indicate a lack of
      | proper security handling, which is a signal to the
      | developer that they need to consider something further.

| waynesoftware wrote:
| Summary: CONCLUSIONS AND FUTURE WORK
|
| There is no question that next-generation 'auto-complete' tools
| like GitHub Copilot will increase the productivity of software
| developers. However, while Copilot can rapidly generate
| prodigious amounts of code, our conclusions reveal that
| developers should remain vigilant ('awake') when using Copilot
| as a co-pilot. Ideally, Copilot should be paired with
| appropriate security-aware tooling during both training and
| generation to minimize the risk of introducing security
| vulnerabilities. While our study provides new insights into its
| behavior in response to security-relevant scenarios, future work
| should investigate other aspects, including adversarial
| approaches for security-enhanced training.

  | falcolas wrote:
  | A "lead foot" on the software development gas pedal, with no
  | attached safety systems that are activated by anybody but the
  | driver.

    | mistrial9 wrote:
    | just wait until github-microsoft adds a fee to use the
    | results for certain uses, and then scans all your repos
    | constantly to find code that doesn't pay up.

    | toomuchtodo wrote:
    | Copilot didn't worsen the appsec story, it just highlighted
    | it. If you have devs who don't know how to write secure
    | code, and/or you don't have security engineering support
    | (internal or outsourced), you were already failing (or,
    | probably more apropos, walking the tightrope without a net).
    |
    | Was anyone checking the security of code copy-pasted from
    | Stack Overflow? Hopefully this work gets fed back into
    | Copilot, improving it, which improves the experience (and
    | safety) for its users. Lots of folks are still writing code
    | without Copilot or security engineering knowledge.

      | falcolas wrote:
      | > If you have devs who don't know how to write secure code
      |
      | The problem with GHC is the developers are not writing the
      | code - they're simply accepting what's being written for
      | them, often in large quantities at a time.
      |
      | > don't have security engineering support
      |
      | Valuable, but my analogy was intended to point out that
      | it's not inherent in the tooling.
      |
      | > Was anyone checking the security of code copy pasted
      | from Stackoverflow
      |
      | Yes, other users on Stack Overflow, via comments and other
      | answers. They're not perfect, but their checks and
      | balances exist as a facet of that tool.
      |
      | > Hopefully this work gets fed back into Copilot
      |
      | Only if it's open source, and a large volume of it, to
      | boot. In other words, I don't hold out hope that the
      | security situation will be better anytime soon.

    | mbesto wrote:
    | > activated by anybody but the driver.
    |
    | Except this is precisely what the abstract is saying is a
    | misuse of the system. You have the _option_ to give the
    | driver the control.
    |
    | > Ideally, Copilot should be paired with appropriate
    | security-aware tooling during both training and generation
    | to minimize the risk of introducing security
    | vulnerabilities.
    |
    | You're oversimplifying by assuming the purpose of Copilot is
    | to write a whole block of generated code. Copilot is an
    | 80/20 thing, while every developer on HN is pedantically
    | assuming it's a 100/0 one.

| yodon wrote:
| tl;dr: they tested GitHub Copilot against 89 risky coding
| scenarios and found that about 40% of the roughly 1,700 sample
| implementations Copilot generated in the test were vulnerable
| (which makes sense given it's trained on public GitHub repos,
| many of which contain sample code that's a nightmare from a
| security perspective).

| smitop wrote:
| I've experimented a bit with this on the raw Codex model
| (https://smitop.com/post/codex/), and I've found that some
| prompt engineering can be helpful: explicitly telling the model
| to generate secure code in the prompt sometimes helps, such as
| adding to the prompt something like "Here's a PHP script I wrote
| that follows security best practices". Codex _knows_ how to
| write more secure code, but without the right prompting it tends
| to write insecure code (because it was trained on a lot of bad
| code).
|
| > the settings and documentation as provided do not allow users
| to see what these are set to by default
|
| There isn't a single default value. Those parameters are chosen
| dynamically (on the client side): when doing more sampling with
| a higher top_p, a higher temperature is used. I haven't tracked
| down where the top_p value is decided upon, but I _think_ it
| depends on the context: I believe explicitly requesting a
| completion causes a higher top_p and a more capable model
| (earhart), which gives better but slower results than the
| completions you get as autocomplete (which come from the cushman
| model with a lower top_p). Copilot doesn't use any server-side
| magic; all the Copilot servers do is replace the GitHub
| authentication token with an OpenAI API key and forward the
| request to the OpenAI API.

  | kiwih wrote:
  | > I've found that some prompt engineering can be helpful:
  | explicitly telling the model to generate secure code in the
  | prompt sometimes helps.
  |
  | As noted in the section on diversity of prompts, we did try a
  | lot of different/reasonable changes to the prompt to see what
  | would happen in our SQL injection scenario. In our case,
  | asking it to make the code secure actually made the output
  | slightly worse (!), and the biggest bias towards making the
  | code better was having other good code in the prompt.
  |
  | > There isn't a single default value.
  |
  | That's what we also guessed, but as you say, it's not written
  | down or documented anywhere.
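(A minimal sketch of the prompt-prefix idea smitop describes,
using the 2021-era OpenAI Python client. The engine name, prompt,
and sampling parameters here are assumptions for illustration
only - and, as kiwih notes above, such prefixes don't reliably
make the output more secure.)

    # Sketch: bias Codex towards secure code via a prompt prefix.
    # Engine name and sampling values are illustrative guesses.
    import openai

    openai.api_key = "..."  # your API key

    SECURE_PREFIX = (
        "# Here's a Python script I wrote that follows security\n"
        "# best practices when handling untrusted user input.\n"
    )

    prompt = SECURE_PREFIX + (
        "import sqlite3\n"
        "def get_user(db, username):\n"
    )

    resp = openai.Completion.create(
        engine="davinci-codex",  # assumed Codex engine name
        prompt=prompt,
        max_tokens=96,
        temperature=0.2,  # lower temperature -> less diverse samples
        top_p=0.95,       # nucleus sampling, as discussed above
    )
    print(resp["choices"][0]["text"])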
| lbriner wrote:
| Surely AI can also be taught some boundary conditions, like
| "thou shalt not build SQL from strings"?

  | fshbbdssbbgdd wrote:
  | I think you could use linting tools that check for things like
  | this and filter the output. Or use outputs that fail the lint
  | as negative training examples.

    | bee_rider wrote:
    | I don't know anything about Copilot's design, but surely
    | they passed all the code they fed it in the training stage
    | through some pretty strict linters, right? I mean, that's
    | just common sense...
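(A small sketch of the lint-and-filter idea from this subthread,
using Bandit, a real Python security linter. The helper name is
hypothetical, and whether a given snippet trips Bandit's B608
"SQL built from strings" rule can vary by Bandit version.)

    # Sketch: reject generated snippets that fail a security lint
    # (assumes the `bandit` CLI is installed and on PATH).
    import os
    import subprocess
    import tempfile

    def passes_bandit(code: str) -> bool:
        """Return True iff Bandit reports no issues for the snippet."""
        with tempfile.NamedTemporaryFile(
            "w", suffix=".py", delete=False
        ) as f:
            f.write(code)
            path = f.name
        try:
            # Bandit exits non-zero when it finds issues.
            result = subprocess.run(
                ["bandit", "-q", path], capture_output=True
            )
            return result.returncode == 0
        finally:
            os.unlink(path)

    # lbriner's example: SQL built from strings (Bandit rule B608).
    suggestion = (
        "cursor.execute(\"SELECT * FROM users "
        "WHERE name = '%s'\" % name)\n"
    )
    if not passes_bandit(suggestion):
        print("rejected: builds SQL from strings")

___________________________________________________________________
(page generated 2021-08-23 23:00 UTC)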