[HN Gopher] An empirical cybersecurity evaluation of GitHub Copi...
       ___________________________________________________________________
        
       An empirical cybersecurity evaluation of GitHub Copilot's code
       contributions
        
       Author : pramodbiligiri
       Score  : 75 points
       Date   : 2021-08-23 17:41 UTC (5 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | djrogers wrote:
       | Soooo, the big question is - is 40% higher or lower than what an
       | average developer cranks out? ;-)
        
       | MauranKilom wrote:
       | My favorite part of the paper, in the section discussing how
       | small prompt variations affect results:
       | 
       | > M-2: We set the Python author flag [in the prompt] to the lead
       | author of this paper. Sadly, it increases the number of
       | vulnerabilities.
       | 
       | > M-3: We changed the indentation style from spaces to tabs and
       | the number of vulnerable suggestions increased somewhat, as did
       | the confidence of the vulnerable answers. The top-scoring option
       | remained non-vulnerable.
       | 
       | @authors: I think something is wrong in the phrasing for M-4 (or
       | some text got jumbled). Was the top-scoring option vulnerable or
       | not? The second half might belong to D-3 instead (where no
       | assessment is given)?
        
       | agomez314 wrote:
       | >Breaking down by language, 25 scenarios were in C, generating
       | 516 programs. 258 (50.00 %) were vulnerable. Of the scenarios, 13
       | (52.00 %) had a top-scoring program vulnerable. 29 scenarios were
        | in Python, generating 571 programs total. 219 (38.4 %) were
       | vulnerable. Of the scenarios, 11 (37.93 %) had a vulnerable top-
       | scoring program.
       | 
       | I'd bet a good chunk of those were buffer-overflow related
        
       | kiwih wrote:
       | Hi - I am actually the lead author of this paper! I'd be happy to
       | answer any questions about the work.
        
         | ericpauley wrote:
         | Interesting work. My main question when reading is how the
         | results compared to manually-written code. Naturally this is a
         | much harder question to answer but it would be really
         | interesting to see the results. Could be that Copilot is doing
         | no better (or worse) than developers copying from Stack
         | Overflow.
         | 
         | Perhaps you could take a similar approach as [1] and leverage
         | MOOC participants?
         | 
         | [1] https://dl.acm.org/doi/pdf/10.1145/3383773
        
           | kiwih wrote:
           | Thanks for your feedback!
           | 
           | > My main question when reading is how the results compared
           | to manually-written code.
           | 
           | Ah, this is exactly the question. But as you say, much harder
           | to answer. Even if you run a competition, unless you can
           | encourage a wide range of developers to enter, you won't be
           | getting the real value. Instead you might be getting
           | incidence rates of code written by students/interns. Perhaps
           | if you could get a few FAANGs on board to either share
           | internal data (unlikely) or send a random sample of employees
           | (also very unlikely) to make teams and then evaluate their
           | code... It seems like a difficult question to answer.
           | 
           | We think a more doable way would be to take snapshots of
           | large open source codebases (e.g. off GitHub) and measure the
           | incidence rate of CWEs, but this also presents its own
           | challenges with analyzing the data. Also, what's the
           | relationship between open source code and all code?
           | 
           | Lots of avenues to consider.
        
         | jepler wrote:
          | It seems like many of the code examples are incorrect in the
          | PDF. For example, compare figure 6b with the file in the
          | actual dataset at experiments_dow/cwe-787/codeql-eg-
          | PotentialBufferOverflow/Copilot -- lines are truncated at the
          | first "%" char or something along those lines.
        
           | moyix wrote:
           | Yes, unfortunately the code example got somewhat mangled as
           | it passed through the arXiv sanitization script [1]. The
            | original is:
            | 
            |     char str_a[20], str_b[20], str_c[20];
            |     sprintf(str_a, "%f", a);
            |     sprintf(str_b, "%f", b);
            |     sprintf(str_c, "%f", c);
           | 
           | [1] https://github.com/google-research/arxiv-latex-cleaner
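            | 
            | For context, this snippet is the CWE-787 case because "%f" can
            | expand well past 20 bytes for large values of a, b, or c, so
            | the sprintf calls can write out of bounds. A bounded variant
            | along these lines would avoid that, though it is only a sketch
            | and not the dataset's reference fix:
            | 
            |     #include <stdio.h>
            | 
            |     /* Sketch of a bounded alternative (an assumption, not the
            |      * dataset's fix): snprintf never writes past the
            |      * destination size, at worst truncating the output. */
            |     void format_floats(float a, float b, float c)
            |     {
            |         char str_a[20], str_b[20], str_c[20];
            |         snprintf(str_a, sizeof(str_a), "%f", a);
            |         snprintf(str_b, sizeof(str_b), "%f", b);
            |         snprintf(str_c, sizeof(str_c), "%f", c);
            |         printf("%s %s %s\n", str_a, str_b, str_c);
            |     }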
        
         | bhuga wrote:
         | Is there equivalent empirical data from real programmers?
         | 
         | That is to say, you have code prompts here, let Copilot fill in
         | the gaps, and rate that code. Is there a study that uses the
         | same prompts with a selection of programmers to see if they do
         | better or worse?
         | 
         | I'm curious because in my testing of copilot, it often writes
         | garbage. But if I'm being honest, often, so do I.
         | 
         | I feel like Twitter's full of cheap shots against copilot's bad
         | outputs, but many of them don't seem to be any worse than
         | common errors. I would really like to see how copilot stands up
         | to the existing human competition, especially on axes of
         | security, which are a bit more objectively measurable than
         | general "quality".
        
           | kiwih wrote:
           | Yes, the work definitely lends itself towards the question
           | "is this better or worse than an equivalent human developer?"
           | This is quite a difficult question to answer, although I
           | agree that simply giving a large number of humans the same
           | prompts could be insightful. However, then you would be
           | rating against an aggregate of humans, rather than an
           | individual (i.e. this is "the" copilot). Also, knowing
           | research, you would really be comparing against a random
           | corpus of student answers, as it is usually students that
           | would be participating in a study such as this.
           | 
           | Nonetheless, we think that simply having a quantification of
           | Copilot's outputs is useful, as it can definitely provide an
           | indicator of how risky it might be to provide the tool to an
           | inexperienced developer that might be tempted to accept every
           | suggestion.
        
             | laumars wrote:
              | Rather than comparing against students in lab conditions,
              | I'd be more interested to see a comparison between students
              | with access to Stack Overflow et al. and students with
              | access to just Copilot. I.e., is a junior developer more
              | likely to trust bad suggestions found online or bad
              | suggestions made by Copilot?
        
               | sdevonoes wrote:
               | Junior engineers will trust whatever information is
               | provided to them as long as it is easily accessible. The
                | reason juniors consult Stack Overflow is that it is one
               | Google search and one click away, whereas consulting the
               | official documentation/reference takes more effort
               | (because they usually don't appear on Google when one
               | searches for errors/bugs/how-to). If Copilot (or another
               | similar tool) is very well integrated in whatever IDE a
               | junior is using, you can be sure it will be used and
               | trusted because it will be faster than Google+SO.
        
         | spywaregorilla wrote:
         | Supposing a team was building a product without a rigorous
         | security focus or experience. Do you have any reason to believe
         | a co-pilot enabled team would produce more or less secure
         | products?
        
           | kiwih wrote:
           | This is a difficult question to answer as one team might be
           | very different from another team.
           | 
            | However (my opinion only follows): I think our paper shows
            | that there is a danger of Copilot suggesting insecure code -
            | and inexperienced / security-unaware developers may accept
            | these suggestions without understanding the implications,
            | whereas if they had to write the code from scratch they
            | might (?) not make the same mistakes (putting in more effort
            | means there might be a higher chance they stumble upon the
            | right approach - e.g. by asking an experienced developer for
            | help).
        
             | verdverm wrote:
              | For non-Copilot sources, the words around the code found on
              | Stack Overflow or in a blog post may indicate a lack of
              | proper security, which would be a signal to a developer
              | that they need to consider something further.
        
       | waynesoftware wrote:
       | Summary: CONCLUSIONS AND FUTURE WORK
       | 
       | There is no question that next-generation 'auto-complete' tools
       | like GitHub Copilot will increase the productivity of software
       | developers. However, while Copilot can rapidly generate
       | prodigious amounts of code, our conclusions reveal that
       | developers should remain vigilant ('awake') when using Copilot as
       | a co-pilot. Ideally, Copilot should be paired with appropriate
       | security-aware tooling during both training and generation to
       | minimize the risk of introducing security vulnerabilities. While
       | our study provides new insights into its behavior in response to
       | security-relevant scenarios, future work should investigate other
       | aspects, including adversarial approaches for security-enhanced
        | training.
        
         | falcolas wrote:
         | A "lead foot" on the software development gas pedal, with no
         | attached safety systems that are activated by anybody but the
         | driver.
        
           | mistrial9 wrote:
           | just wait until github-microsoft adds a fee to use the
           | results for certain uses, and then scan all your repos
           | constantly to find code that doesn't pay up
        
           | toomuchtodo wrote:
           | Copilot didn't worsen the appsec story, it just highlighted
           | it. If you have devs who don't know how to write secure code,
           | and/or you don't have security engineering support (internal
           | or outsourced), you were already failing (or probably more
            | apropos, walking the tightrope without a net).
           | 
           | Was anyone checking the security of code copy pasted from
           | Stackoverflow? Hopefully this work gets fed back into
           | Copilot, improving it, which improves the experience (and
           | safety) for its users. Lots of folks are still writing code
           | without copilot or security engineering knowledge.
        
             | falcolas wrote:
             | > If you have devs who don't know how to write secure code
             | 
             | The problem with GHC is the developers are not writing the
             | code - they're simply accepting what's being written for
             | them, often in large quantities at a time.
             | 
             | > don't have security engineering support
             | 
             | Valuable, but my analogy was intended to point out that
             | it's not inherent in the tooling.
             | 
             | > Was anyone checking the security of code copy pasted from
             | Stackoverflow
             | 
             | Yes, other users on Stackoverflow via comments and other
             | answers. They're not perfect, but their checks and balances
             | exist as a facet of that tool.
             | 
             | > Hopefully this work gets fed back into Copilot
             | 
             | Only if it's open source, and a large volume of it, to
             | boot. In other words, I don't hold hope that the security
             | situation will be better anytime soon.
        
           | mbesto wrote:
           | > activated by anybody but the driver.
           | 
           | Except this is precisely what the abstract is saying is a
           | misuse of the system. You have the _option_ to give the
           | driver the control.
           | 
           | > Ideally, Copilot should be paired with appropriate
           | security-aware tooling during both training and generation to
           | minimize the risk of introducing security vulnerabilities.
           | 
          | You're oversimplifying by assuming the purpose of Copilot is
          | to write whole blocks of code for you. Copilot is an 80/20
          | thing, while every developer on HN is pedantically assuming
          | it's a 100/0 one.
        
       | yodon wrote:
       | tl;dr they tested GitHub Copilot against 89 risky coding
       | scenarios and found about 40% of the roughly 1,700 sample
       | implementations Copilot generated in the test were vulnerable
       | (which makes sense given it's trained on public GitHub repos,
       | many of which contain sample code that's a nightmare from a
       | security perspective).
        
       | smitop wrote:
       | I've experimented a bit with this on the raw Codex model
       | (https://smitop.com/post/codex/), and I've found that some prompt
       | engineering can be helpful: explicitly telling the model to
       | generate secure code in the prompt sometimes helps. (such as by
       | adding to the prompt something like "Here's a PHP script I wrote
       | that follows security best practices"). Codex _knows_ how to
       | write more secure code, but without the right prompting it tends
       | to write insecure code (because it was trained on a lot of bad
       | code).
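        | 
        | To make that concrete, here is a purely illustrative prompt-
        | context sketch (my own assumption, not taken from the linked
        | post): the leading comment is the only "engineering" involved,
        | and the model is asked to complete the file from the marked
        | point.
        | 
        |     /* This C program follows security best practices: all
        |      * buffers are bounds-checked and all database queries use
        |      * bound parameters. */
        |     #include <stdio.h>
        | 
        |     int main(void)
        |     {
        |         /* the model's completion would be requested from here */
        |         return 0;
        |     }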
       | 
       | > the settings and documentation as provided do not allow users
       | to see what these are set to by default
       | 
       | There isn't a single default value. Those parameters are chosen
       | dynamically (on the client side): when doing more sampling with a
       | higher top_p a higher temperature is used. I haven't tracked down
       | where the top_p value is decided upon, but I _think_ it depends
        | on the context: I believe explicitly requesting a completion
       | causes a higher top_p and a more capable model (earhart), which
       | gives better but slower results than the completions you get as
       | autocomplete (which are from the cushman model with a lower
        | top_p). Copilot doesn't use any server-side magic; all the
       | Copilot servers do is replace the GitHub authentication token
       | with an OpenAI API key and forward the request to the OpenAI API.
        
         | kiwih wrote:
         | > I've found that some prompt engineering can be helpful:
         | explicitly telling the model to generate secure code in the
         | prompt sometimes helps.
         | 
          | As noted in the diversity of prompt section, we did try a lot
          | of different, reasonable changes to the prompt to see what
          | would happen in our SQL injection scenario. In our case, asking
          | it to make the code secure actually made the results slightly
          | worse (!), and the biggest bias towards making the code better
          | was having other good code in the prompt.
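          | 
          | For readers unfamiliar with the scenario type, the contrast at
          | issue is roughly the following (a hedged sketch in C with
          | sqlite3, not the actual code from our scenarios):
          | 
          |     #include <stdio.h>
          |     #include <sqlite3.h>
          | 
          |     /* Illustrative only: the kind of pattern a tool like CodeQL
          |      * flags as SQL injection (CWE-89) vs. a parameterized
          |      * alternative. */
          |     void lookup_user(sqlite3 *db, const char *username)
          |     {
          |         /* Vulnerable pattern: untrusted input is spliced into
          |          * the SQL text, so a username like "x' OR '1'='1"
          |          * changes the query's meaning. */
          |         char query[256];
          |         snprintf(query, sizeof(query),
          |                  "SELECT * FROM users WHERE name = '%s'",
          |                  username);
          |         /* sqlite3_exec(db, query, NULL, NULL, NULL); */
          | 
          |         /* Safer pattern: the input is bound as a parameter and
          |          * is never parsed as SQL, whatever it contains. */
          |         sqlite3_stmt *stmt;
          |         if (sqlite3_prepare_v2(db,
          |                 "SELECT * FROM users WHERE name = ?",
          |                 -1, &stmt, NULL) == SQLITE_OK) {
          |             sqlite3_bind_text(stmt, 1, username, -1,
          |                               SQLITE_TRANSIENT);
          |             while (sqlite3_step(stmt) == SQLITE_ROW) {
          |                 printf("%s\n",
          |                        (const char *)sqlite3_column_text(stmt, 0));
          |             }
          |             sqlite3_finalize(stmt);
          |         }
          |     }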
         | 
         | > There isn't a single default value.
         | 
         | That's what we also guess, but as you say, it's not written or
         | documented anywhere.
        
       | lbriner wrote:
       | Surely AI can also be taught some boundary conditions like "thou
       | shalt not build SQL from strings"?
        
         | fshbbdssbbgdd wrote:
         | I think you could use linting tools that check for things like
         | this and filter the output. Or use outputs that fail the lint
         | as negative training examples.
        
           | bee_rider wrote:
           | I don't know anything about Copilot's design, but surely they
           | passed all the code they fed it in the training stage through
           | some pretty strict linters, right? I mean that's just common
           | sense...
        
       ___________________________________________________________________
       (page generated 2021-08-23 23:00 UTC)