[HN Gopher] Research recitation: A first look at rote learning i...
       ___________________________________________________________________
        
       Research recitation: A first look at rote learning in GitHub
       Copilot suggestions
        
       Author : azhenley
       Score  : 66 points
       Date   : 2021-07-03 18:32 UTC (4 hours ago)
        
 (HTM) web link (docs.github.com)
 (TXT) w3m dump (docs.github.com)
        
       | devinplatt wrote:
       | The article is worth reading, but a good summary is at the
       | bottom:
       | 
       | > This investigation demonstrates that GitHub Copilot can quote a
       | body of code verbatim, but that it rarely does so, and when it
       | does, it mostly quotes code that everybody quotes, and mostly at
       | the beginning of a file, as if to break the ice.
       | 
       | But there's still one big difference between GitHub Copilot
       | reciting code and me reciting a poem: I know when I'm quoting. I
       | would also like to know when Copilot is echoing existing code
       | rather than coming up with its own ideas. That way, I'm able to
       | look up background information about that code, and to include
       | credit where credit is due.
       | 
       | The answer is obvious: sharing the prefiltering solution we used
       | in this analysis to detect overlap with the training set. When a
       | suggestion contains snippets copied from the training set, the UI
       | should simply tell you where it's quoted from. You can then
       | either include proper attribution or decide against using that
       | code altogether.
       | 
       | This duplication search is not yet integrated into the technical
       | preview, but we plan to do so. And we will both continue to work
       | on decreasing rates of recitation, and on making its detection
       | more precise.
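        | 
        | (An aside, not from the article: a minimal sketch of what such
        | a training-set overlap check could look like, assuming a plain
        | word n-gram index -- the real prefilter is surely more
        | involved.)
        | 
        |     def build_index(training_files, n=20):
        |         """Map every n-word window to a file it appears in."""
        |         index = {}
        |         for path, text in training_files.items():
        |             words = text.split()
        |             for i in range(len(words) - n + 1):
        |                 index.setdefault(tuple(words[i:i + n]), path)
        |         return index
        | 
        |     def quoted_from(suggestion, index, n=20):
        |         """Files sharing an n-word run with a suggestion."""
        |         words = suggestion.split()
        |         return {index[tuple(words[i:i + n])]
        |                 for i in range(len(words) - n + 1)
        |                 if tuple(words[i:i + n]) in index}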
        
         | sillysaurusx wrote:
         | The answer wasn't obvious to me. Nice solution.
         | 
         | It sounds like you're a part of the Copilot team. If so, then
         | I'm happy to see the Copilot team cares about these issues at
         | all. I was expecting nothing but stonewall until the
         | conversation died out, since realistically the chance of the
         | EFF bringing or winning a lawsuit seems small. (And who else
         | would try?)
         | 
          | But when you anger the world and bring so much attention to
          | this delicate issue of copyright in AI, you put every
          | hobbyist at risk. Suppose the world decides that AI models
          | need to be restricted. Now every person who wants to get into
          | AI will need to deal with it. I'm not sure anyone else cares,
          | but I care, because it's the difference between someone
          | getting into woodworking (an unrestricted hobby) vs becoming
          | a lawyer or doctor (a maximally restricted profession). The
          | closer we are to the latter, the fewer ML practitioners we'll
          | see in the long run. And even though the world will go along
          | fine -- it always does -- it'd be a sad outcome, since the
          | only way it could happen is if gigantic corporations
          | flagrantly flew in the face of the spirit of copyright,
          | daring it to punish them.
         | 
         | My point is, please care about the right things. No one cared
         | about language filters on ML models outside of a select vocal
         | group, yet look how deeply OpenAI took those concerns to heart.
         | Everybody cares whether their personal or professional work is
         | being ripped off by an overfitted AI model, and it wasn't
         | obvious that GitHub or OpenAI gave it more than a passing
         | thought.
         | 
         | Backlinking to the training set should help. But it's also
         | going to catapult the concern of "holy moly, this code is GPL
         | licensed!" to the front and center of anyone who works in
         | corporate settings. Gamedev is particularly insular when it
         | comes to GPL, and I can just imagine the conversations at
         | various studios. "This thing might spit out GPL? We can't use
         | this."
         | 
          | My point is, when you launch that new feature to address
          | people's concerns, please ensure it actually works. Exact
          | string matches against the training set won't be enough; you
          | can't hide behind "well, it's slightly different, so it's not
          | really the same thing." If it's substantially similar, it
          | needs to be cited. And that seems like a much tougher problem
          | than merely building an index of matching code fragments.
         | 
         | If you launch it, and it doesn't work, it's going to stoke the
         | flames. Careful not to roast.
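          | 
          | (A minimal sketch of what a "substantially similar" check
          | might look like, with stdlib difflib as a stand-in for
          | whatever real similarity metric that would take, and an
          | arbitrary threshold:)
          | 
          |     import difflib
          | 
          |     def similarity(a, b):
          |         """Line-level, whitespace-insensitive ratio."""
          |         xs = [l.strip() for l in a.splitlines() if l.strip()]
          |         ys = [l.strip() for l in b.splitlines() if l.strip()]
          |         return difflib.SequenceMatcher(None, xs, ys).ratio()
          | 
          |     def needs_citation(suggestion, corpus, threshold=0.8):
          |         """0.8 is arbitrary, purely for illustration."""
          |         return [s for s in corpus
          |                 if similarity(suggestion, s) >= threshold]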
        
           | onionisafruit wrote:
           | FYI, the post you replied to is entirely a quote from the
           | article (even though formatting makes it appear that only the
           | second paragraph is a quote). So the poster likely is not
           | working on copilot.
        
       | blamestross wrote:
        | Takeaways, in summary:
        | 
        | - Copilot does sometimes copy by rote in nontrivial situations.
        | - Mostly this happens when there isn't much context to go on.
        | - Given an empty file, it proceeds to recommend the text of the
        | GPL.
        | - They will add a "recitation detector" to Copilot to flag
        | non-novel recommendations.
       | 
       | By the standards of corp-speak this is pretty good, they admit
       | there is a problem there and they intend to do something
       | tractable to prevent it.
       | 
        | This entire Copilot situation is far enough outside my personal
        | "mental ethics model" that I'm abstaining from taking a stance
        | until I've had a lot more time to think and learn about it.
        
       | toxik wrote:
       | Uh, how about a disclaimer that this analysis is made /by
       | Github/?
        
         | onionisafruit wrote:
          | It's on https://docs.github.com. Of course it's by GitHub.
        
           | notatoad wrote:
            | I don't think that's necessarily obvious. Most links
            | submitted to HN from the github.com domain aren't authored
            | by GitHub.
        
             | onionisafruit wrote:
             | That's understandable. I thought the GP was saying that the
             | article should contain the disclaimer.
        
       | iudqnolq wrote:
        | I continue to be surprised that GitHub shows examples that
        | wouldn't compile or run correctly. For example, the Wikipedia
        | scraping example, which the author claims is also the intuitive
        | way to solve the problem, assigns each row to the global
        | variable cols instead of appending. Further, the following if
        | statement appears to be mis-indented.
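        | 
        | (Reconstructing the shape of that bug from the description --
        | the actual example isn't reproduced here, and the
        | BeautifulSoup-style names are assumptions:)
        | 
        |     from bs4 import BeautifulSoup
        | 
        |     html = ("<table><tr><td>a</td></tr>"
        |             "<tr><td>b</td></tr></table>")
        |     table = BeautifulSoup(html, "html.parser")
        | 
        |     cols = []
        |     for row in table.find_all("tr"):
        |         cols = row.find_all("td")  # overwrites cols each pass
        |     print(cols)  # only the last row's cells survive
        |     # presumably intended: cols.append(row.find_all("td"))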
        
         | qayxc wrote:
         | > I continue to be surprised GitHub shows examples that
         | wouldn't compile/run correctly.
         | 
          | Why is that surprising to you? Copilot doesn't actually know
          | how to code; it just generates symbols that match learned
          | patterns.
         | 
          | Sometimes these generated symbols don't represent valid code,
          | and since Copilot doesn't actually perform any filtering
          | based on syntax checks or a JIT, these results end up as
          | suggestions.
         | 
          | This is actually a point where future versions could greatly
          | improve usefulness, e.g. by using compiler infrastructure to
          | verify and filter generated results.
         | 
         | This includes auto-formatting and even result scoring by code
         | metrics (conciseness, complexity, ...). Plenty of room for
         | improvement even without touching the underlying model.
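          | 
          | A minimal sketch of that kind of post-filter, assuming
          | Python suggestions (ast.parse is just the cheapest possible
          | check):
          | 
          |     import ast
          | 
          |     def syntactically_valid(candidates):
          |         """Keep snippets that at least parse as Python."""
          |         valid = []
          |         for code in candidates:
          |             try:
          |                 ast.parse(code)
          |                 valid.append(code)
          |             except SyntaxError:
          |                 pass
          |         return valid
          | 
          | A real filter could go further -- type checks, linting, even
          | running the snippet's tests -- but a parse check alone would
          | already catch suggestions that aren't even valid syntax.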
        
           | iudqnolq wrote:
           | I'm not surprised the actual results are flawed. I'm
           | surprised the handful of hand-picked examples Github calls
           | out are flawed. Usually handpicked examples of algorithms are
           | the best-case performance.
        
       | thunderbird120 wrote:
       | It should be noted that this kind of behavior is entirely
       | expected from a GPT-style self-supervised sequence model. Rote
       | memorization for this kind of model is indicative of correct
       | training, not overfitting. The underlying training objective of
       | these models ideally results in a representation of the training
       | data which allows complete samples to be extracted by using
       | partial samples as keys. Actual overfitting in this kind of model
       | requires absurd parameter counts. See
       | https://tilde.town/~fessus/reward_is_unnecessary.pdf
        
       | amelius wrote:
       | Does it substitute variables correctly? E.g. if I define max(a,b)
       | or max(x,y) does it complete the definition with the right
       | variable names?
        
         | onionisafruit wrote:
         | Generally, yes. It's not guaranteed to do that correctly, but
         | I've not seen it get variable names wrong so far.
        
           | tyingq wrote:
            | There's an example called "fetch_tweets.py" at the bottom
            | of the page on https://copilot.github.com/ that gets it
            | wrong:
            | 
            |     def fetch_tweets_from_user(user_name):
            |         # deleted some lines here...
            |         # fetch tweets
            |         tweets = api.user_timeline(screen_name=user,
            |                                    count=200,
            |                                    include_rts=False)
            | 
            | The screen_name=user there isn't right; the parameter is
            | user_name.
            | 
            | It's a nit, but it is interesting how many of the hand-
            | picked examples on that page aren't right, given that they
            | were presumably picked to show the product off.
        
             | [deleted]
        
           | the8472 wrote:
           | > It's not guaranteed to do that correctly
           | 
            | Which is odd, considering they could run this as a beam
            | search with the checking part of a compiler in the loop.
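            | 
            | (A toy sketch of that idea -- model_extend stands in for
            | the real language model, and the prefix check is a crude,
            | version-dependent heuristic:)
            | 
            |     import ast
            | 
            |     def plausible_prefix(code):
            |         """Accept code that parses, or fails only
            |         because it looks incomplete."""
            |         try:
            |             ast.parse(code)
            |             return True
            |         except SyntaxError as e:
            |             msg = str(e)
            |             return ("unexpected EOF" in msg
            |                     or "was never closed" in msg)
            | 
            |     def beam_search(prompt, model_extend, width=4,
            |                     steps=20):
            |         # Taking the first `width` survivors stands in
            |         # for ranking by model score.
            |         beam = [prompt]
            |         for _ in range(steps):
            |             grown = [c for p in beam
            |                      for c in model_extend(p)]
            |             ok = [c for c in grown
            |                   if plausible_prefix(c)]
            |             beam = ok[:width] or beam
            |         return beam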
        
       | tyingq wrote:
        | The analysis seems to depend on runs where the same exact words
        | appear X-in-a-row, in the same order. If my understanding of
        | how this works is right, the model can globally change symbol
        | names based on the prompt, and probably do other things that
        | make a literal match less likely even when the difference is
        | trivial: symbol names swapped, equivalent operators used (+= 1
        | vs ++, etc.), reordering where order doesn't matter, and so on.
       | 
       | Of course, I'm just speculating since I don't have access to the
       | product, but I have seen GPT-3 output that is verbatim plus some
       | synonym swapping.
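        | 
        | (A sketch of how matching could be hardened against exactly
        | that -- canonicalize identifiers before comparing; illustration
        | only:)
        | 
        |     import io, keyword, token, tokenize
        | 
        |     def canonical(code):
        |         """Replace identifiers with stable placeholders."""
        |         names, out = {}, []
        |         gen = tokenize.generate_tokens(
        |             io.StringIO(code).readline)
        |         for tok in gen:
        |             if (tok.type == token.NAME
        |                     and not keyword.iskeyword(tok.string)):
        |                 out.append(names.setdefault(
        |                     tok.string, f"x{len(names)}"))
        |             elif tok.type in (token.NAME, token.OP,
        |                               token.NUMBER, token.STRING):
        |                 out.append(tok.string)
        |         return " ".join(out)
        | 
        |     # Two "different" snippets, same canonical form:
        |     assert canonical("total += price * qty") == \
        |            canonical("sum += cost * n")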
        
       | ionwake wrote:
       | Sorry for the basic question, but the code one builds on this
       | platform is saved to githubs servers?
        
         | rvz wrote:
         | Anything pasted or typed into that Copilot editor is sent to
         | GitHub as 'telemetry'.
         | 
         | > In order to generate suggestions, GitHub Copilot transmits
         | part of the file you are editing to the service.
         | 
          | So yes.
        
           | astrange wrote:
           | That's not telemetry, it's the prompt to the model to
           | generate the rest of the text. I would assume it's not saved.
        
             | onionisafruit wrote:
             | I wouldn't think to call that telemetry either, but it is
             | addressed in this doc about telemetry:
             | https://docs.github.com/en/github/copilot/telemetry-
             | terms#ad...
        
         | Hamuko wrote:
         | Don't know about saved, but definitely sent.
         | 
          | I guess one risk, knowing that GitHub regards all source
          | code, regardless of its license, as fair game for training
          | Copilot, is that you probably can't know for sure that your
          | new code isn't being used to teach the model more.
        
           | onionisafruit wrote:
           | It's clear the author of this article had access to the code
           | that triggered the copilot suggestions. They also say this
           | was from an internal trial of copilot, so it might be that
           | these trial users were told their code could be seen by their
           | coworkers.
        
         | k__ wrote:
         | yes
        
       ___________________________________________________________________
       (page generated 2021-07-03 23:00 UTC)