[HN Gopher] Research recitation: A first look at rote learning i...
___________________________________________________________________

Research recitation: A first look at rote learning in GitHub
Copilot suggestions

Author : azhenley
Score  : 66 points
Date   : 2021-07-03 18:32 UTC (4 hours ago)

(HTM) web link (docs.github.com)
(TXT) w3m dump (docs.github.com)

| devinplatt wrote:
| The article is worth reading, but a good summary is at the
| bottom:
|
| > This investigation demonstrates that GitHub Copilot can quote
| a body of code verbatim, but that it rarely does so, and when it
| does, it mostly quotes code that everybody quotes, and mostly at
| the beginning of a file, as if to break the ice.
|
| But there's still one big difference between GitHub Copilot
| reciting code and me reciting a poem: I know when I'm quoting. I
| would also like to know when Copilot is echoing existing code
| rather than coming up with its own ideas. That way, I'm able to
| look up background information about that code, and to include
| credit where credit is due.
|
| The answer is obvious: sharing the prefiltering solution we used
| in this analysis to detect overlap with the training set. When a
| suggestion contains snippets copied from the training set, the
| UI should simply tell you where it's quoted from. You can then
| either include proper attribution or decide against using that
| code altogether.
|
| This duplication search is not yet integrated into the technical
| preview, but we plan to do so. And we will both continue to work
| on decreasing rates of recitation, and on making its detection
| more precise.
| sillysaurusx wrote:
| The answer wasn't obvious to me. Nice solution.
|
| It sounds like you're a part of the Copilot team. If so, then
| I'm happy to see the Copilot team cares about these issues at
| all. I was expecting nothing but stonewalling until the
| conversation died out, since realistically the chance of the
| EFF bringing or winning a lawsuit seems small.
| (And who else would try?)
|
| But when you anger the world and bring so much attention to
| this delicate issue of copyright in AI, you put every hobbyist
| at risk. Suppose the world decides that AI models need to be
| restricted. Now every person who wants to get into AI will need
| to deal with it. I'm not sure anyone else cares, but I care,
| because it's the difference between someone getting into
| woodworking (an unrestricted hobby) vs becoming a lawyer or
| doctor (the maximally restrictive hobby). The closer we are to
| the latter, the fewer ML practitioners we'll see in the long
| run. And even though the world will go along fine -- it always
| does -- it'd be a sad outcome, since the only way it could
| happen is if gigantic corporations were flagrantly flying in
| the face of the spirit of copyright, daring it to punish them.
|
| My point is, please care about the right things. No one cared
| about language filters on ML models outside of a select vocal
| group, yet look how deeply OpenAI took those concerns to heart.
| Everybody cares whether their personal or professional work is
| being ripped off by an overfitted AI model, and it wasn't
| obvious that GitHub or OpenAI gave it more than a passing
| thought.
|
| Backlinking to the training set should help. But it's also
| going to catapult the concern of "holy moly, this code is GPL
| licensed!" to the front and center for anyone who works in a
| corporate setting. Gamedev is particularly insular when it
| comes to GPL, and I can just imagine the conversations at
| various studios. "This thing might spit out GPL? We can't use
| this."
|
| My point is, when you launch that new feature to address
| people's concerns, please ensure it's working. You won't be
| able to rely on exact string matches against the training set;
| you can't say "well, it's slightly different, so it's not
| really the same thing." If it's substantially similar, it needs
| to be cited.
| And that seems like a much tougher problem than merely building
| an index of matching code fragments.
|
| If you launch it, and it doesn't work, it's going to stoke the
| flames. Careful not to roast.
| onionisafruit wrote:
| FYI, the post you replied to is entirely a quote from the
| article (even though the formatting makes it appear that only
| the second paragraph is a quote). So the poster is likely not
| working on Copilot.
| blamestross wrote:
| Takeaways in summary:
|
| - Copilot does sometimes rote-copy in nontrivial situations
| - Mostly this happens when there isn't much context to go on
| - Provided an empty file, it proceeds to recommend writing
|   the GPL
| - They will add a "recitation detector" to Copilot to indicate
|   non-novel recommendations
|
| By the standards of corp-speak this is pretty good: they admit
| there is a problem and they intend to do something tractable
| to prevent it.
|
| This entire Copilot situation is far enough outside my personal
| "mental ethics model" that I'm abstaining from taking a stance
| until I have had a lot more time to think and learn about it.
| toxik wrote:
| Uh, how about a disclaimer that this analysis is made /by
| GitHub/?
| onionisafruit wrote:
| It's on https://docs.github.com. Of course it's by GitHub.
| notatoad wrote:
| i don't think that's necessarily obvious. most links
| submitted to HN from the github.com domain aren't authored by
| github.
| onionisafruit wrote:
| That's understandable. I thought the GP was saying that the
| article should contain the disclaimer.
| iudqnolq wrote:
| I continue to be surprised GitHub shows examples that wouldn't
| compile/run correctly. For example, the Wikipedia scraping
| example that the author claims is also the intuitive way to
| solve the problem assigns each row to the global variable cols
| instead of appending. Further, the following if statement
| appears to be mis-indented.
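The assign-instead-of-append bug described in the comment above is a classic loop mistake; here is a minimal sketch of that bug class (the function and variable names are invented for illustration, not taken from the actual Copilot example):

```python
def collect_cells_buggy(rows):
    """Bug: assigns to 'cols' on each iteration, so only the
    final row's cells survive the loop."""
    for row in rows:
        cols = row.split(",")  # overwrites 'cols' every iteration
    return cols

def collect_cells_fixed(rows):
    """Fix: accumulate each row's cells into a list."""
    cols = []
    for row in rows:
        cols.append(row.split(","))  # appends instead of overwriting
    return cols

rows = ["a,b", "c,d"]
print(collect_cells_buggy(rows))  # ['c', 'd'] -- last row only
print(collect_cells_fixed(rows))  # [['a', 'b'], ['c', 'd']]
```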
| qayxc wrote:
| > I continue to be surprised GitHub shows examples that
| wouldn't compile/run correctly.
|
| Why is that surprising to you? CoPilot doesn't actually know
| how to code; it just generates symbols that match learned
| patterns.
|
| Sometimes these generated symbols don't represent valid code,
| and since CoPilot doesn't actually perform filtering based on
| syntax checks or a JIT, these results end up as suggestions.
|
| This is actually a point where future versions could greatly
| improve usefulness, e.g. using compiler infrastructure to
| verify and filter generated results.
|
| This includes auto-formatting and even result scoring by code
| metrics (conciseness, complexity, ...). Plenty of room for
| improvement even without touching the underlying model.
| iudqnolq wrote:
| I'm not surprised the actual results are flawed. I'm
| surprised the handful of hand-picked examples GitHub calls
| out are flawed. Usually hand-picked examples of algorithms
| show the best-case performance.
| thunderbird120 wrote:
| It should be noted that this kind of behavior is entirely
| expected from a GPT-style self-supervised sequence model. Rote
| memorization in this kind of model is indicative of correct
| training, not overfitting. The underlying training objective
| of these models ideally results in a representation of the
| training data which allows complete samples to be extracted by
| using partial samples as keys. Actual overfitting in this kind
| of model requires absurd parameter counts. See
| https://tilde.town/~fessus/reward_is_unnecessary.pdf
| amelius wrote:
| Does it substitute variables correctly? E.g. if I define
| max(a,b) or max(x,y), does it complete the definition with the
| right variable names?
| onionisafruit wrote:
| Generally, yes. It's not guaranteed to do that correctly, but
| I've not seen it get variable names wrong so far.
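The syntax-check filtering qayxc proposes can be sketched in a few lines: parse each generated suggestion and drop the ones that fail. This is a hypothetical illustration using Python's built-in `ast` module, not anything Copilot actually does:

```python
import ast

def syntactically_valid(snippet: str) -> bool:
    """Return True if the snippet parses as Python source."""
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False

# Hypothetical model outputs: one valid, one missing a colon.
suggestions = [
    "def add(a, b):\n    return a + b",
    "def add(a, b)\n    return a + b",
]
valid = [s for s in suggestions if syntactically_valid(s)]
print(len(valid))  # 1
```

A real system would go further (type checks, test execution, the "compiler in the loop" the8472 mentions below), but even this cheap parse step would remove suggestions that cannot run at all.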
| tyingq wrote:
| There's an example called "fetch_tweets.py" at the bottom of
| the page on https://copilot.github.com/ that gets it wrong:
|
|     def fetch_tweets_from_user(user_name):
|         # deleted some lines here...
|         # fetch tweets
|         tweets = api.user_timeline(screen_name=user,
|                                    count=200, include_rts=False)
|
| screen_name=user there isn't right.
|
| It's a nit, but it is interesting how many of the hand-picked
| examples on that page aren't right, since they were presumably
| hand-picked to show the product off.
| [deleted]
| the8472 wrote:
| > It's not guaranteed to do that correctly
|
| Which is odd, considering they could run this as a beam search
| with the checking part of a compiler in the loop.
| tyingq wrote:
| The analysis seems to depend on sequences where the same exact
| words appear X times in the same order. If my understanding of
| how this works is right, it has the ability to globally change
| symbol names based on the prompt. And probably other things
| make a literal match less likely even when the difference is
| trivial: symbol names being swapped, use of equivalent
| operators (+=1 vs ++, etc.), order swaps where order doesn't
| matter, etc.
|
| Of course, I'm just speculating since I don't have access to
| the product, but I have seen GPT-3 output that is verbatim
| plus some synonym swapping.
| ionwake wrote:
| Sorry for the basic question, but is the code one builds on
| this platform saved to GitHub's servers?
| rvz wrote:
| Anything pasted or typed into that Copilot editor is sent to
| GitHub as 'telemetry'.
|
| > In order to generate suggestions, GitHub Copilot transmits
| part of the file you are editing to the service.
|
| So yes.
| astrange wrote:
| That's not telemetry; it's the prompt for the model to
| generate the rest of the text. I would assume it's not saved.
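One way around the literal-match problem tyingq raises is to canonicalize identifiers before comparing, so snippets that differ only in variable names still match. A minimal sketch using Python's standard `tokenize` module (an assumed approach for illustration, not the method GitHub describes in the article):

```python
import io
import keyword
import tokenize

def normalize(code: str) -> tuple:
    """Replace each distinct identifier with a positional placeholder
    (_v0, _v1, ...) so snippets that differ only in naming compare
    equal. Keywords and operators are kept as-is."""
    names = {}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            # First occurrence of each name gets the next placeholder id
            out.append(names.setdefault(tok.string, f"_v{len(names)}"))
        elif tok.type in (tokenize.NL, tokenize.INDENT, tokenize.DEDENT):
            continue  # ignore pure-layout tokens
        else:
            out.append(tok.string)
    return tuple(out)

# Same code shape, different variable names -> identical signature
print(normalize("total = x + y") == normalize("result = a + b"))  # True
```

This only catches renamings, not the operator or statement-order equivalences tyingq also mentions; those would need something closer to AST-level comparison.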
| onionisafruit wrote:
| I wouldn't think to call that telemetry either, but it is
| addressed in this doc about telemetry:
| https://docs.github.com/en/github/copilot/telemetry-terms#ad...
| Hamuko wrote:
| Don't know about saved, but it's definitely sent.
|
| I guess one risk, now that we know GitHub regards all source
| code, regardless of its license, as fair game for training
| Copilot, is that you probably can't know for sure that your
| new code is not being used to teach the model more.
| onionisafruit wrote:
| It's clear the author of this article had access to the code
| that triggered the Copilot suggestions. They also say this
| was from an internal trial of Copilot, so it might be that
| these trial users were told their code could be seen by their
| coworkers.
| k__ wrote:
| yes
___________________________________________________________________
(page generated 2021-07-03 23:00 UTC)