[HN Gopher] Analyzing the legal implications of GitHub Copilot Author : HNCommenterAD Score : 149 points Date : 2021-07-15 16:08 UTC (6 hours ago) (HTM) web link (fossa.com) | darau1 wrote: | I honestly thought the great-gitlab-exodus indicated that people | saw this coming a hundred miles away. | heavyset_go wrote: | > _"If you look at the GitHub Terms of Service, no matter what | license you use, you give GitHub the right to host your code and | to use your code to improve their products and features," Downing | says. "So with respect to code that's already on GitHub, I think | the answer to the question of copyright infringement is fairly | straightforward."_ | | GitHub's Terms of Service doesn't override licensing terms. | invokestatic wrote: | Actually, it does. When you upload code to Github, you are | effectively "dual licensing" the code to them under Github's | terms. Github is not bound by any other license you may have | applied to your code, because it did not agree to those | terms. It only agreed to the terms spelled out in the Terms of | Service. Of course, there are edge cases in which you could | upload code to Github that you do not own, and I do not | know the answer for those. | lakecresva wrote: | It doesn't have to; the rights you grant Github when you agree | to the ToS and upload your work exist independently of any rights | you might grant as part of the repo's license. | sampo wrote: | > Downing thinks there's a strong case that Copilot uses said | code in a transformative manner, which would support a fair use | argument that there is no copyright infringement. | | Fair use seems to be a legal concept that mostly only exists in | the anglosphere. How will this play out in the many other countries, | then? 
| | https://en.wikipedia.org/wiki/Fair_use | Rolpa wrote: | Here's an inquiry for those more knowledgeable about IP law than | myself: what's the state of the law regarding training an AI on | copyrighted material besides code? I was debating this with | someone in relation to the high definition texture packs for old | games people have been making using models such as ESRGAN - do | these infringe the copyright of the rights holders of the | original assets? Or are they considered sufficiently | transformative to be considered an original work? | mdasen wrote: | The problem with GitHub Copilot is that you never quite know | where the suggestion comes from. | | As the article notes, longer and more complex blocks of code are | most likely copyrightable. | | > GitHub reports that Copilot is mostly producing brand-new | material, only regurgitating copies of learned code 0.1% of the | time. | | For me, the issue is one of risk. Let's say that you have 100 | developers at your company making software for you and they | decide that Copilot is great. 1 in 1,000 suggestions is | regurgitated code verbatim. Let's say that only 1 in 10 of those | suggestions is sufficiently long and complex enough that it | warrants copyright protection. Within a week, you'd have to | assume that you have dozens of copyrighted pieces of code in your | codebase. The big issue is that you now don't know where the code | came from and which pieces might be direct copies. It opens up a | bit of a can of worms for a company looking to avoid risk. | | I think one of the pieces that might get overlooked is someone | trying to weaponize Copilot. For example, Wikipedia has seen | people upload creative-commons licensed media to Wikipedia and | then become very litigious against people who might be slightly | off in the attribution requirements. Attribution requirements are | often more complicated than just "provide whatever attribution | you think makes sense." 
The images are legitimately creative-commons | licensed, but if someone doesn't provide the correct | attribution, the uploader sues them. This attribution can include the | documentation of the modifications made, author, link, link to | the license (which I think a lot of people forget), copyright | notice, etc. | | https://news.ycombinator.com/item?id=27606035 | | I don't think most people are looking to be copyright trolls. | However, Copilot offers a neat little way to potentially inject | your code into other people's programs. Will people start | searching for uses of their code and use it as a form of | copyright trolling? I don't think most people will, but we've | seen it happen with patents and images. | | If you have a hundred engineers creating dozens of Copilot-suggested | blocks per day, we're talking around a million blocks | in a year. I don't think the odds of any individual suggested | block being a problem are high. The issue is when you start | scaling that up. If we're talking about a large company, the risk | can start getting large. You don't know where the code came from | and it starts getting likely that verbatim pieces of someone | else's code are finding their way into your codebase. | | Does Copilot offer enough value to offset this risk? Will future | versions of Copilot make sure that the suggestions are | sufficiently different from the training source? Heck, there can | even be chicken-and-egg problems where someone claims copyright | on a block of code that was generated by Copilot and you then | have to prove that your identical code generated by Copilot isn't | an infringement. Can you prove "yes, Copilot would have generated | that code block before you pushed your code into Github" when | they claim "Copilot only generated that code block for you | because of our code on Github"? It might not even be an evil | company doing this. 
Large companies often have no idea what | different parts of the company are doing - especially several | years later. | | One thing I want to make clear is that this isn't just about | cases that would win on their merits. One of the big parts of the | Wikipedia discussion on overly-litigious uploaders is that it can | cost a lot more to fight infringement claims than they're asking. | If someone slaps your startup with a $250 "you stole my | copyrighted code" claim, do you hire a lawyer at an hourly rate | that might cost more, risk a trial costing tens of thousands, and | risk a judgement against you? Or do you pay them off with a small | amount of money to make it go away? I'm not saying this is a good | situation. I'm just noting that it definitely exists and trolls | can try and come after you at the worst times like when you're | trying to raise funding. Do you decide to fight it when you're | trying to IPO? Do you let the IPO price sink by a few percent and | lose lots of money when they're just looking for $5,000 to go | away? | | It just seems like adding a lot of risk. | tvirosi wrote: | If this really counts as fair use it turns into a giant loophole | to steal any IP you want. Just create a website with a github-like | TOS, upload some disney copyrighted pictures to it, train a | GAN super overfitted on the images, and then claim mickey mouse | as your own. | invokestatic wrote: | The legal system is generally pretty nuanced, considering | things such as intent and purpose. In this particular case, it | doesn't really matter how the new work was generated or | created. I don't really think that would be very relevant. The | most important factors would be how similar the new work was to | the original work, the intent, and how the new work affects the | value of the original. | | Your proposal is just so substantively different from Copilot | that I don't see how the arguments for Copilot would apply. 
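Editor's aside on mdasen's risk estimate above: the figures quoted there (100 developers, 0.1% verbatim regurgitation, 1 in 10 of those copyrightable) can be multiplied out directly. The sketch below is only a back-of-the-envelope check, not anything from the article or from GitHub. Two inputs are assumptions introduced here because the comment leaves them open: "dozens" of suggestions per developer per day is taken as 36, and a working year as 260 days.

```python
from fractions import Fraction

# Rates quoted in the thread: 0.1% of suggestions are verbatim copies,
# and the commenter supposes 1 in 10 of those is long enough to be copyrightable.
VERBATIM_RATE = Fraction(1, 1000)
COPYRIGHTABLE_SHARE = Fraction(1, 10)


def expected_blocks(developers, suggestions_per_dev_per_day, workdays):
    """Return (total suggestions, expected verbatim, expected copyrightable)."""
    total = developers * suggestions_per_dev_per_day * workdays
    verbatim = total * VERBATIM_RATE
    copyrightable = verbatim * COPYRIGHTABLE_SHARE
    return total, verbatim, copyrightable


# Assumptions (not from the thread): "dozens" = 36 per day, 5-day week, 260-day year.
week = expected_blocks(developers=100, suggestions_per_dev_per_day=36, workdays=5)
year = expected_blocks(developers=100, suggestions_per_dev_per_day=36, workdays=260)

for label, (total, verbatim, copyrightable) in (("week", week), ("year", year)):
    print(f"{label}: {total} suggestions, "
          f"{float(verbatim):.1f} verbatim, "
          f"{float(copyrightable):.1f} likely copyrightable")
```

Under these assumed inputs the yearly total lands near the "around a million blocks" figure, while the expected number of copyrightable verbatim blocks is roughly two per week and on the order of a hundred per year. The weekly count scales linearly with the suggestions-per-day assumption, so a heavier usage pattern would raise it proportionally.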
| 6gvONxR4sf7o wrote: | You can't claim mickey mouse as your own, but you can exploit | all the labor that went into creating all the work you're | training on. Point a generative model at someone else's labor | and now it'll do that for you. It seems like the person whose | labor is being used should be somehow compensated, or at least | have some say in its use. | sobellian wrote: | The lawyer's argument is that Copilot's query system is | transformative. If you assemble Copilot's output to replicate a | copyrighted work, then even if Copilot isn't infringing, you | are by taking that work out of context. The burden is on the | owner to ensure they don't infringe. | erhk wrote: | You can't just upload copyrighted photos. You don't own them. | tompazourek wrote: | This is interesting. | | But I think you'd violate Disney's copyright by uploading their | pictures to the website. | | To make it work, Disney would have to upload the pictures | themselves and agree to the TOS. | heavyset_go wrote: | Is GitHub making sure that license terms are being met when | they train Copilot on hosted code? Because anyone can rehost | code that they don't have the rights to, and it seems like | GitHub will still train Copilot on it. | tompazourek wrote: | If someone is rehosting code that they don't have copyright | to, it's as if someone uploaded a pirated movie to | YouTube. | | YouTube will still make money from it for some time | (selling ads, luring customers in, ...), then the copyright | holder asks YouTube to take it down, and then they take it | down. | | The difference is that open source authors don't care that | much about that. But maybe now they will when they see what | GitHub is doing... 
| heavyset_go wrote: | > _YouTube will still make money from it for some time | (selling ads, luring customers in, ...), then the | copyright holder asks YouTube to take it down, and then | they take it down._ | | YouTube isn't publishing derivative work from the videos | it hosts, though, like Microsoft is doing with Copilot | and GitHub. | | If Copilot was trained on material it doesn't have the | license to, it can potentially output that unlicensed | code it was trained on, like in this example[1]. | | Copilot could serve up copyrighted work in the same way | YouTube does, but the analogy isn't complete, because | YouTube itself isn't a derivative work in the same way | the Copilot's model is a derivative of the data it was | trained on. | | [1] | https://twitter.com/mitsuhiko/status/1410886329924194309 | 6gvONxR4sf7o wrote: | Regardless of whether it's fair use, copilot wouldn't be possible | without the enormous amount of person-hours of work that has gone | into writing the code it was trained on. There should be some | kind of compensation for the content creators when their work is | used to train models. The fair use argument is that "I could see | it" is enough to justify no compensation and no say in how their | work is used. | | Legal? Probably. Should we do better? Probably. | | At the very least, it should be opt-in. We'll probably need new | IP law to make this kind of thing opt-in. | dekhn wrote: | These agree with my conclusions- it's fair use or permitted by | license, but that it remains untested (as the GPL does in a | larger sense) by law. | | I guess in about 5 years we'll see Softbank v. GitHub CoPilot in | the supreme court deciding whether ML can make transformative | work. | shadilay wrote: | Often times legal issues are more a question of finding a | plaintiff with enough money to sue rather than the letter of | the law. | kyrra wrote: | Google v Oracle took 11 years to reach conclusion. 
| wolverine876 wrote: | Why do we accept that the courts are so slow? I don't | understand why there isn't a drive to reform courts by | accelerating outcomes by an order of magnitude, and by making | outcomes not depend on wealth. | sokoloff wrote: | A court needs a case. A case requires at least one litigant | who is willing to go all the way to the extent of forcing a | court hearing (rejecting settlements along the way and | risking that a court will decide against them). I don't see | the courts as being the rate-limiting factor when we're | contemplating licenses that are 30 years old last month (GPL | v2) | [deleted] | gnopgnip wrote: | The courts being slow to change and respecting precedent is a | feature. It should be on Congress to change the law and | amend the Copyright Act | pc86 wrote: | You're assuming that the speed of a court case has something | to do with the wealth of the participants? | AdamJacobMuller wrote: | "no matter what license you use, you give GitHub the right to | host your code and to use your code to improve their products and | features" | | I contribute my code to X project outside of github (say on a | mailing list) under the explicit understanding that my code is under | GPL (say GPLv3 to be specific). If someone later uploads my code | to github and github uses my code to train their ML model in | violation of GPLv3, isn't the point that the person who uploaded | my code to github is in violation of GPL by giving it to someone | else under less restrictive terms? | | Does this mean that the github terms of service are perhaps | fundamentally incompatible with uploading code under copyleft-style (or | perhaps specifically only GPLv3 level) restrictive licenses? | | And, if so, probably they always were but nobody cared until now. 
| tyingq wrote: | _"If you look at the GitHub Terms of Service, no matter what | license you use, you give GitHub the right to host your code and | to use your code to improve their products and features," Downing | says. "So with respect to code that's already on GitHub, I think | the answer to the question of copyright infringement is fairly | straightforward."_ | | I don't know if it's really that straightforward. The TOS | includes snippets like this in that area: | | _"This license does not grant GitHub the right to sell Your | Content. It also does not grant GitHub the right to otherwise | distribute or use Your Content outside of our provision of the | Service"_ | | I'm omitting other language, but if you read that area of the | TOS, they seem to have purposefully scoped down their license-to-use | for hosting, backups, etc. | lindenksv85 wrote: | They technically don't distribute any code. They show it to you | in a hosted environment. It's the user that causes a | distribution. They "provide it as part of the service." | tyingq wrote: | I'm not sure how sending it over the wire to Visual Studio | doesn't count as distribution. Distribution without | attribution, reference to where it came from, how it's | licensed, etc. | dheera wrote: | Conceptually it's not particularly any different from | distributing it to your web browser. They basically just | turned Visual Studio into a fancy Github browser that has | some editing features. | inlined wrote: | I disagree. They're the API host, so they're the | distributor as well as the user agent | zufallsheld wrote: | In the same vein, streaming sites don't distribute movies, | they show them to you in a hosted environment. And this | argument did not hold up in court. | lindenksv85 wrote: | OSS license obligations mostly kick in upon distribution, | hence this is a pivotal concept in this context. It's also | important because of the language in the TOS that says the | code won't be used outside the service. 
The stuff related | to streaming is kind of unrelated here because movies | aren't under copyleft licenses and so the question of | whether or not there was distribution there is not | relevant: the question is whether or not the copyright | holder's monopoly rights were violated and those include the | right of public performance, public display, as well as | distribution. They would have violated other copyright | rights even without a finding of distribution. | [deleted] | tomrod wrote: | Further, what if I branched the code from something hosted | outside of Github -- and failed to follow proper attribution? | | This is a huge legal mess and it's not being used to IMPROVE | Github products and ops, it IS the Github product. | kmeisthax wrote: | It's actually fairly difficult to remove attribution from a | Git repository. It's embedded in each commit. You'd have to | rewrite the entire project history - something far different | from just "branching" a repo. | matmann2001 wrote: | Technically, they aren't distributing "your" code. It's | laundered through their machine learning algorithm first. | hansvm wrote: | Is that actually different legally from an "ML" algorithm | that xors the code with the same garbage 1M times or | otherwise does something expensive to implement a noop? | grawprog wrote: | > to use your code to improve their products and features | | >It also does not grant GitHub the right to otherwise | distribute or use Your Content | | I'm curious. Copilot isn't actually part of github. It's a | plugin for Visual Studio; wouldn't that mean copilot is | distributing code hosted on github, outside of github? You | can't use copilot without visual studio. | | How is this not Microsoft just parasitizing all the code hosted | on github to make visual studio better? Which as far as I know, | despite being owned by Microsoft, is not actually part of | github. | LeifCarrotson wrote: | I think it all depends on who 'they' are and what 'their | products' are. 
The definition says: | | > _' GitHub,' 'We,' and 'Us' refer to GitHub, Inc., as well | as our affiliates, directors, subsidiaries, contractors, | licensors, officers, agents, and employees._ | | Does that also include OpenAI? Does it include the Visual | Studio team? All of Microsoft? | | The license granted by users to Github is: | | > _We need the legal right to do things like host Your | Content, publish it, and share it. You grant us and our legal | successors the right to store, archive, parse, and display | Your Content, and make incidental copies, as necessary to | provide the Service, including improving the Service over | time. This license includes the right to do things like copy | it to our database and make backups; show it to you and other | users; parse it into a search index or otherwise analyze it | on our servers; share it with other users; and perform it, in | case Your Content is something like music or video._ | | > _This license does not grant GitHub the right to sell Your | Content. It also does not grant GitHub the right to otherwise | distribute or use Your Content outside of our provision of | the Service..._ | | IANAL, but naively, Github appears to be a code hosting | platform. If they need to analyze my code to make it work | with Git and with their code hosting features, that makes | sense. For example, they might have a feature to prevent | inadvertent commits of private keys, and would need to parse | my code to do so. Maybe my code contains stuff that doesn't | work with their generic private-key-finding parser, and they | need to specifically run a subset of my code on their | platform through their parser in a debugger to fix the | feature. That's a sensible license to grant to a code hosting | platform, they're not a no-knowledge encrypted storage | provider. | | They don't appear to be a software vendor that sells code to | other private parties for use in closed-source applications. 
| Their license appears to specifically deny them the right to | sell snippets of my code to others. | | I suspect, however, that this isn't a black-and-white factual | issue, rather, one for a court to decide. One could probably | hire an attorney to argue any possible angle on the legality | of Copilot. And by a similar mechanism to the "Winner's | curse", the company who developed a tool like Copilot would | always have been one where their internal counsel advised | them that what they were doing was totally legal. | lindenksv85 wrote: | The "services" is that which is provided by GitHub. "GitHub" | is defined to include all of its affiliates, including | Microsoft. | nomoreplease wrote: | And "and to use your code to improve their products and | features," does not explicitly include "or to create new | products and features". | | CoPilot is a NEW product, not an existing product (Github | itself) that the ToS gives permission to improve. | matmann2001 wrote: | Technically, they aren't distributing Your Content. It's | laundered through their machine learning model first. | tyingq wrote: | With degrees of laundered varying from _" copied verbatim"_ | to _" minor things like symbol names changed"_ to _" actually | transformed significantly"_. | [deleted] | BeefWellington wrote: | An interesting thought experiment around this whole topic: If I | were to take all the scripts of profitable films rated G or PG | and train an AI on it, generate a bunch of scripts, then made | movies out of those scripts, would I lose in court? | | Tangibly, how is this AI method substantially different from | non-clean-room implementations? | | In terms of business use, it seems incredibly risky to me to | even just *use* GitHub since their license agreement/ToS permit | them to use my code to improve their tools which now apparently | includes tooling where it may copy your code wholesale as | someone else's suggestion. | kmeisthax wrote: | No thought experiment needed. 
If I watch a bunch of movies | and then make my own movie, whether or not I lose in court | depends on whether the movie I made is at least "substantially | similar" to any movie I happened to watch - or, in other | words, had "access" to. That's a fact-intensive thing that | juries usually decide on a case-by-case basis. | | The difference between that and having an AI do it is | probably low. My gut instinct is that using an AI constitutes | "access" to the AI's training corpus, so if it spits out | something at least substantially similar to that corpus, then | I'm infringing if I use that output. If it _doesn't_ | constitute access, then a copyright owner would have to prove | "striking similarity", which would really only cover things | like using Copilot to spit out fragments of old Quake code | verbatim. | | Clean-room is a way of arguing down the level of access that | you have to something that you want to make a non-infringing | copy of. It usually requires having actual attorneys review | everything the clean-room engineers get to see, and stripping | out the parts that are actually copyrightable. Merely | training an ML system on input as a way to only have access | to the uncopyrightable parts of that input probably wouldn't | work. | | Pretty much every Internet service is going to have similar | clauses to GitHub's; because anything else would basically be | a "click here to make me liable for copyright infringement" | button. In fact, I wouldn't be surprised if merely running | something like GitHub but without a ToS would still give you | similar levels of implied license over whatever people push | to your server. | kzrdude wrote: | What about open source projects where the uploader and github | users are not the only copyright holders? As a user I can't | grant github any random license for the code, if I maintain for | example Linux or python or any other old project there. | | The ONLY available terms are those given by the license, | surely? 
| lindenksv85 wrote: | If you are putting up code on GitHub to which you don't have | all the rights you're actually in violation of their TOS and | you are violating the rights of other copyright holders. I | understand this is common and may not violate community norms | or expectations but it is technically a license violation on | multiple fronts. Contributors who add to existing GitHub | projects are providing the same license to GitHub as the | project maintainer though per the TOS. | rightbyte wrote: | I guess the Github Copilot authors did not handpick | projects they checked were legitimately put on Github. So | they are accomplices in that case. | filoleg wrote: | YouTube doesn't really handpick things that get put on | their platform either, beyond very basics and whatever | automated tools they have to cover that. | | Beyond that, that's what DMCA takedown requests are for. | Github would only be an accomplice in that case, if they | got a legitimate DMCA takedown request and chose to | completely ignore it. | jolmg wrote: | > If you are putting up code on GitHub to which you don't | have all the rights you're actually in violation of their | TOS and you are violating the rights of other copyright | holders. | | I can't find where in the TOS it says that you must "have | all the rights [to the code]". It just says that you must | not violate copyright nor other laws.[1] FOSS licenses by | definition permit redistribution, so uploading to GitHub | seems to be in-line with the license granted by the | copyright holders. | | What are the violations you mention? | | > Contributors who add to existing GitHub projects are | providing the same license to GitHub as the project | maintainer though per the TOS. | | Sure, but that's not the only way. If you contribute to a | FOSS project elsewhere, those changes go under the same | license of the project. Whoever you pass those changes to | has liberty to redistribute per the terms of the license. 
| The TOS is unneeded to legally redistribute FOSS-licensed | projects with GitHub. | | The TOS saying that you must grant GitHub these permissions | is only to protect GitHub in cases where people upload | projects without licenses. | | [1] in addition to content restrictions, like no porn. | sudosysgen wrote: | But legally, they can't provide such a license. So GitHub | can't have that license, surely, because they never had the | legal authority to bestow it upon Github. | lindenksv85 wrote: | That was a problem before copilot though. And copyright | holders have and will continue to have the right to send | DMCA take-down notices if they like. | jacoblambda wrote: | But the thing to note is that a user can have a right to | distribute (as with GPL) but does not necessarily have | the rights to the license. | | So if the user uploads the source to GitHub, they agree | to the terms (which they may not actually have the rights | to) but that isn't equivalent to the rights owner giving | GitHub the rights to distribute the source under a | different license. | | The TOS can only modify those distribution terms (if it | even can be found to be legally binding) if the user | uploading the source is the rights owner which in so many | cases is not the case. | josephh wrote: | I think the bigger question is whether GitHub will be | able to honor DMCA requests that pertain to copyrighted | materials showing up in Copilot's suggestions. | Animats wrote: | A third party who finds their GPL code on Github but is | not themselves a user of Github has a right of action. | They're not bound by Microsoft's terms. | lakecresva wrote: | I'm not sure that someone who published their work under | the GPL hasn't thereby given consumers the right to put | the repo on github. If the rights Github asks for in | their ToS can be construed as a subset of the rights | granted by the GPL, Github is just another GPL licensee. 
| Unless they violate the conditions of the license, | they're just utilizing their GPL rights. | eitland wrote: | > Github is just another GPL licensee. Unless they | violate the conditions of the license, they're just | utilizing their GPL rights. | | And here is exactly the problem. | | GitHub seems to be copying copyrighted code left and | right _and pretending they made it!_ | | No attribution, no license. | | They are of course allowed to let their AI study the | code, but as "employer" of that AI GitHub/Microsoft has a | responsibility if that AI breaks copyright left and right | and they as a company pretend the code is theirs to give | away. | AdamJacobMuller wrote: | > is not themselves a user of Github | | Is it that widely scoped? Can't we narrow it to "A third | party who finds their GPL code on Github but has not | uploaded that specific code to Github themselves has a | right of action limited to that specific code." | | Just because I created a github account once and agreed | to the TOS doesn't mean that I agree to let others upload | my code to github; where would that scope end? Could | someone steal code off my computer which I've never | published and put it on Github, and that would be OK because I | once signed up for a github account? Clearly a contrived | example, but still. | kzrdude wrote: | Today is the first time I've considered that, but it's | certainly something we should think about. If big projects | moved on this, I think github would take notice and "issue | a clarification". | btilly wrote: | _I don't know if it's really that straightforward._ | | It gets worse. To the extent that it is that straightforward, | the correct takeaway is that you do not have permission to | include someone else's GPLed code in your Github repository. 
| | And that to the extent that GitHub relies on that permission in | using the code that they host, they are liable for potential | copyright claims from copyright owners that they have no | relationship with, who never gave GitHub permission to use that | code. | | I therefore think that GitHub should do some careful thinking | about how much they can rely on a ToS to do as they want with | the copyleft code that they host. And I further think that | people who host GPLed projects should ask whether GitHub is | where they should be hosting those projects. | | (Insert the mandatory, "I am not a lawyer and this is not legal | advice.") | [deleted] | phkahler wrote: | Yeah, I don't think bettering their products includes verbatim | incorporation of code into those products. | | Also, for the part about small snippets being non | copyrightable. I would suggest looking at the Google/Oracle | case. Google was found guilty of infringement for a very small | number of lines, but the award to Oracle was IIRC rather a joke | (something like one dollar, indicating it was infringing but | largely irrelevant). | zja wrote: | The Supreme Court found Google's use to be fair, not | infringement. | jcelerier wrote: | the supreme court did not reconsider the previous judgment | on the 9 lines of sorting algorithm being copied, which was | _not_ considered fair use | MrStonedOne wrote: | >"If you look at the GitHub Terms of Service, no matter what | license you use, you give GitHub the right to host your code and | to use your code to improve their products and features," Downing | says. "So with respect to code that's already on GitHub, I think | the answer to the question of copyright infringement is fairly | straightforward." | | Not as straightforward as they think though. 
| | If a code project used (a)gpl code found elsewhere on the | internet in their repo, and another user took the project and | hosted it on github, the tos can not give github a license to use | the code outside of the license given by (a)gpl. Even if github | thinks they have one, that won't shield them from legal | liability, nor will it shield co-pilot users from being legally | compelled to (a)gpl their code if a court case was won on those | grounds. | | The github tos is basically a non-factor in this case. | SethTro wrote: | I find the first argument, that if your project is on GitHub | then they have the right to train on it, weak. Plenty of projects | are hosted elsewhere but have been mirrored by random users (i.e. | not the copyright holder) to GitHub | kbenson wrote: | I think they have a right to train on it, but not to present | portions verbatim. Do you have a right to look at a bunch of | open source code and come to conclusions about good programming | practices? Are you prevented from knowing that a specific | library in a language is good/common for a specific task | because you see others using it? | | That's analogous to training, where there are associations | between things, in my mind. I don't think that means they can | provide licensed code verbatim though, just as you should not | copy GPL code directly out of a Github repo and paste into your | own private commercial code base. | heavyset_go wrote: | You're taking the machine "learning" metaphor literally. A | human being learning something is not analogous to training | an ML model. Training models is more analogous to compilation | or lossy encoding or compression. | michaelpb wrote: | The biggest mistake of the ML field is its metaphorical | naming. So many people seem to be taking Artificial | Intelligence, Machine Learning, Neural Networks etc | literally. 
They don't do this for other concepts in coding | (e.g., for an absurd example, no one is arguing we ride a CPU | "bus" to work), but with ML algos it's a free-for-all. | Grandiose naming conventions might be good for extracting | VC money but it's also seriously confusing people. | kbenson wrote: | I'm thinking more "association" than "learning", and in | both cases. | | If an algorithm of some sort scans a bunch of repos | regarding video encoding and decoding and sees a lot of | ffmpeg use, it might associate ffmpeg with video encoding | and decoding, and decide to present some info about ffmpeg | and a _generic_ snippet to include ffmpeg as a library and | initialize it if it associates the current project with | that. | | If I have perused a few encoding or decoding repos at some | point and I think of the current project as having to do | with encoding or decoding of video, I might immediately | think ffmpeg even if I've never used it in a project as a | library because I remembered seeing it in projects that | used it, and look for some initialization code. | | In what ways are these materially different? What makes the | random conceptual associations in my head from what I've | seen previously different than an algorithm that collects | the same? | | > Training models is more analogous to compilation or lossy | encoding or compression. | | And learning in people isn't? Isn't all knowledge | transference in people analogous to lossy encoding and | compression? | | I don't know about you, but in college I don't remember | regurgitating sections of "Advanced Programming in the UNIX | Environment" to complete assignments, I remember studying | it, internalizing parts of it on a conceptual level (as | well as remembering specific fairly small chunks almost | exactly), and using that to solve problems or answer | questions or make associations. | | I'm not saying ML and learning in humans is the same.
I | do think for the very specific case presented here in how | it's used, there are some parallels. Feel free to disabuse | me of that notion if you have evidence that contradicts it | though. I'm not wedded to that position, but I would want | to see arguments to the contrary before abandoning it. | gdsdfe wrote: | well a lot of people in here don't like these conclusions | legerdemain wrote: | Somewhat tangentially, Kate Downing is also the person who | somewhat recently campaigned to raise awareness of the crisis in | affordable housing in the Bay Area and Palo Alto in particular, | and wrote a viral editorial after giving up and moving to the | more affordable Santa Cruz.[1] | | https://news.ycombinator.com/item?id=12288306 | guitarbill wrote: | It would be nice if we moved from a copyright discussion to an | ethical one, since it could be years until the law is even | tested. | | Is it ethical to do this, when some licenses are clearly chosen | because of e.g. attribution or sharing improvements? Did | Microsoft/GitHub consider the ethical implications, for example a | chilling effect on code being open sourced in future (i.e. people | choosing not to open source stuff so it doesn't get gobbled up by | Copilot et al.)? | dmitrygr wrote: | I wonder if one could enforce a license's "this code may not be | used to train any ML model of any sort for any reason without | prior permission". | progbits wrote: | I've been wondering the same though found basically no | discussion on this topic. | | Let's ignore whether GPL or whatever license allows GitHub to | do this - let the lawyers sort this out. Instead we should | focus on whether it is possible to legally prevent such | behavior via license. | | In other words, where is my GPLv4 with anti-ml clause? | EamonnMR wrote: | FSF and SFC and OSI if you're listening, this would be very | nice. 
| lindenksv85 wrote: | I think it's important to recognize that most ML models will | not be built on top of copyleft material. They will mostly use | data that we as users have voluntarily provided to someone at | some point and to which that platform now claims ownership. So | we need to think long and hard about whether or not we believe | any of these models should receive any copyright protection at | all and in a much broader context. I think if we insist on | claiming that copilot is copyrightable itself and should be | under GPL then we have totally capitulated with respect to all | other use cases in a way that actually further protects | incumbent advantage for large companies and which deprives | everyone else of any benefit or remuneration for their own | data. You're basically saying it's ok for companies to | privatize the collective knowledge of all of humanity. I'm not | on board with that. | guitarbill wrote: | I don't know if by "You're basically saying" you mean me | specifically, but if you do, you're dead wrong. I'm not ok | with this at all. However, I'm not so stupid as to think that I, as | a non-IP lawyer, can make sense of the current legal situation | (which is what copyright is; law) or even propose new laws. | | However, as a dev I can think about it and say "to me, this | is immoral and unethical", and refuse to use Copilot, not | work for any company that uses Copilot, not use | GitHub/Microsoft products, pull code from GitHub (if I had | any), and decide not to open source stuff in future. Ethics | has always been underemphasised in software compared to other | engineering disciplines. | | Generally, non-technical people are (more) impacted by ML, | but in this case it's us as developers and our open source | communities. So I hope devs will give it some thought this | time. And if this leads to devs thinking about ML more | carefully in general, great. Things don't have to be illegal | to be unethical.
| lindenksv85 wrote: | I didn't mean you specifically. I think the ethical | conversation is more interesting but I also think that | people will feel differently if, say, the Linux Foundation | releases its own version of copilot and it's not just one | company reaping the rewards of all that code. And I'd like | to make it easy for other competitors to do exactly that. | It will be harder for them to do that if we think that the | models themselves are copyrightable. I don't think | something like copilot is going to make anyone think twice | 5 yrs from now any more than we think twice about something | like google autocomplete or google search thumbnail images. | I think stuff like copilot if properly tuned won't be | providing a substitute for whole GPL projects. I don't | think OSS communities will be damaged by this in any way. | In fact those same oss communities are going to be some of | the biggest users of these sorts of tools just like they | use stackoverflow today. | erhk wrote: | Github is not required to open source your work | erhk wrote: | You can open source with Git without using github. You can self | host. | xdennis wrote: | If it's publicly available, there's no guarantee that | Microsoft won't gobble up your code. | kube-system wrote: | The ethical discussion certainly has its merits, but the legal | discussion is very relevant for those of us who do not want to | be part of the legal test case. | jefftk wrote: | _> If you look at the GitHub Terms of Service, no matter what | license you use, you give GitHub the right to host your code and | to use your code to improve their products and features. So with | respect to code that's already on GitHub, I think the answer to | the question of copyright infringement is fairly | straightforward._ | | This doesn't sound right. Alice writes code, and releases it | under some restrictive license (GPL, something source-available, | etc). Bob uploads it to GitHub, correctly labeled as GPL.
| Regardless of GitHub's TOS, Bob isn't able to give GitHub any | additional rights to the code beyond what Alice gave him. | | I think the later discussion about whether this falls under Fair | Use is the important question. | pc86 wrote: | If Bob is unable to give GitHub the rights that GitHub demands, | then it means Bob was unable to lawfully upload the code to | GitHub in the first place. You're making an argument that Bob | violated GitHub's terms, not that GitHub is violating Alice's | (though that may also be true). | mikeryan wrote: | You're right that Bob's the infringer here and not GitHub. | But I'm not sure where that would place the derived work | that's eventually used. | | Which is the point of the article a bit. GitHub is likely not | infringing but it's also not absolving the end user of any | infringement. Neat trick. | | That being said I think the risk is minimal enough that I'd | be pretty comfortable with using it. | eitland wrote: | I learned here on HN that contracts are supposed to be a | "meeting of minds". | | And in EU, as a consumer, you can pretty much ignore most | EULAs because they aren't valid if they break EU consumer | protections. | | Now if your interpretation is correct the idea of a meeting | of minds falls completely on its face. | | And, as a lot of individuals also upload their projects to | GitHub, GitHub is on shaky ground there as well. | | I think most EULAs have clauses like these in them but we are | always told that it is because of crazy American lawyers and | nothing to worry about. | | If Microsoft decides to prove once and for all that we should | worry about ridiculously broad claims in EULAs I think it | will be hard for GitHub to continue to operate in more sane | jurisdictions. | kuratkull wrote: | Violating the TOS is not illegal, Bob would be in breach and | GitHub could take measures against Bob's account. 
Github | would be violating Alice's copyright license though, that is | legally enforceable | lindenksv85 wrote: | Sort of. DMCA protects service providers against copyright | infringement claims related to stuff uploaded to their | services by third parties. So long as they adhere to DMCA | requests, they're not violating copyright law themselves. | bigwavedave wrote: | > Sort of. DMCA protects service providers against | copyright infringement claims related to stuff uploaded | to their services by third parties. So long as they | adhere to DMCA requests, they're not violating copyright | law themselves. | | This is probably an extremely stupid question as I'm | neither a lawyer nor an ML dev (merely an humble backend | developer), but let's say that the above situation | applies and that Github has taken down Bob's repo as per | Alice's DMCA request. However, let's say that in between | Bob uploading the offending code and Alice submitting the | DMCA request, Github used Bob's repo as part of a | training set for Copilot. Now that they've complied with | the takedown request, does Github have to restore Copilot | to an earlier state that hadn't yet been trained by Bob's | repo? Does this question even make sense since I only | know the absolute barest bones of ML? | rented_mule wrote: | Also not a lawyer, but I've been around ML for a while. | The question makes perfect sense to me! | | It takes some amount of time to comply with a takedown | notice. For example, time passes between receiving | Alice's notice and taking down Bob's repo. | | I would expect Copilot's model(s) to be retrained | periodically in order to remain relevant. The next | retraining could exclude Alice's code. That might be a | longer window than the case of the repo takedown, but as | long as it doesn't take _too_ long they might be okay? | | There are incremental training approaches that evolve | models over time rather than completely retraining them. 
| In my experience, complete retraining is a far more | common approach because the highly path-dependent nature | of incremental training can lead to outcomes that are | hard to manage. For example, what if you discover bad | training data like repos that collect anti-patterns? Or | Alice's takedown notice? You typically want your models | to be able to "unsee" things and that's hard with purely | incremental training. Even when incremental approaches | are used, there is often an occasional complete | retraining to overcome such issues. | | To be clear, I have no idea what training approach is | used for Copilot. | jefftk wrote: | GitHub's primary business is hosting open source software. | There's no way they are going to claim that every user who | uploaded code without owning the full rights is violating | their TOS. | dogleash wrote: | >There's no way they are going to claim that every user who | uploaded code without owning the full rights is violating | their TOS. | | If they're taken to court and that part of the TOS is | relevant to the issue, then yes, they can and will argue | exactly that. | eitland wrote: | Well then let's hope the judges apply the same standards | as when criminals claim they weren't aware that the money | they got was being laundered through them. | jefftk wrote: | That would destroy their business. | ipaddr wrote: | That's moving the goalposts after the fact. | | What about all of the code that existed before Microsoft | purchased them and before new license language was | introduced? | mikeryan wrote: | TOS language is usually not grandfathered in. | lindenksv85 wrote: | All of these terms of service have an assignment provision | that allows the provider to assign the agreement to an | acquirer. So the license you gave to GitHub moves to | Microsoft (though here the license likely remains with | GitHub because they are an independent subsidiary). All of | these agreements also say they can unilaterally change | terms whenever.
The terms are generally always broad enough | to cover these circumstances. | fhajm wrote: | This lawyer does not understand GitHub. Half of the code is | uploaded by third parties who do not hold any copyright. | | These people either think that for ideological reasons everything | should be on GitHub or they want Google links to their companies. | | Furthermore "improve their services" reasonably only applies to | their core service that was present _when people agreed to the | TOS_ and not to some new code laundering AI. | | It is frightening that this matter could be decided by such | lawyers in the US. People should just all leave GitHub, then | Microsoft can play with its own AI and enjoy the silence. | rhdunn wrote: | Then there's the issue of any project that uses CoPilot. For | example, if a developer of proprietary software uses this and | it is later found that the code matches GPL code, they would be | liable. Likewise if an open source uses code from a different | license or proprietary code via this. | | Looking at the source code or the function and variable names | in binaries, you cannot tell if CoPilot is used or not, so | there isn't a functional difference between someone copying | that code or CoPilot copying it. | coding123 wrote: | I just keep flagging this. It's getting over analyzed into the | ground. Sue them if you want but it's just a waste of time to | keep talking about this. | wolverine876 wrote: | > As we mentioned, GitHub trains Copilot on numerous pieces of | public code, many of which are covered by strong copyleft | licenses (i.e. GPL v2, GPL v3). Copyleft licenses require that | derivative works (of the copyleft-licensed code) must carry the | same license as the original code. | | Even when no GPL v2/3 code is quoted by Copilot, is using the | code for training a non-free product allowed under the license? | Under the license, is Copilot therefore now licensed GPL v2/3? 
| The GPL code was certainly used to create a critical, integral | part of Copilot, and to create its output. | | If I understand correctly, GPL v2/3 were designed to prevent non- | free products from being parasites on FOSS code, taking and not | giving. If that's the spirit of GPL, Copilot seems to clearly | violate it. | invokestatic wrote: | When you upload code to Github, you agree to license it to them | under Github's terms, and not whatever license the software is | typically distributed under. You are effectively "dual | licensing" software by uploading it to Github, whether you | realize it or not. Of course, there are edge cases in which you | don't have the rights to license the software to Github, but in | those cases, I don't have the answer. | jcelerier wrote: | > "To the extent you see a piece of suggested code that's very | clearly regurgitated from another source -- perhaps it still has | comments attached to it, for example -- use your common sense and | don't use those kinds of suggestions." | | how is "use common sense" even remotely a meaningful thing | rjzzleep wrote: | Especially coming from a supposed lawyer. | scintill76 wrote: | I thought this was weird too. Why are comments the dividing | line? Because they sound like a human? How do we know Copilot | won't regurgitate an exact copy of human code that doesn't have | comments? | | It's kinda surprising Copilot even reads and outputs comments. | pedrocr wrote: | Given this fair use argument that the work is probably | transformative enough here's what I'll be doing next. I'll take | the Windows and Office source code, run it through a decompiler | and then train a neural network on that output. This sequence of | steps should be at least as transformative of Microsoft's | copyright as what Copilot is doing with the open-source corpus, | probably much more so. I will then use that neural network to | write patches for ReactOS and WINE. 
Since those projects are very | wary of interaction with Microsoft copyrighted works could | Microsoft Legal please publicly state their assurance that all | this is perfectly legal use of their copyrights? Maybe that would | help convince people. | EamonnMR wrote: | Might be faster to generate verbatim copies of Disney IP. | invokestatic wrote: | I've heard something similar in response to Copilot in another | thread (something like offering a sum of money to Github if | they train their model exclusively on the Windows NT source | code). But I think the legal theory here is that Copilot is | trained on many thousands of sources. If Copilot was trained on | a single source, or even a small handful of sources, the | derivative work claim becomes much stronger. When trained on | many sources, it becomes much harder to claim that it's a | derivative of another work. | | Take for example a human. If I studied a bunch of different | open-source projects, learned techniques from them, and | implemented them in my own projects, is that a derivative work? | Probably not. But if I were to reverse engineer Windows and | implement the techniques I saw in ReactOS, that's where it | seems issues start to arise. | pedrocr wrote: | So I just need to decompile Oracle's database and a few other | commercial products as well and I'm good? Is Microsoft legal | happy if I do Windows+Office+OpenSourceCorpus? I'd take that | statement as well. Or even if they just do that themselves | and train Copilot on their internal source code just as they | do with the public open-source corpus. That would be a strong | statement as well. | breischl wrote: | >If I studied a bunch of different open-source projects, | learned techniques from them, and implemented them in my own | projects, is that a derivative work? Probably not. | | That's pretty unclear actually. If it's quite close to the | original work, it is derivative.
Even though you have | probably been "trained" on quite a few different codebases | over the years. Hence the existence of clean-room | implementations, wherein the people building a new | implementation have never seen the original. | | Also, given that code that has been passed through a | biological network (i.e., a brain) can constitute infringement, | it seems obvious that code passed through a mechanical one | could too. Maybe not in every case, but it certainly seems | plausible. | erhk wrote: | Well, I could certainly construct a data set where Windows NT | is an atomic outlier and muddy the water with arbitrary | inputs to satisfy your irrelevant requirement. Perhaps I'll jam | some pictures of cows in, or any number of animal photos. | Maybe even some classical literature. Hell, maybe I even just | jam a shit ton of Javascript in. That's code, right? | heavyset_go wrote: | You're taking the machine learning metaphor literally. | Training an ML model is not the same thing as a human being | learning off of material. | | A human being can understand abstract concepts and reason | about them based on material they learn from. An ML model is | a statistical model that is closer to compilation or lossy | encoding or compression. | | Often, ML models can encode their training data verbatim in | the model itself, which is exactly what happened with Copilot | and this example[1]. | | [1] https://twitter.com/mitsuhiko/status/1410886329924194309 | klyrs wrote: | > When trained on many sources, it becomes much harder to | claim that it's a derivative of another work. | | Sounds good in theory, until it starts producing snippets | verbatim from uniquely-identifiable sources. | ghoward wrote: | I am doing something very similar: | https://twitter.com/GavinDHoward/status/1415380847537135620 . | We'll see if they answer.
| modeless wrote: | > If you look at the GitHub Terms of Service, no matter what | license you use, you give GitHub the right to host your code and | to use your code to improve their products and features | | Sure, that's fine if the author of the code chooses to upload it | to GitHub. But what if they don't, and then someone else does? If | I take an AGPL project that someone else wrote and upload it to | GitHub, does that grant GitHub the right to use the code "to | improve their products and features" which are closed source? I | don't have the right to relicense the code, and neither does | GitHub, so clearly not. | _ph_ wrote: | I think the discussions somewhat miss an important point. IANAL, but | I think if a young programmer reads a lot of source code on | GitHub, and based on this reading becomes a better programmer, | this is a fair use of copyrighted material and pretty much | independent of the license. If I read any book and learn the | corresponding language, this isn't a copyright violation of the | book either. A violation starts when I begin to quote from that book or | the programmer takes snippets from the programs that got read. | | The problem is, I don't think you can really claim that Copilot | learned to program. While some of the output seems to be | something new, most of the time it looks more like a | recomposition of learned fragments if not even longer pieces of | verbatim code taken from copyrighted material. We have seen | examples of this. And in this moment, it becomes a copyright | discussion, probably determined by the volume of copyrighted | material reproduced. Which by the way is always the risk if a | human uses certain training material. The better one is at | memorizing things, the more there is the risk. | | Or put it the other way around: if Copilot would use its | "knowledge" of programs to advise the programmer like pointing | out potential errors without reproducing anything it used for | learning, it should be fine.
But that is not how it works. | jozvolskyef wrote: | If a person who's never seen a goat looks at a million | copyrighted images of goats and draws a goat, are they | committing copyright infringement? | | What if an algorithm does the same? The result is 'a | recomposition of learned fragments' in either case. | aaron695 wrote: | Do we have any proof that Copilot works? | | I assume it's a pile of rubbish that's currently fooling YouTube | hype-based programmers and followers. Has any OK but real | programmer used it solidly for a week yet and wanted to keep going? | | This is tied to the legal argument. | | If Copilot works (which I cannot believe it would) it changes | many legal points. Garbage spewing out copyrighted code is | different to something that 'understands' copyrighted code. | | And who cares about copyright if it's like all other hype-based | AI currently, unusable in the real world. All the current HN | seems to be bike shedding around legal. Does no one program | anymore? | Zambyte wrote: | Google Books is a great parallel to make with Microsoft's | Copilot. The key differences between the two are | | A) Google Books produces verbatim results 100% of the time, while | Microsoft's Copilot produces verbatim results some N > 0% of the | time (with some % of results greater than N that would be | considered a derivative if a human wrote it), and | | B) Google Books doesn't make the claim that you own the copyright | to any greater than 0% of the search results, while Microsoft's | Copilot makes the claim that you own the copyright to 100% of the | results. | rjzzleep wrote: | If you copy a quote from Google Books you still have to | attribute the original author. It's not magically your text | just because it was hosted on Google Books. Why do they even | compare these two? | | You can compare Github itself with Google Books, but not | copilot. | ghaff wrote: | >If you copy a quote from Google Books you still have to | attribute the original author.
| | Mostly because if you don't do so, that's a plagiarism issue | which the law mostly doesn't concern itself with except | insofar as an attributed quote, unless the length is truly | excessive, is likely to be seen as Fair Use while an | unattributed quote, especially if it's more than a minimal | snip, is not. (IANAL, etc.) | Animats wrote: | If you want to kill GitHub CoPilot, start posting useful snippets | of code which contain security backdoors, and wait for CoPilot to | put them in something. | swhalen wrote: | Does the fair use exemption (or an equivalent) exist in all | countries? | xdennis wrote: | Obviously not. Neither does DMCA, but that doesn't stop the USA | from enforcing it worldwide. ___________________________________________________________________ (page generated 2021-07-15 23:02 UTC)