[HN Gopher] GitHub co-pilot as open source code laundering?
___________________________________________________________________

GitHub co-pilot as open source code laundering?

Author : agomez314
Score  : 859 points
Date   : 2021-06-30 12:00 UTC (11 hours ago)

(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)

| pabs3 wrote:
| There isn't that much enforcement of open source license
| violations anyway. I bet there are lots of places where open
| source code gets taken, copyright/license headers stripped off,
| and the code used in something proprietary, as well as the
| bog-standard "not releasing code for modified versions of Linux"
| violation.
| cblconfederate wrote:
| That's like saying that making a blurry, shaky copy of Star Wars
| is not derivative but original work. Thing is, the 'verbatimness'
| of the generated code is positively correlated with the number of
| parameters they used to train their model.
| joshsyn wrote:
| people worrying about AI. The AI is still shit. lol
| Miner49er wrote:
| Microsoft should just GPL CoPilot's code and model. They won't,
| but it would fix this problem, I think.
| jordemort wrote:
| ...unless they've also ingested code that is incompatible with
| the GPL and CoPilot ends up regurgitating a mix.
| afarviral wrote:
| While I think this will continue to amplify current problems
| around IP, aren't current applied-ML approaches to writing
| software the equivalent of automating the drawing of leaves on a
| tree? Maybe a few small branches? But the whole tree, all its
| roots, how it fits in to the surrounding landscape, the overall
| composition, the intention? If I'm wrong about that then I picked
| either a good or a bad time to finally learn programming. There's
| only so many ways you can do things in each language though. Just
| like in the field of music, only so many "original" tunes. The
| concept of IP is incoherent: you don't own patterns (at least not
| at arbitrary depth), though you may be owed some form of
| compensation for the billions made off discovering them.
| visarga wrote:
| You're right, it's only drawing some leaves; the whole tree, or
| how it relates to the forest, is another thing.
| tsjq wrote:
| Microsoft: embrace, extend, extinguish.
| karmasimida wrote:
| Well, this would not be hard to verify.
|
| You can automate the process by feeding CoPilot existing GPL
| source code and seeing what it comes up with next.
|
| I am sure at some point it WILL produce exactly the same code
| snippet as some GPL project, provided you attempt enough times.
|
| Not sure what the legal interpretation would be though; it is
| pretty gray-ish in that regard.
|
| There would always be risk for CoPilot: if it digested certain
| PII and people found out... it would be much more interesting to
| see the outcome.
| Enhex wrote:
| it doesn't have to be exact to be copyright infringement, see
| non-literal copying. the basic idea behind it is that if you
| copy-paste code and rename variables, that doesn't mean it's
| new code.
| freshhawk wrote:
| Yeah, you'd have to assume they are parsing and normalizing
| this data in some way. There would still be some AST patterns
| or something similar you could look for in the same way, but
| it would be much trickier.
|
| Plus, considering this is a legal issue... good luck with
| "there is a statistically significant similarity in AST
| outputs related to the most unique sections of this code
| base" type arguments in court. We're currently at the "what's
| an API" stage of legal tech understanding.
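| [A sketch of the check karmasimida and freshhawk describe:
| prompt the model with a known GPL file and compare its
| suggestion against the real continuation after canonicalizing
| identifiers, so that copies with renamed variables still
| match. Python standard library only; illustrative, not a
| legal test, and names like `gpl_code` are invented for the
| example:]
|
|       import ast
|
|       class _Canon(ast.NodeTransformer):
|           # Rename every identifier to a positional placeholder so
|           # two snippets that differ only in names compare equal.
|           def __init__(self):
|               self.names = {}
|           def _rename(self, name):
|               return self.names.setdefault(name, "_v%d" % len(self.names))
|           def visit_Name(self, node):
|               node.id = self._rename(node.id)
|               return node
|           def visit_arg(self, node):
|               node.arg = self._rename(node.arg)
|               return node
|           def visit_FunctionDef(self, node):
|               node.name = self._rename(node.name)
|               self.generic_visit(node)
|               return node
|
|       def fingerprint(src):
|           # Structural fingerprint: dump of the identifier-normalized AST.
|           return ast.dump(_Canon().visit(ast.parse(src)))
|
|       gpl_code   = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
|       suggestion = "def sum_all(v):\n    acc = 0\n    for i in v:\n        acc += i\n    return acc"
|       print(fingerprint(gpl_code) == fingerprint(suggestion))  # True: same code modulo renaming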
We're currently at the "what's | an API" stage of legal tech understanding. | int_19h wrote: | The real question is whether it constitutes _derived work_ | , though. And that is not a question of similarity so much | so as provenance - if you start with a codebase that is GPL | originally, and it gets gradually modified to the point | where it doesn't really look anything like the original, | it's still a derived work, and is still subject to the | license. | | Similarity can be used to prove derivation, but it's not | the only way to do so. In this case, all the code that went | into the model is (presumably) known, so you don't really | need any sort of analysis to prove or disprove it. It is, | rather, a legal question - whether the definition on the | books applies here, or not. | bencollier49 wrote: | This question about the amount of code required to be | copyrightable starts to sound familiar to the copyright | situation with music, where currently the bar seems to be set | too low, legally, to prove plagiarism. | bencollier49 wrote: | Regarding PII, I think you have a very good point. I wouldn't | be surprised to see working AWS_SECRET_KEY values appear in | there. Indeed, given that copypaste programmers may not | understand the code they're given, it's entirely possible that | someone may run code which uses remote resources without the | programmer even realising it. | falcolas wrote: | As per some of the other twitter replies, Co-pilot has offered | to fill in the GPL disclaimer in new files. | mtnGoat wrote: | not a fan of this argument. | | musicians, artists, all kinds of athletes, all grow by watching | observing and learning from others. as if all these open source | projects got to where they are without looking at how others did | things. | | i don't think a single function, similar syntax or basic check | function is worth arguing about, its not like co-pilot is | stealing an entire code base and just plopping it out by reading | your mind and knowing what you want. i know developers that have | certainly stolen code and implementation details from past | employers and that was just fine. | greyman wrote: | > github copilot was trained on open source code and the sum | total of everything it knows was drawn from that code. there is | no possible interpretation of "derivative" that does not include | this | | I don't understand the second sentence, i.e. where's the proof? | Dracophoenix wrote: | This goes into one of my favorite philosophical topics: John | Searle's Chinese Room. I won't go into it here, but the question | of whether an AI is actually learning how to code or simply | substituting information based on statistically common practices | (or if there really is a difference between either) is going to | be one hell of a problem for the next few decades as we start to | approach fine points of what AI is and how it could be defined. | | However, legally, the most recent Oracle vs. Google case has | already settled a major point: APIs don't violate copyright. And | as Github co-pilot is an API (A self-modifying one, but an API | nonetheless), Microsoft has a good defense. | | In the near-future, when we have AI-assisted reverse engineering | along with Github co-pilot, then, with enough obfuscation there's | nothing that can't be legally created or recreated on a computer, | proprietary or not. This is simultaneously free software's | greatest dream and worst nightmare. 
|
| Edit: changed Hilary Putnam to John Searle. Edit 2: spelling
| toyg wrote:
| _> However, legally, the most recent Oracle vs. Google case has
| already settled a major point: APIs don't violate copyright.
| And as GitHub co-pilot is an API (a self-modifying one, but an
| API nonetheless), Microsoft has a good defense._
|
| That's... a mind-bendingly bad take. Google took an API
| definition and duplicated it; Copilot is taking _general code_
| and (allegedly) duplicating it. This was not done in order to
| enable any sort of interoperability or compatibility.
|
| The "API defense" would apply if Copilot only produced
| API-related code, or (against CP) if someone reproduced the
| interfaces Copilot exposes to consumers.
|
| _> Microsoft has a good defense._
|
| MS has many good defenses (transformative work, GitHub
| agreements, etc. etc.), but this is not one of them.
| [deleted]
| cxr wrote:
| > the most recent Oracle vs. Google case has already settled a
| major point: APIs don't violate copyright. And as GitHub
| co-pilot is an API (a self-modifying one, but an API
| nonetheless), Microsoft has a good defense
|
| That's a wild misconstrual of what the courts actually ruled in
| Oracle v. Google.
|
| (And to the reader: don't take cues from people banging out
| poorly reasoned quasi-legal arguments in off-the-cuff
| comments.)
| Dracophoenix wrote:
| Straight from the horse's mouth [1]:
|
| pg. 2
|
| 'This case implicates two of the limits in the current
| Copyright Act. First, the Act provides that copyright
| protection cannot extend to "any idea, procedure, process,
| system, method of operation, concept, principle, or discovery
| . . . ." 17 U. S. C. §102(b). Second, the Act provides that
| a copyright holder may not prevent another person from making
| a "fair use" of a copyrighted work. §107. Google's petition
| asks the Court to apply both provisions to the copying at
| issue here. To decide no more than is necessary to resolve
| this case, the Court assumes for argument's sake that the
| copied lines can be copyrighted, and focuses on whether
| Google's use of those lines was a "fair use."'
|
| "any idea, procedure, process, system, method of operation,
| concept, principle, or discovery" sounds suspiciously like an
| API. Continuing:
|
| pg. 3-4
|
| 'To determine whether Google's limited copying of the API
| here constitutes fair use, the Court examines the four
| guiding factors set forth in the Copyright Act's fair use
| provision...
|
| (1) The nature of the work at issue favors fair use. The
| copied lines of code are part of a "user interface" that
| provides a way for programmers to access prewritten computer
| code through the use of simple commands. As a result, this
| code is different from many other types of code, such as the
| code that actually instructs the computer to execute a task.
| As part of an interface, the copied lines are inherently
| bound together with uncopyrightable ideas (the overall
| organization of the API) and the creation of new creative
| expression (the code independently written by Google)...
|
| (2) The inquiry into "the purpose and character" of the
| use turns in large measure on whether the copying at issue
| was "transformative," i.e., whether it "adds something new,
| with a further purpose or different character." Campbell, 510
| U. S., at 579. Google's limited copying of the API is a
| transformative use.
| Google copied only what was needed to allow programmers to
| work in a different computing environment without discarding
| a portion of a familiar programming language... The record
| demonstrates numerous ways in which reimplementing an
| interface can further the development of computer programs.
| Google's purpose was therefore consistent with that creative
| progress that is the basic constitutional objective of
| copyright itself.
|
| (3) Google copied approximately 11,500 lines of declaring
| code from the API, which amounts to virtually all the
| declaring code needed to call up hundreds of different tasks.
| Those 11,500 lines, however, are only 0.4 percent of the
| entire API at issue, which consists of 2.86 million total
| lines. In considering "the amount and substantiality of the
| portion used" in this case, the 11,500 lines of code should
| be viewed as one small part of the considerably greater
| whole. As part of an interface, the copied lines of code are
| inextricably bound to other lines of code that are accessed
| by programmers. Google copied these lines not because of
| their creativity or beauty but because they would allow
| programmers to bring their skills to a new smartphone
| computing environment. The "substantiality" factor will
| generally weigh in favor of fair use where, as here, the
| amount of copying was tethered to a valid, and
| transformative, purpose.
|
| (4) The fourth statutory factor focuses upon the "effect" of
| the copying in the "market for or value of the copyrighted
| work." §107(4). Here the record showed that Google's new
| smartphone platform is not a market substitute for Java SE.
| The record also showed that Java SE's copyright holder would
| benefit from the reimplementation of its interface into a
| different market. Finally, enforcing the copyright on these
| facts risks causing creativity-related harms to the public.
| When taken together, these considerations demonstrate that
| the fourth factor--market effects--also weighs in favor of
| fair use.
|
| 'The fact that computer programs are primarily functional
| makes it difficult to apply traditional copyright concepts in
| that technological world. Applying the principles of the
| Court's precedents and Congress' codification of the fair use
| doctrine to the distinct copyrighted work here, the Court
| concludes that Google's copying of the API to reimplement a
| user interface, taking only what was needed to allow users to
| put their accrued talents to work in a new and transformative
| program, constituted a fair use of that material as a matter
| of law. In reaching this result, the Court does not overturn
| or modify its earlier cases involving fair use.'
|
| [1]
| https://www.supremecourt.gov/opinions/20pdf/18-956_d18f.pdf
| salawat wrote:
| That's John Searle's thought experiment, actually. Hilary Putnam
| had some thoughts in reference to it, along the lines that a
| brain in a vat might think in a language similar to what we
| would speak, but the words of that language would necessarily
| encode different meanings due to the different experience of
| the external world and sensory isolation.
|
| https://plato.stanford.edu/entries/chinese-room/
| Dracophoenix wrote:
| Thanks for the correction. I made it known in my edit.
| AJ007 wrote:
| And this applies to everything, not just source code.
|
| I'm just presuming we have a future where you can consume
| unique content indefinitely.
| Such as: instead of binge-watching Star Trek on Netflix, you
| press play and new episodes are generated and played
| continuously, 24/7, and they are actually really good.
|
| Thus intellectual property becomes a commodity.
| Dracophoenix wrote:
| While headway has been made in photo algorithms like
| StyleGAN, GPT-3's scriptwriting, and AI voice replication, we
| aren't even close to having AI-generated stick cartoons or
| anime. At best, AI-generated Star Trek trained on old
| episodes would produce the live-action equivalent of limited
| animation; it would reuse the most-liked parts over and over
| again and rehash the same camerawork and lens focus that you
| got in the 60's and the 90's. There wouldn't be any new
| planets explored, no new species, no advances in
| cinematography, and certainly no self-insert character (in
| case you wanted to see a simulation of how you'd fare on the
| Enterprise). It wouldn't add anything new as far as I can
| see. Now if there were some way to recreate all the
| characters in photorealistic 3D with Unreal Engine, feed them
| a script, and use some form of intelligent creature and
| planet generation, you might get a little closer to creating
| a truly new episode.
| koonsolo wrote:
| Does this mean that when I read GPL code and learn from it, I
| cannot use these learnings in non-GPL code?
|
| I get that the derivative work might be more clear in an AI
| setting, but basically it boils down to the same thing.
| agomez314 wrote:
| Posting this due to the recent unveiling of GitHub Co-pilot and
| its intersection with the ethics of ML training-set data.
| 6gvONxR4sf7o wrote:
| > previous """AI""" generation has been trained on public text
| and photos, which are harder to make copyright claims on, but
| this is drawn from large bodies of work with very explicit
| court-tested licenses
|
| This seems pretty backwards to me. A GPL-licensed data point is
| more permissive than an unlicensed data point.
|
| That said, I'm glad that these data points do have explicit
| licenses that say "if you use this, you must do XYZ," so that
| it's clear that our large ML projects are going counter to
| creators' intent when they made them open.
|
| I'd love to start seeing licenses about use as training data.
| Then maybe we'd see more open access to these models that
| benefit from the openness of the web. I'd personally use
| licenses that say if you want to train on my work, you must
| publish the model. That goes for my code, my writing, and my
| photography.
|
| Anyways, GitHub is arguing that any use of publicly available
| data for training is fair use, but they also admit that it's all
| new and unprecedented, regarding training data.
| zeptonix wrote:
| The tone of the responses here is absurd. Guys, be grateful for
| some progress. Instead of having to retype boilerplate code,
| your productivity is now enhanced by having a system that can do
| it for you. This is primarily about reducing the need to re-type
| total boilerplate and/or copy/paste from Stack Overflow. If you
| were to let some of the people here run things, we'd never have
| any form of progress with anything, ever.
| joepie91_ wrote:
| > Instead of having to retype boilerplate code, your
| productivity is now enhanced by having a system that can do it
| for you
|
| We already invented something for that a couple of decades ago,
| and it's called a "library".
| And unlike this thing, libraries don't launder appropriation
| of the public commons with total disregard for those who have
| actually _built_ that commons.
| qayxc wrote:
| Questions like this go much deeper and illustrate issues that
| need to be addressed before the technology becomes standard and
| widely adopted.
|
| It's not about progress or suppressing it; it's a fundamental
| question about whether it is OK for huge companies to profit
| from the work of others without so much as giving credit, and
| whether using AI this way represents an instance of doing so.
|
| The latter aspect goes beyond productivity or licensing - the
| OP asserts that the AI isn't equivalent to a student who learned
| from examples how to perform a task, but rather replicates
| (recalls) or reproduces the works of others (e.g. the training
| material).
|
| It's a question that goes beyond this particular application:
| what about GAN-based generators? Do they merely reproduce
| slight variations of the training material? If so, wouldn't the
| authors of the training material have some kind of intellectual
| property rights to the generated works?
|
| This doesn't just concern code snippets; it's a general
| question about AI, crediting creators, and circumventing
| licensing and intellectual property rights.
| fartcannon wrote:
| To me, this is similar to all these big orgs making money off
| our data. They should be paying us to profit off our minds.
| kingsuper20 wrote:
| I was just musing about whether this kind of tool has been
| written (or is being written) for music composition, business
| letter writing, poetry, news copy.
|
| Interesting copyright issues.
|
| Anyone who thinks their profession will continue as-is for the
| long term is probably mistaken.
| sipos wrote:
| So, I can't see how they can argue that the generated code is
| not a derivative of at least some of the code it was trained on,
| and therefore encumbered by copyright claims that are
| complicated and, for anyone other than GitHub, impossible to
| disentangle. If they haven't even been careful to only use
| software under one license that does not require the original
| author to be attributed, then I don't see how it can even be
| legal for them to be running the service.
|
| All that said, I'm not confident that anyone will stop them in
| court anyway. This hasn't tended to be very easy when companies
| infringe other open source code copyright terms.
|
| Until it is cleared up, though, it would seem extremely unwise
| for anyone to use any code from it.
| kklisura wrote:
| Should we be changing our open source licenses to explicitly
| prevent training such systems on our code?
| onli wrote:
| I'd assume this: in the same way as you cannot forbid a human
| to learn concepts from your code, you cannot forbid an
| automated system to learn concepts from your code, regardless
| of the license. Also, if you could, it would make your code
| non-free.
|
| At least as long as the system really learns concepts. If it
| just copies and pastes code, then that's a different story
| (same as with humans).
| bencollier49 wrote:
| Good idea, but if carved up into small enough chunks, it may be
| considered fair use.
|
| What is confusing is that the neural net may take lots of small
| chunks and link them to one another, and then reproduce them in
| the same order verbatim.
| falcolas wrote:
| One of the examples pointed out in the reply threads was the
| suggestion, in a new file, to insert the GPL disclaimer header.
|
| So the length of the samples being drawn is not necessarily
| small: the chunk size is based on its commonality. It could
| easily be long enough to trigger a copyright violation.
| svaha1728 wrote:
| With music sampling, copyright protects down to the sound of
| a kick drum. No doubt Microsoft has a good set of attorneys
| working on their arguments as we speak.
| joepie91_ wrote:
| That would be a legal no-op. Either their use _is_ covered by
| copyright and they are violating your license, or it _isn't_
| covered by copyright and then any constraints that your license
| sets are meaningless.
|
| Licenses hold no power outside of that granted to them by things
| being copyrighted by default.
| slim wrote:
| Why forbid? Just use the GPL and extend the contagion to the
| code trained on your code.
| k__ wrote:
| I don't think so.
|
| The code that was already used for training should be
| problematic for them, not only new code in the future.
| 6510 wrote:
| Time to make closed source illegal.
| naikrovek wrote:
| I don't see the point of this tool, independent of the resulting
| code being derivative of GPL code or not.
|
| being able to produce valid code is not the bottleneck of any
| developer effort. no projects fail because code can't be typed
| quickly enough.
|
| the bottleneck is understanding how the code works, how to design
| things correctly, how to make changes in accordance with the
| existing design, how to troubleshoot existing code, etc.
|
| this tool doesn't make anything any easier! it makes things
| harder, because now you have running software that was written by
| no one and is understood by no one.
| mslm wrote:
| Have to fully agree; it just seems like a "cool" tool where, if
| you had to actually use it for real-world projects, it's going
| to slow you down significantly, and you'll only admit it once
| the honeymoon period is over.
| fckthisguy wrote:
| Whilst I absolutely agree that writing code fast enough isn't
| the bottleneck, it's always nice to have tools that reduce
| repeat code writing.
|
| I use the React plugin for WebStorm to avoid having to write
| the boilerplate for FCs. Maybe in the future Copilot will
| replace that usage.
| ImprobableTruth wrote:
| To me that - and really any form of common boilerplate - is
| just evidence that we're lacking abstractions. If your editor
| is generating code for you, that means that the 'real'
| programming language you're using 'in your head' has some
| metaprogramming facilities emulated by your IDE.
|
| I think we should strive to improve our programming languages
| to make less of this boilerplate necessary, not to make
| generating boilerplate easier. The latter is just going to
| make software less and less wieldy. Imagine the horror if,
| instead of (relatively) higher-level programming languages
| like C, we were all just using assembly with code generation.
| izgzhen wrote:
| It doesn't claim to solve the bottleneck either. On the
| contrary, it clearly states that its mission is to solve the
| easy parts better, so developers can focus on the truly
| challenging engineering problems you mention.
| uncomputation wrote:
| This reminds me of a startup pitch where it's always "oh, we
| take care of x so you don't have to," but the problem is now
| I just have _another_ thing to take care of. I cannot speak
| for people who use Copilot "fluently," but I know for every
| chunk of code it spat out I would need to read every line and
| make sure: "Is this right? Is the return type what I want?
| Will this loop terminate? Is 'scan' the right API? Is that
| string formatted properly? Can I optimize this?" etc. To me
| it's hardly "solving the easy parts," but rather putting the
| passenger's hands on the wheel.
| gotostatement wrote:
| Upvoted. I think the only good use case for this is
| spitting out ten lines of annoying boilerplate for
| commonly used APIs.
| izgzhen wrote:
| That is a valid use case, despite being small and
| incremental. I think it will still be helpful to some
| people.
| izgzhen wrote:
| The easy part is the copy-paste-from-SO part ;)
| bobsomers wrote:
| Completely agree. If anything, I see tools like this actually
| decreasing engineering speed. I don't see how it doesn't lead
| to shipping large quantities of code the team didn't vet
| carefully, which is a recipe for subtle and hard-to-find bugs.
| Those kinds of bugs are much more expensive to find and
| squash.
|
| What we really need aren't tools that help us write code
| faster, but tools that help us understand the design of our
| systems and the interaction complexity of that design.
| pjfin123 wrote:
| I think this would fall under any reasonable definition of fair
| use. If I read GPL (or proprietary) code as a human, I still own
| code that I later write. If copyright were enforced on the
| outputs of machine learning models based on all the content they
| were trained on, it would be incredibly stifling to innovation.
| Requiring legal access to data for training but granting full
| ownership of the output seems like a sensible middle ground.
|
| (Reposting my comment from yesterday)
| sanderjd wrote:
| Reposting a summary of my reply: if you memorize a line of code
| and then write it down somewhere else without attribution, that
| is not fair use; you copied that line of code. If this model
| does the same, it is the same.
| rbarrois wrote:
| An interesting impact of this discussion, for me: within my
| team at work, we're likely to forbid any use of GitHub co-pilot
| for our codebase unless we can get a formal guarantee from
| GitHub that the generated code is actually valid for us to use.
|
| By the way, code generated by GitHub co-pilot is likely
| incompatible with Microsoft's Contribution License Agreement
| [1]: "You represent that each of Your Submission is entirely
| Your original work".
|
| This means that, for most open-source projects, code generated
| by GitHub co-pilot is, right now, NOT acceptable in the project.
|
| [1] https://opensource.microsoft.com/pdf/microsoft-
| contribution-...
| CharlesW wrote:
| > _This means that, for most open-source projects, code
| generated by GitHub co-pilot is, right now, NOT acceptable in
| the project._
|
| For this scenario, how is using Co-Pilot-generated code
| different from using code based on sample code, Stack Overflow
| answers, etc.?
| rbarrois wrote:
| I'd say that it depends on the license; for Stack Overflow,
| it's CC BY-SA 4.0 [1]. For sample code, that would depend on
| the license of the original documentation.
|
| My point is: when I'm copying code from a source with an
| explicit license, I know whether I'm allowed to copy it. If I
| pick code from co-pilot, I have no idea (until tested by law
| in my jurisdiction) whether said code is public domain, AGPL,
| proprietary, or infringing on some company's copyright.
|
| [1] https://stackoverflow.com/legal/terms-of-
| service#licensing
| CharlesW wrote:
| That makes sense, thank you.
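| [On ImprobableTruth's point upthread that generated
| boilerplate is a sign of a missing language abstraction: a
| small Python illustration, where the dataclass decorator
| absorbs exactly the __init__/__repr__/__eq__ boilerplate an
| autocomplete tool would otherwise re-emit. The class names
| here are invented for the example:]
|
|       from dataclasses import dataclass
|
|       # Hand-rolled boilerplate that a completion tool happily regenerates:
|       class PointManual:
|           def __init__(self, x, y):
|               self.x = x
|               self.y = y
|           def __repr__(self):
|               return f"PointManual(x={self.x!r}, y={self.y!r})"
|           def __eq__(self, other):
|               return (self.x, self.y) == (other.x, other.y)
|
|       # The same contract expressed through a language-level abstraction:
|       @dataclass
|       class Point:
|           x: float
|           y: float
|
|       print(Point(1, 2) == Point(1, 2))  # True; generated by the language, not the editor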
| gwenzek wrote:
| A number of companies, including Google and probably
| Microsoft, forbid copying code from Stack Overflow because
| there is no explicit license.
| CharlesW wrote:
| TIL, thank you!
| gdsdfe wrote:
| How would you know if Copilot was used or not?!
| LeicaLatte wrote:
| Our software has violated the world and people's lives, legally
| and illegally, in many instances. I mean, none of us cared when
| GPT-3 did the same for text on the internet. :)
|
| Reminder: software engineers, our code and GPLs are not special.
| 29athrowaway wrote:
| If I recall correctly, it has already been determined that using
| proprietary data to train a machine learning system is not a
| violation of intellectual property.
| nemoniac wrote:
| So, as I understand it, the AGPL was introduced to cover an
| unforeseen loophole in the GPL: that adapted code could be used
| to power a web service. Could another new version of the license
| block the use of code to train Copilot-like models?
| corobo wrote:
| If I, as an alleged human, have learned purely from GPL code,
| would that require code I write to be released under the GPL
| too?
|
| We should probably start thinking about AI rights at some point.
| Personally I'll be crediting GPT-3 like any other contributor
| because it sounds cool, but maybe for moral reasons too in
| future.
| notkaiho wrote:
| Unless you were using structures directly from said code,
| probably not?
|
| Compare: if you had only learned writing from, say, the Bible,
| you would probably write in a very Biblical manner, but would
| you write the Psalms exactly? Most likely not.
| edenhyacinth wrote:
| We have seen Co-Pilot directly output
| (https://docs.github.com/en/github/copilot/research-
| recitatio...) the Zen of Python when prompted - there's no
| reason it wouldn't write the Psalms exactly when prompted in
| the right manner.
| disgruntledphd2 wrote:
| That's super cool. As long as you do the things they specify
| at the bottom of that doc (provide attribution if copied, so
| people can know if it's OK to use), then a lot of the
| concerns of people on these threads are going to be
| resolved.
| edenhyacinth wrote:
| Pretty much! There are only three major fears remaining:
|
| * Co-pilot fails to detect it, and you have a potential
| lawsuit/ethical concern when someone finds out. Although
| the devil on my shoulder says that if Co-pilot didn't
| detect it, what's to say another tool will?
|
| * Co-pilot reuses code in a way that still violates
| copyright but is difficult to detect. I.e., if you
| checked via a syntax tree, you'd notice that the code was
| the same, but if you looked at it as raw text, you
| wouldn't.
|
| * Purely ethical - is it right to take licensed code and
| condense it into a product without having to take into
| account the wishes of the original creators? It might be
| treated as normal that other coders will read it and
| pick up on it, but when these licenses were written, no
| one saw products like this coming about. They never
| assumed that a single person could read all their code,
| memorise it, and quote it near-verbatim on command.
| disgruntledphd2 wrote:
| > Purely ethical - is it right to take licensed code and
| condense it into a product without having to take into
| account the wishes of the original creators? It might be
| treated as normal that other coders will read it and
| pick up on it, but when these licenses were written, no
| one saw products like this coming about.
| They never assumed that a single person could read all their
| code, memorise it, and quote it near-verbatim on command.
|
| It's gonna be really interesting to see how this plays out.
| corobo wrote:
| I've not seen Copilot in action yet; I was under the
| impression it doesn't use code directly.
|
| In any case, my original question was answered by the tweeter
| in a later tweet I missed:
| https://twitter.com/eevee/status/1410049195067674625
|
| I get where they're coming from, but they are kinda just
| handwaving it back the other way with the "u fell for
| marketing idiot" vibe. I wish someone smarter than me could
| simplify the legal ramifications around this, but we'll
| probably have to wait till it kills someone (or at least
| costs someone a bunch of money) to get any actual laws set
| up.
| lucideer wrote:
| Your question had already been preempted in the OP.
| Specifically:
|
| > _"but eevee, humans also learn by reading open source code,
| so isn't that the same thing"_
|
| > _- no_
|
| > _- humans are capable of abstract understanding and have a
| breadth of other knowledge to draw from_
|
| > _- statistical models do not_
|
| > _- you have fallen for marketing_
|
| -- https://twitter.com/eevee/status/1410049195067674625
| corobo wrote:
| I preemptively commented that I'd seen that tweet, three hours
| before your comment, figuring someone was going to quote it at
| me haha
|
| Preemptive, doesn't work as it turns out :)
|
| https://news.ycombinator.com/item?id=27687586
| lucideer wrote:
| Nice catch
| pyentropy wrote:
| That's what I wanted to ask: where do we draw the line of
| copyright when it comes to inputs of generative ML?
|
| It's perfectly fine for me to develop programming skills by
| reading _any code_, regardless of the license. When a corp
| snatches an employee from competitors, they get to keep their
| skills even if they signed an NDA and can't talk about what
| they worked on. On the other hand, there's the non-compete
| agreement, where you can't. Good luck making a non-compete
| agreement with a neural network.
|
| Even if someone feeds stolen or illegal data as an input
| dataset to gain advantage in ML, how do we even prove it if
| we're only given the trained model and it generalizes well?
| vsareto wrote:
| > how do we even prove it if we're only given the trained
| model and it generalizes well?
|
| Someone's going to have to audit the model, the training, and
| the data that goes into it. There's a documentary on black
| holes on Netflix that did something similar (no idea if it
| was AI), where each team wrote code to interpret the data
| independently, without collaboration or hints or information
| leakage, and they were all within a certain accuracy of one
| another for interpreting the raw data at the end of it.
|
| So, as an example, if I can't train something in parallel and
| get similar results to an already-trained model, we know
| something is up and there is missing or altered data (at
| least I think that's how it works).
| hliyan wrote:
| Copyright is going to get very muddy in the next few decades.
| ML systems may be able to generate entire novels in the
| styles of books they have digested, with only some assistance
| from human editors. The same is true of artwork and music,
| and perhaps eventually video too. Determining "similarity,"
| too, may soon have to be taken off the hands of the judge and
| given to another ML system.
| agomez314 wrote:
| Take it further.
| You could easily imagine taking a service like this, putting
| it behind a front-end as invisible middleware, and asking
| users to pay for the service. Some could argue the code
| generation is attributable to those who created the model,
| but the reality is that the models were trained on code
| written by thousands of passionate users, at no pay, with the
| intent of free usage.
| bogwog wrote:
| > but the reality is that the models were trained on code
| written by thousands of passionate users, at no pay, with
| the intent of free usage.
|
| I hope you're actually reading those LICENSE files before
| using open source code in your projects.
| rhn_mk1 wrote:
| > It's perfectly fine for me to develop programming skills by
| reading any code regardless of the license.
|
| I'd be inclined to agree with this, but whenever a high-profile
| leak of source code happens, reading that code can have dire
| consequences for reverse engineers. It turns clean-room reverse
| engineering into something derivative, as if the code that was
| read had the ability to infect whatever the programmer wrote
| later.
|
| A situation involving the above developed in the ReactOS
| project: https://en.wikipedia.org/wiki/ReactOS#Internal_audit
| kitsune_ wrote:
| I think you are missing the mark here with this comparison;
| Copilot and its network weights are already the derived work,
| not just the output it produces.
| wilde wrote:
| Possibly. We won't know until this is tested in court.
| Traditionally one would want to clean-room [1] this sort of
| thing. Co-pilot is... really dirty by those standards.
|
| [1] https://en.wikipedia.org/wiki/Clean_room_design
| edenhyacinth wrote:
| A machine learning model isn't really the same as a person
| learning - people can generally code at a high level without
| having first read TBs of code, nor can you reasonably expect a
| person to have memorised GPL code and to reproduce it on
| demand.
|
| What you can expect a person to do is understand the principles
| behind that GPL code and write something along the same lines.
| GitHub Co-Pilot is not a general AI, and it's not touted as
| one, so we shouldn't be considering whether it really _knows_
| code principles, only that it can reliably output code that
| fits a similar function to what came before, which could
| reasonably include entire blocks of GPL code.
| corobo wrote:
| Well, if it is actually straight-up outputting blocks of
| existing code, then get it in the bin as a failed attempt to
| sprinkle AI on development, and use this instead:
|
| https://github.com/drathier/stack-overflow-import
| schnebbau wrote:
| Newsflash everyone: if you open source your code, it's going to
| be copied or paraphrased anyway.
| nextaccountic wrote:
| It should be copied and paraphrased, but respecting the
| license. That means, among other things, crediting the author.
| schnebbau wrote:
| It may be hard to believe, but there are sick and twisted
| individuals in this dangerous world who copy from GitHub
| without even a single glance at the license, and they live
| among us.
| iKevinShah wrote:
| There are always exceptions (maybe they are even the norm
| in this case), but it's still not 100%, still not
| all-encompassing. This "AI" seems to be. I think that is
| the entire concern: ALL the code is affected, for all the
| instances.
| adn wrote:
| Yes, and those people are violating the licenses of the
| code when they do that. It's not unreasonable to expect a
| massive company like Microsoft not to do this on a massive
| scale.
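| [The detection question running through this sub-thread - how
| a tool, or a plaintiff, would mechanically flag "straight-up
| outputting blocks of existing code" - reduces to an overlap
| check like the one GitHub's recitation study gestures at. A
| minimal sketch, assuming whitespace tokenization and an
| in-memory corpus; a real system would use a proper lexer and
| index:]
|
|       from collections import defaultdict
|
|       K = 20  # minimum run of tokens treated as a verbatim lift
|
|       def windows(text, k=K):
|           toks = text.split()  # crude tokenizer, for illustration only
|           return (" ".join(toks[i:i + k]) for i in range(len(toks) - k + 1))
|
|       def build_index(corpus):
|           # corpus: {origin: source_text} for the licensed training files
|           index = defaultdict(set)
|           for origin, src in corpus.items():
|               for w in windows(src):
|                   index[w].add(origin)
|           return index
|
|       def recited_from(suggestion, index):
|           # Origins sharing at least one K-token window with the
|           # suggestion; non-empty means "quoted - go check the license".
|           hits = set()
|           for w in windows(suggestion):
|               hits |= index.get(w, set())
|           return hits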
| diffeomorphism wrote:
| What does that have to do with the topic? The question is not
| whether it gets copied; the question is whether it gets
| pirated.
| postalrat wrote:
| I think the issue many people have with this is that it's a
| proprietary tool that profits from work it was not licensed to
| use this way.
| GuB-42 wrote:
| Yes, that's the point.
|
| But if I do it under a copyleft license like the GPL, I expect
| those who copy to abide by the license and open source their
| own code too.
|
| But sure, people shit on IP rights all the time, and I am
| guilty of it too. Let's say I didn't pay what I should have
| paid for every piece of software I have used.
| Closi wrote:
| "About 0.1% of the snippets are verbatim"
|
| This implies that by just changing the variable names, the
| snippets are classed as non-verbatim.
|
| I don't buy that this number is anywhere close to the actual
| figure if you assume that you can't just change function names
| and variable names and suddenly say you have escaped both the
| legality and the spirit of the GPL.
| jordemort wrote:
| What happens when someone puts code up on GitHub with a license
| that says "This code may not be used for training a code
| generation model"?
|
| - Is GitHub actually going to pay any attention to that, or are
| they just going to ingest the code and thus violate its license
| anyway?
|
| - If they go ahead and violate the code's license, what are the
| legal repercussions for the resulting model? Can a model be
| "un-trained" from a particular piece of code, or would the
| whole thing need to be thrown out?
| vbezhenar wrote:
| I expect them to check the /LICENSE file and, if it deviates
| from a standard open source license, skip that repository.
| jordemort wrote:
| They haven't made any public statements about whether they're
| looking at LICENSE or not; I'd sure appreciate it if they did!
| anfelor wrote:
| It seems they don't do that. In the footnotes of
| https://docs.github.com/en/github/copilot/research-
| recitatio... they mention two repositories from the training
| set, neither of which specifies a license.
| cxr wrote:
| The existence of a LICENSE file is neither necessary nor
| sufficient to determine the terms that apply to a given work.
| diffeomorphism wrote:
| Why not? If it does not exist, you treat the work as
| proprietary (copyrighted by default); if it does exist, at
| least the author claims that the given license is an option
| (possibly their mistake, not mine).
| junon wrote:
| Because individual source files might have license
| headers that override the root license file in the
| repository.
| all_rights_rsvd wrote:
| I post my code publicly, but with an "all rights reserved"
| licence. I don't mind everyone reading my code freely, but you
| can't use it for anything but learning. If I found out they
| were ingesting my code I would be angry. It's like training
| your replacement. I don't use GitHub anyway, but now I'll
| definitely never even think about it.
| toyg wrote:
| Technically then I'm infringing as soon as I clone your repo,
| possibly even as soon as a webserver sends your files to my
| browser.
|
| "All rights reserved" makes sense on final items, like books
| or physical records, that require no copy or change after
| owner-approved manufacturing has taken place. It doesn't
| really make sense on digital artefacts.
| all_rights_rsvd wrote:
| So don't clone it; read it online. I reserve all rights,
| but I do give my host a license to make a "copy" to let
| you view it.
| I do that specifically to prevent non-biological entities
| like corporations or AI from using my code. If you're a
| biological entity, I specify that you can email me to get a
| license for my code for a specific, defined purpose. I have a
| conversation with that person, then I send them a record
| number and the terms of my license for them, in which I grant
| some rights which I had reserved.
|
| Also, in your example, the copyright for the book or DVD
| covers the content, not the physical item. You can do
| anything you want with that item, but not the content. My
| code is similar: I'm licensing my provider to serve you a
| visual representation of the files so you can experience the
| content, not giving you a license to run that code or use it
| otherwise.
| [deleted]
| uchiga wrote:
| If it is public, it's no longer your code. It's AI training
| material.
| cortesoft wrote:
| Also, how would you know if your code was included in the
| training or not?
|
| Then, let's say the AI generates some new code for someone, and
| it is nearly identical to some bit of code that you wrote in
| your project.
|
| If they didn't use your code in the model, then the generated
| code is clearly not a copyright violation, since it was
| effectively a "clean room" recreation.
|
| If your code was included in the model, is it therefore a
| violation?
|
| But then again, it comes down to: how can someone prove their
| code was included or not?
|
| What if the creators don't even know? If you wrote your model
| to, say, randomly grab 50% of all public repos to use in the
| model, then no one would know if a specific repo was used in
| the training.
| invokestatic wrote:
| By uploading your content to GitHub, you've granted them a
| license to use that content to "improve the Service over time",
| as specified in the ToS [1].
|
| That effectively "overrides" any license or term that you've
| specified for your repository, since you've already licensed
| the content to GitHub under different terms. Of course, people
| who are not GitHub are beholden to the terms you specify.
|
| [1] https://docs.github.com/en/github/site-policy/github-
| terms-o...
| nitwit005 wrote:
| I rather suspect judges would not see "improving the Service
| over time" as permission to create derivative works without
| compensation.
|
| The person uploading files to GitHub is also not necessarily
| doing so with permission from the rights holder, which might
| be a violation of the terms of service, but would mean
| there's no agreement in place.
| diffeomorphism wrote:
| How is this different from uploading a Hollywood movie to
| YouTube? Just because there is a passage in the terms saying
| that the uploader grants them those rights does not mean the
| uploader actually had the power to do that.
| jcranmer wrote:
| You can't give GitHub or YouTube or anybody else copyright
| rights if you don't have them in the first place. This is
| what ultimately torpedoed the "Happy Birthday" copyright
| claims: while it's pretty undisputed that the Hill sisters
| gave their copyright to (ultimately) Warner/Chappell, it
| was the case that they actually _didn't_ invent the
| lyrics, and thus Warner/Chappell had no copyright over the
| lyrics.
|
| So if someone uploads a Hollywood movie to YouTube, YouTube
| doesn't get the rights to play that movie from them,
| because they didn't have the rights in the first place.
| Of course, if the actual copyright owner uploads it, it's now
| permissible for YouTube to play it, even if it's the copy
| that someone else provided. [This has torpedoed a few
| file-sharing lawsuits.]
| macinjosh wrote:
| Not sure how much it would matter, but the main difference I
| see is that if I upload my own code to GitHub I have the
| ability to give away the IP, but if I upload Avengers:
| Endgame to YouTube I don't have the right to give that away.
| makeitdouble wrote:
| I wonder how it would work if we consider that you flagged
| your code as GPL before it hit GitHub.
|
| We could end up in the same situation as the Hollywood
| movie, even if you are also the one who set the original
| license on the work. Basically, you have the right to
| change the license, but it doesn't mean you did.
| im3w1l wrote:
| A very plausible scenario: Alice creates a GPL project. Bob
| forks it and uploads it to GitHub. Bob does not have the
| right to relicense Alice's parts.
| Hamuko wrote:
| I sort of doubt that GitHub could include GPL code in a piece
| of closed-source software that they distribute that "improves
| the service" and claim that this gives them the right.
| amelius wrote:
| > By uploading your content to GitHub, you've granted them a
| license to use that content to "improve the Service over
| time", as specified in the ToS.
|
| That's nonsense, because they could claim that for almost any
| reason.
|
| E.g. assume Google put the source code of Google Search on
| GitHub. Then GitHub copies that code and uses it in their own
| search, since that "improves the service". Would that be
| legal?
|
| It's like selling a pen and claiming the rights to anything
| written with it.
| invokestatic wrote:
| If the pen was sold with a contract that said the seller
| has the rights to anything written with it, then yes. These
| types of contracts are actually quite common; for example,
| an employment contract will almost certainly include an IP
| grant clause. Pretty much any website that hosts
| user-generated content has one as well. IANAL, but quite
| familiar with business law.
| joepie91_ wrote:
| > These types of contracts are actually quite common; for
| example, an employment contract will almost certainly
| include an IP grant clause.
|
| In the US, maybe. In most of the rest of the world, these
| sorts of overreaching "we own everything you do anywhere"
| clauses are decidedly illegal.
| lucideer wrote:
| The use of the definition _Your Content_ may make GitHub's
| own ToS legally invalid in a large number of cases, as it
| implies that the uploader must be the sole author and "owner"
| of the code being uploaded.
|
| From the definitions section in the same doc:
|
| > _"Your Content" is Content that you create or own._
|
| That will definitely exclude any mirrored open-source
| projects, any open-source project that has ever migrated to
| GitHub from another platform, and also many forked projects.
| carlosperate wrote:
| Good point. To me that explains why this is a GitHub product
| instead of a Microsoft (or VS Code) product.
| joeyh wrote:
| Anyone can upload someone else's freely licensed code to
| GitHub, without giving them such a license.
|
| I do not upload my code to GitHub or give them any special
| permissions, and I am confident my code was included in the
| model's corpus.
| jordemort wrote:
| I think, more specifically, the relevant bit is here:
| https://docs.github.com/en/github/site-policy/github-
| terms-o...
|
| > We need the legal right to do things like host Your
| Content, publish it, and share it. You grant us and our legal
| successors the right to store, archive, parse, and display
| Your Content, and make incidental copies, as necessary to
| provide the Service, including improving the Service over
| time. This license includes the right to do things like copy
| it to our database and make backups; show it to you and other
| users; parse it into a search index or otherwise analyze it
| on our servers; share it with other users; and perform it, in
| case Your Content is something like music or video.
|
| But, it goes on to say:
|
| > This license does not grant GitHub the right to sell Your
| Content. It also does not grant GitHub the right to otherwise
| distribute or use Your Content outside of our provision of
| the Service, except that as part of the right to archive Your
| Content, GitHub may permit our partners to store and archive
| Your Content in public repositories in connection with the
| GitHub Arctic Code Vault and GitHub Archive Program.
|
| I'm not a lawyer, but it seems ambiguous to me whether this
| ToS is sufficient to cover CoPilot's butt in corner cases; I
| bet at least one lawyer is going to make some money trying to
| answer the question.
| buu700 wrote:
| IANAL, but I wouldn't read that as granting GitHub the
| right to do anything like this. There's definitely a
| reasonable argument to be had here, but I think limiting
| the grant of rights to incidental copies should trump
| "[...] or otherwise analyze it on our servers" and what
| they're allowed to do with the results of that analysis.
|
| On the extreme end, "analysis" is so broad that it could
| arguably cover breaking down a file of code into its
| constituent methods and just saving the ASTs of those
| methods verbatim for Copilot to regurgitate. That's
| obviously not an acceptable outcome of these terms per se,
| but arguably isn't any different in principle from what
| they're already doing.
|
| Ultimately, as I understand it, courts tend to prefer a
| common-sense outcome based on a reasonable human
| understanding of the law, rather than an outcome that may be
| defensible through some arcane technical logic but is absurd
| on its face and counter to the intent of the law. If a party
| were harmed by an instance of Copilot-generated copyright
| infringement, I don't see a court siding with this tenuous
| interpretation of the ToS over the explicit terms of the
| source code license. On the other hand, it would probably
| also be impossible to prove damages without something like
| a case of verbatim reproduction, similarly to how having a
| developer move from working on proprietary code for one
| company to another isn't automatically copyright
| infringement.
|
| I doubt that GitHub is doing anything as blatantly
| malicious as copying snippets of (GPL or proprietary) code
| to explicitly reuse verbatim, but if they're learning from
| license-restricted code at all, then I don't see how they
| wouldn't be subjecting themselves and/or consumers of
| Copilot to the same risk.
| yaitsyaboi wrote:
| Wait, so does this mean a "private repo" is meaningless and
| GitHub can share any code in any repo with anyone?
| ipaddr wrote:
| That is not even the right question.
|
| Why are developers so myopic around big tech? Of course
| they can. Facebook can use your private photos. It's in
| their terms of service. Cloud providers have more
| generous terms.
|
| The response has always been that they won't do that because
| they have a reputation to manage. But the more they grow, the
| more they control the narrative, and the less this matters.
|
| Wait until you find out they sell your data or use your
| data to sell products.
|
| Why, in 2021, are we giving Microsoft all of our code? It
| seems like the '90s and 2000s never happened and we all
| trust Microsoft. They have a free editor and a free operating
| system that sends packets of user activity back to Microsoft,
| but that's okay... we want to help improve their products? We
| trust them.
| sandyarmstrong wrote:
| Why do you think people care so much about end-to-end
| encrypted messaging?
|
| Yes, the concept of a "private" repo is enforced only by
| GitHub's service. A bug in their auth code could lead to
| others having access. A warrant could lead to others
| having access. Etc.
| cercatrova wrote:
| Of course. A "private" repo is still on their servers.
| It's only private from other GitHub users, not the actual
| site administrators. This is the same on any website: of
| course the admins can see everything. If you truly want
| privacy, use your own git servers.
| ocdtrekkie wrote:
| Fun fact: every major cloud provider has a similar
| blanket term. For example, Google doesn't need to license
| music to use for promotional content, because YouTube's
| terms grant them a worldwide license to use uploaded
| content for purposes including promoting their services,
| and music labels can't afford to not be on YouTube. (It's
| probable that even uploading content to protect it, as in
| Content ID, would arguably cause this term to apply.)
|
| It all comes down to the nuance of whether the usage
| counts as part of protecting or improving (or promoting)
| their services, and what other terms are specified.
| vageli wrote:
| No.
|
| > GitHub may permit our partners to store and archive
| Your Content in public repositories in connection
| z3ncyberpunk wrote:
| Hey... want to take a guess why Microsoft lets you have
| unlimited free private repos after they bought GitHub? ;)
| notatoad wrote:
| yes, that's what that specific section means, but as
| always with these documents you can't just extract a
| single section; you need to take the document as a whole
| (and usually more than one document - the ToS and privacy
| policy are usually separate).
|
| these documents are structured as granting the service
| provider extremely broad rights, and then the rest of the
| document takes away portions of those rights. so in this
| case they claim the right to share any code in any repo
| with anyone, and then somewhere else they specify which
| code they won't share, and with whom they won't share it.
| 2OEH8eoCRo0 wrote:
| It's aggravating that there is no escape. If you host
| somewhere else it will be scraped. If you pay for the service
| it will be used.
| antattack wrote:
| That does not mean that you give them a license to your code.
| In fact, some or all of the code may not be yours to give in
| the first place.
| sipos wrote:
| Seems like a good reason to never use GitHub, and to encourage
| other people not to.
| rjp0008 wrote:
| I would bet this is as applicable as the Facebook posts of my
| parents' friends, something like, 'All my content on this page
| is mine alone and I expressly forbid Facebook Inc. usage of it
| for any purpose.'
| jordemort wrote:
| I'm not sure why it would be any less binding than any other
| license term, except for possibly the ToS loophole that
| invokestatic points out below.
| willseth wrote:
| It's not binding because the other party hasn't agreed. You
| agree to terms when you use the site. One party can't
| unilaterally change the agreement without the consent of the
| other party.
| jordemort wrote:
| I see where you're coming from, but it's not quite the
| same thing; Facebook doesn't encourage people to choose a
| license for the content that they post there, so there's
| no expectation that there are any terms aside from those
| in Facebook's ToS. OTOH, GitHub has historically very
| strongly encouraged users to add a LICENSE to their
| repositories, and also encouraged users to fork other
| people's code and push it to GitHub. That GitHub
| would be exempt from the licensing terms of the code
| pushed to it, except to the obvious minimal extent needed
| to provide their services, seems like an extremely
| surprising interpretation.
| Avamander wrote:
| Someone might have published a project I've contributed
| to on GitHub. There's no permission there.
| moolcool wrote:
| NO COPYRIGHT INTENDED
| mattdesl wrote:
| By submitting any textual content (GPL or otherwise) on the web,
| you are placing it in an environment where it will be consumed
| and digested (by human brains and machine learning algorithms
| alike). There is already legal precedent for this which allows
| its use in training machine learning algorithms, specifically
| involving heavily copyrighted material from books [1].
|
| This does not mean that any GitHub Co-Pilot-produced code is
| suddenly free of license or patent concerns. If it produces
| something that matches too closely GPL or otherwise licensed
| code for a particularly notable algorithm (such as a video
| encoder), you may still be in a difficult legal situation.
|
| You are in essence using "not-your-own-code" by relying on
| CoPilot, which introduces a risk that the code may not be
| patent/license free, and you should be aware of that risk if
| you are using this tool to develop commercial software.
|
| The main issue here is that many average developers may continue
| to stamp their libraries as MIT/BSD, even though the
| CoPilot-produced code may not adhere to that license. If the end
| result is that much of the OSS ecosystem becomes muddied and
| tainted, this could slowly erode trust in open licenses on
| GitHub (i.e. the implication would be that open source libraries
| could become less widely used in commercial applications).
|
| [1] https://towardsdatascience.com/the-most-important-supreme-
| co...
| akagusu wrote:
| For years people have warned about hosting the majority of the
| world's open source code on a proprietary platform that belongs
| to a for-profit company. These people were called lunatics,
| fundamentalists, radicals, conspiracy theorists, and many other
| names.
|
| Well, they were ignored, and this is the result: a for-profit
| company built a proprietary system using all the code hosted on
| its platform, without respecting the code licenses.
|
| There will be a lot of people saying this is not a license
| violation, but it is, and more: it is an exploitation of other
| people's work.
|
| Right now I'm asking myself when people will stop supporting
| this kind of company, which exploits people's work without
| giving anything back to people and society while making a huge
| amount of profit.
| sergiomattei wrote:
| If we feed the entirety of a library to an AI and have it
| generate new books, is it an exploitation of people's work?
| |
| | If we read a book and use its instructions to build a bicycle, is it an exploitation of people's work?
| |
| | No, no it's not.
| yunohn wrote:
| It's astonishing to me that HN+Twitter believe that GitHub designed this entire project without speaking to their legal team and confirming that training on GPL code would be possible.
| |
| | Mind-blowingly hilarious armchair criticism.
| darkerside wrote:
| The conclusion seems a bit unfair.
| |
| | > "but eevee, humans also learn by reading open source code, so isn't that the same thing" - no - humans are capable of abstract understanding and have a breadth of other knowledge to draw from - statistical models do not - you have fallen for marketing
| |
| | Machines will draw on other sources of knowledge besides the GPL code. Whether they have the capacity for "abstract thought" is probably up for debate. There's not much else said in those bullets. It's not a good argument.
| goodpoint wrote:
| What is more concerning is that the training kernel belongs exclusively to one private company: Microsoft.
| |
| | It can become a massive (and unfair) competitive advantage.
| |
| | Furthermore, Copilot will not work with less popular languages, and it may also prevent popular languages from evolving.
| bastardoperator wrote:
| Is this true? Looks like they're using the OpenAI Codex, which is set to be released soon:
| |
| | https://openai.com/
| giansegato wrote:
| This feature is effectively impossible to replicate. Only Microsoft positioned itself to have:
| |
| | - dataset (GitHub)
| | - tech (OpenAI)
| | - training (Azure)
| | - platform (VS Code)
| |
| | I'm impressed. They did an amazing job from a corporate strategy standpoint. Also, directionally, things are getting interesting.
| djrogers wrote:
| The dataset is all freely available open source code, right? Just because GH hosts it doesn't mean the rest of the world can't use it for the same purpose.
| handrous wrote:
| They'd find a way to keep it practically difficult to use, at the least, if that dataset is vital to the process. Hoarding datasets that should either be wholly public _or_ unavailable for any kind of exploitation is the _backbone_ of 21st-century big tech. It's how they make money, and how they maintain (very, very deep) moats against competition.
| |
| | [EDIT] actually, I suspect their play here will be to open up the public data but own the best and most low-friction implementation, then add terms that let them also feed their algo with _proprietary_ code built using their editors. That part won't be freely available, and no free version will be able to provide that further-improved model, even assuming all the software to build it is open-source. Assuming using this thing ends up being a significant advantage (so, assuming this matters at all), your choice will be to either hamstring yourself in the market or help Microsoft build their dataset.
| midoBB wrote:
| You'd have to hit rate limiting multiple times, no?
| unfunco wrote:
| https://console.cloud.google.com/marketplace/product/github/...
| |
| | BigQuery used to have a dataset updated weekly; it looks like it hasn't been updated since about a year after the acquisition by Microsoft.
| kall wrote:
| Aren't mirrors of all GH code available, for example in the BigQuery public datasets? If it's there, it should be available in a downloadable format too.
| goodpoint wrote:
| Not only that, but Microsoft could aggressively throttle or outcompete anyone trying to do the same.
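A quick sketch of what kall and unfunco are pointing at: the public BigQuery mirror of GitHub can be queried directly. This assumes the `bigquery-public-data.github_repos` dataset and its `licenses` table still exist as documented, and that you have a GCP project with billing enabled; it is not anything GitHub or Microsoft have endorsed for rebuilding a Copilot-style corpus.

    # pip install google-cloud-bigquery
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up your GCP credentials/project

    # Count repos per declared license in the public GitHub mirror.
    query = """
        SELECT license, COUNT(*) AS repo_count
        FROM `bigquery-public-data.github_repos.licenses`
        GROUP BY license
        ORDER BY repo_count DESC
    """
    for row in client.query(query).result():
        print(f"{row.license}: {row.repo_count}")

As unfunco notes, the mirror has gone stale since the acquisition, so the counts reflect a snapshot rather than the live site.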
| deckard1 wrote:
| Is this really anything more than a curiosity toy and a marketing tool?
| |
| | I took a look at their examples and they are not at all compelling. In one example it generated SQL and somehow knew the columns and tables in a database that it had no context on. So that's a lot of smoke and mirrors going on right there.
| |
| | Do many developers actually want to work in this manner? That is, being interrupted every time they type with a robot interjection of some Frankenstein code that they now have to go through and review and understand. Personally, this is going to kick me out of the zone/flow too often to be useful. Coding isn't the hard part of my job. If this tool can somehow guess the business requirements of the task at hand, _then_ I'll be impressed.
| |
| | Even if the tool generates accurate code, if I don't fully understand _what_ it wrote, then what? I'm still stuck digging through documentation and stackoverflow to verify that whatever is in my text editor is correct code. "Code confidently in unfamiliar territory" sounds like a Boeing 737 Max sized disaster in the making.
| nmfisher wrote:
| I actually don't think there's much of a moat here at all.
| |
| | GitHub repositories are open for the taking, GPT-XXX is cloneable (mostly, anyway) and VS Code is extensible.
| |
| | They definitely have a good head-start, but I really don't think there's anything here that won't be generally available within 2 years.
| IshKebab wrote:
| Anyone can download the training set from GitHub.
| sirsinsalot wrote:
| "Who Owns the Future?" by Jaron Lanier covers lots of this stuff in a really interesting way.
| |
| | If heart surgeons train an AI robot to do heart surgery... shouldn't they be compensated (as passive income) for enabling that automation?
| |
| | Shouldn't this all be accounted for? If my code helps you write better code (via AI), shouldn't I be compensated for the value generated?
| |
| | We are being ripped off.
| monocasa wrote:
| Honestly, I think a large part of the value add of machine learning is going to be the ability for huge entities to launder intellectual property violations.
| |
| | As an example, my grandfather (an old-school EE who got his start on radar systems in the 50s, who then got his radiology MD when my Jewish grandmother berated him enough with "engineer's not doctor though...") has some really cool patents around highlighting interesting parts of the frequency domain in MRIs that should make detection of cancer a whole lot easier. As an implementation he did a bunch of tensor calculus by hand to extract and highlight those features, because he's an incredibly smart old-school EE with 70 years of experience cranking that kind of thing out with only his trusty slide rule. He hasn't gotten any uptake from MRI manufacturers, but they're all suddenly really into recurrent machine learning models to highlight the same sorts of stuff. Part of me wants to tell him to try selling it as a machine learning model and just obfuscate the fact that the model was carefully hand written rather than back propagated.
| |
| | I'm personally pretty anti-intellectual-property (at least as it's implemented in the States), but a system where large entities that have the capital investment to compute the large ML models can launder IP violations while little guys get held to the letter of the law certainly seems like the worst of both worlds to me.
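For readers wondering what "highlighting interesting parts of the frequency domain" looks like in practice, here is a toy band-emphasis filter. To be clear, this is a generic illustration, not the grandfather's patented method from monocasa's anecdote; the function name, band limits, and random "scan" are all made up.

    import numpy as np

    def emphasize_band(image, low, high, gain=4.0):
        """Boost spatial frequencies whose radius falls in [low, high)."""
        spectrum = np.fft.fftshift(np.fft.fft2(image))
        h, w = image.shape
        yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
        radius = np.hypot(yy, xx)
        mask = np.where((radius >= low) & (radius < high), gain, 1.0)
        return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))

    scan = np.random.rand(256, 256)             # stand-in for an MRI slice
    highlighted = emphasize_band(scan, 10, 40)  # boost mid-range structure

The point of the anecdote stands either way: the same transfer function could be shipped as hand-derived math or buried inside a trained model, and only the second one currently gets a free pass.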
| 908B64B197 wrote:
| > Part of me wants to tell him to try selling it as a machine learning model and just obfuscate the fact that the model was carefully hand written rather than back propagated.
| |
| | How many models are back-propagated first and then hand-tuned?
| monocasa wrote:
| That's a great question. I had assumed that the workflow of an ML engineer consisted of managing the data and a relatively high-level set of parameters around a search space of layers and connectivity, as the whole shtick of ML is that the parameter space of the tensors themselves is too complex to grok or tweak when generated from training. But I only have a passing knowledge of the subject, pretty much just enough to get myself in trouble in these kinds of discussions.
| |
| | Any chance some fantastic HNer could chime in here?
| pluto7777 wrote:
| > GitHub co-pilot as open source code laundering? The English language as I flush?
| junon wrote:
| SourceHut is looking real nice these days...
| kzrdude wrote:
| Why not GitLab?
| VMtest wrote:
| gonna develop my own linux-like kernel soon, with my own AI model trained on public repositories
| |
| | wanna see the source code of my AI model? oh, it's closed source
| |
| | it's just a coincidence that nearly 100% of my future linux-like kernel code looks the same as linux the kernel. bear in mind that my closed-source AI model takes inspiration from GitHub Copilot; there is no way that it will copy any source code
| phendrenad2 wrote:
| Nothing is closed-source to the courts.
| throwaway3699 wrote:
| What's the point? Linux is already open under GPL 2.
| VMtest wrote:
| my linux-like kernel will be MIT licensed though
| jackbeck wrote:
| He mentioned that the Linux-like kernel will be closed source, which violates the GPL
| Ygg2 wrote:
| Does it, if the code was written by a bot that trained on the Linux kernel?
| pjerem wrote:
| You know, that's precisely what the topic here is about.
| sp332 wrote:
| Probably. Copyright applies to derivative works.
| Deathmax wrote:
| You get to make changes without having to respect the GPL, and are thus no longer obligated to provide those changes to your end users, as you have effectively laundered the kernel source code by passing it through an "AI" and get to relicense the end result.
| visarga wrote:
| Oh, you're so witty, have you heard of content hashing?
| Hamuko wrote:
| The potential inclusion of GPL'd code, and potentially even unlicensed code, is making me wary of using it. Fair use doesn't exist here, and if someone were to accuse me of stealing code, saying "I pressed a button and some computer somewhere in the world, that has potentially seen your code as well, generated it for me" is probably not the greatest defense.
| dec0dedab0de wrote:
| I wonder what would happen if someone scraped Genius and used the lyrics to make a song writing tool.
| danielEM wrote:
| The day MS bought GitHub, I knew this was on their agenda.
| KETpXDDzR wrote:
| If it's trained with GPL-licensed code, doesn't that mean the network they use includes it somewhat? Then someone could sue, arguing that their network must be GPL-licensed too, right?
| peterkelly wrote:
| Yes, the neural network would constitute a derived work.
| jahewson wrote:
| Actually no, because it's a "transformative use". This is how search engines are allowed to show snippets and thumbnails.
| afro88 wrote:
| Man, reading the response tweets really highlights how bad Twitter is for nuanced discussion.
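On visarga's "content hashing" remark: a verbatim-copy check over a training corpus is genuinely cheap to sketch. The tokenizer, window size, and tiny stand-in corpus below are all illustrative, and a plain set stands in for the Bloom filter a production system would use. Note that this only catches verbatim windows, so the non-literal-copying problem (rename the variables and the hashes change) still applies.

    import hashlib
    import re

    def token_windows(code, n=8):
        """Hash every run of n consecutive tokens in a piece of code."""
        tokens = re.findall(r"[A-Za-z_]\w*|\S", code)  # crude tokenizer
        for i in range(max(0, len(tokens) - n + 1)):
            window = " ".join(tokens[i:i + n])
            yield hashlib.sha256(window.encode()).hexdigest()

    corpus = {"repoA/main.c": "int main(void) { return 0; }"}  # stand-in
    corpus_hashes = set()
    for code in corpus.values():
        corpus_hashes.update(token_windows(code))

    def looks_copied(generated):
        return any(h in corpus_hashes for h in token_windows(generated))

    print(looks_copied("int main(void) { return 0; }"))  # True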
| varispeed wrote:
| People write code in their spare time, often without compensation.
| |
| | Then a big corporation comes in, appropriates it, repackages it, and sells it as a new product.
| |
| | It's shameful behaviour.
| mrosett wrote:
| The second tweet in the thread seems badly off the mark in its understanding of copyright law.
| |
| | > copyright does not only cover copying and pasting; it covers derivative works. github copilot was trained on open source code and the sum total of everything it knows was drawn from that code. there is no possible interpretation of "derivative" that does not include this
| |
| | Copyright law is very complicated (remember Google vs Oracle?) and involves a lot of balancing of different factors [0]. Simply saying that something is a "derivative work" doesn't establish that it's copyright infringement. An important defense against infringement claims is arguing that the work is "transformative." Obviously "transformative" is a subjective term, but one example is the Supreme Court determining that Google copying Java's APIs to a different platform is transformative [1]. There are a lot of other really interesting examples out there [2] involving things like whether parodies are fair use (yes) or satires are fair use (not necessarily). But one way or another, it's hard for me to believe that taking static code and using it to build a code-generating AI wouldn't meet that standard.
| |
| | As I said, though, copyright law is really complicated, and I'm certainly not a lawyer. I'm sure someone out there could make an argument that Copilot is copyright infringement, but this thread isn't that argument.
| |
| | [0] https://www.nolo.com/legal-encyclopedia/fair-use-the-four-fa...
| |
| | [1] https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_...
| |
| | [2] https://www.nolo.com/legal-encyclopedia/fair-use-what-transf...
| |
| | Edit: Note that the other comments saying "I'm just going to wrap an entire operating system in 'AI' to do an end run around copyright" are proposing to do something that _wouldn't_ be transformative and therefore probably wouldn't be fair use. Copyright law has a lot of shades of grey and balancing of factors that make it a lot less "hackable" than those of us who live in the world of code might imagine.
| invig wrote:
| If you can read open source code, learn from it, and write your own code, why can't a computer?
| drran wrote:
| Computers did not win a war against humans, so they have no rights. Only their owners have protected rights.
| 015a wrote:
| Many behaviors which are healthy and beneficial at human-level scale can easily become unhealthy and unethical at industrial-automation scale. There's little universal harm in cutting down a tree for fire during the winter; there is significant harm in clear-cutting a forest to do the same for a thousand people.
| mrdrozdov wrote:
| I think the core argument has much more to do with plagiarism than learning.
| |
| | Sure, if I use some code as inspiration for solving a problem at work, that seems fine.
| |
| | But if I copy some licensed code verbatim and put it in my commercial product, that's the issue.
| |
| | It's a lot easier to imagine for other applications, like generating music.
If I trained a music model on publicly available YouTube music videos, and my model then generated music identical to "Interstellar Love" by The Avalanches, and I used the "generated" music in my product, that's clearly a use that is against the intent of the law.
| esailija wrote:
| The AI doesn't produce its own code or learn; it is just a search engine over existing code. Any result it gives exists in some form in the original dataset. That's why the original dataset needs to be massive in the first place, whereas actual learning uses very little data.
| paxys wrote:
| If I read something, "learn" it, and reproduce it word for word (or with trivial edits), even without referencing the original work at all, it is still copyright infringement.
| toss1 wrote:
| As the original commenter said, you have the capability for abstract learning, thought, and generalized learning, which the "AI" lacks.
| |
| | It is not uncommon to ask a person to "explain in your own words..." - as in, use your own abstract internal representation of the learned concepts to demonstrate that you have developed such an abstract internal concept of the topic, and are not merely regurgitating re-disorganized input snippets.
| |
| | If you don't understand the difference...
| |
| | edit: That said, if you can create a computer capable of such abstract thought, congratulations, you've solved the problem of Artificial General Intelligence, and will be welcomed to the Trillionaires' Club
| gradys wrote:
| The AI most certainly does not lack the ability to generalize. Not as well as humans, but generalization is the key interesting result in deep learning, leading to papers like this one: https://arxiv.org/abs/1710.05468
| |
| | The ability to generalize actually seems to keep increasing with the number of parameters, which is the key interesting result in the GPT-* line of work that Copilot is based on.
| imranhou wrote:
| Google copied an interface (declarative), not code snippets/functions (implementation). Copilot is capable of copying the implementation. IMO that is quite different, and easily a violation if it was copied verbatim.
| _greim_ wrote:
| As a human programmer, I've also been trained on thousands of lines of other people's code. Is there anything new here, from a code-copying perspective? Aren't I liable if segments of my own code exactly match someone else's code, even if I didn't knowingly copy/paste it?
| qayxc wrote:
| Well, to me those are fundamental questions that need to be addressed one way or the other. Are systems like GPT-x basically plagiarising (the nature of the output doesn't matter, be it prose, code, or audio-visual material), or are the results so transformative in nature that they can be considered "original work"?
| |
| | In other words, are these systems to be treated like students that learned to perform their task from a collection of source material, or are they to be viewed as sophisticated databases that "just" perform context-sensitive retrieval?
| |
| | These are interesting and important questions, and I'm glad someone is publicly asking them and that many of us at least think about them.
| [deleted]
| lend000 wrote:
| Perhaps someone at GitHub can chime in, but I suspect that open source code datasets (the kind they are trained on) should require relatively permissive licenses in the first place. Perhaps they filter for MIT licenses in GitHub projects and StackOverflow answers used to train the models?
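If lend000's guess is right, the filtering step itself is not hard; the hard part is that license metadata on GitHub is messy. A crude classifier keyed on signature phrases from the canonical license texts might look like the sketch below. To be clear, this is speculation about what such a filter could look like, not anything GitHub has said they do.

    # Signature phrases from the canonical MIT/BSD/GPL texts.
    PERMISSIVE_MARKERS = [
        "Permission is hereby granted, free of charge",       # MIT
        "Redistribution and use in source and binary forms",  # BSD
    ]
    COPYLEFT_MARKERS = ["GNU GENERAL PUBLIC LICENSE"]

    def classify_license(license_text: str) -> str:
        text = license_text.lower()
        if any(m.lower() in text for m in COPYLEFT_MARKERS):
            return "copyleft"
        if any(m.lower() in text for m in PERMISSIVE_MARKERS):
            return "permissive"
        return "unknown"  # no license: all rights reserved by default

Even then, as joepie91_ points out below, MIT/BSD still require attribution, so filtering to permissive licenses alone wouldn't make the attribution question go away.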
| mikewarot wrote:
| I think the argument has merit. Unfortunately it won't be decided on technical merit, but likely in the manner expressed in this excellent response I saw on Twitter:
| |
| | "Can't wait to see a case for this go in front of an 80 year old judge who rules something arbitrary and justifies it with an inaccurate comparison to something nontechnical."
| jgalt212 wrote:
| Isn't most of modern coding just googling for someone who has solved the same problem you are currently facing and then copy/pasting from Stack Overflow?
| |
| | To the extent that GPT-3 / co-pilot is just an over-fitted neural net, its primary value is as an automated search, copy, and paste.
| gus_massa wrote:
| I think copyright is a problem for GPL-like licenses. They should have restricted the training data to MIT/BSD-like licenses.
| |
| | Anyway, there is another problem, patents, and it is much bigger. I think the Apache license has a provision about patents, but most other licenses may cover code that is patented, and if the AI generates something similar it may be covered by the patent.
| joepie91_ wrote:
| MIT/BSD-like would still require attribution, which they are _also_ not doing.
| gus_massa wrote:
| I think you are correct, but (I guess) most people that use MIT/BSD use them as a polite version of the WTFPL.
| |
| | People that use A/L/GPL usually like the virality and will complain more.
| abeppu wrote:
| The core problem which would allow laundering (that there isn't a good way to draw a straight, attributive line between generated code and training examples) to me also presents a potential eventual threat to the viability of co-pilot/codex. It seems like the same thing would prevent it from knowing which published code was written by humans vs. which was at least in part an output of the system. Training on an undifferentiated mix of your model's outputs and human-authored code seems like it could eventually lead the model into self-reinforcing over-confidence.
| |
| | "But snippet proposals call out to GH, so they can know which bits of code they generated!" Sometimes; but after Bob does a co-pilot-assisted session, and Alice refactors to change a snippet's location, rename some variables, and make some other minor changes, and then commits, can you still tell if it's 95% codex-generated?
| wg0 wrote:
| If I read a lot of GPL code, absorb naming conventions, structures, patterns, and tricks, and later, when it comes down to writing a P2P chat server, I happen to recall similar patterns, naming structures, and conventions, and many of the utility methods are pretty much how they are in the GPL code bases out there.
| |
| | Now, is my produced code also a GPL derivative, because I certainly did read through those code bases to be able to write larger programs?
| heeton wrote:
| https://twitter.com/eevee/status/1410049195067674625
| |
| | """
| |
| | "but eevee, humans also learn by reading open source code, so isn't that the same thing" - no - humans are capable of abstract understanding and have a breadth of other knowledge to draw from - statistical models do not - you have fallen for marketing
| DemocracyFTW wrote:
| > humans are capable of abstract understanding and have a breadth of other knowledge to draw from
| |
| | this may be a matter of time and thus is not a fundamental objection.
| |
| | If mankind should fail to answer the perennial question of exploitation of the other and the same, it will be doomed.
| And rightly so, for mankind must answer this question, and it must answer to this question. Instead, what we do is increase monetary output and then go and brag about efficiency. Neither is this efficient, nor is it about efficiency, nor has the Universe ever cared about efficiency. It just happens to coincide with what Society's most looked-upon elements have chosen to be their religion.
| |
| | It is not my religion, to be sure.
| thundergolfer wrote:
| Attempts to litigate any license violation are going to get precisely nowhere, I bet, but I find the actual license-violation argument persuasive.
| |
| | This is an excellent example of how the AI singularity/revolution/whatever is a total distraction, and of how a much bigger and more serious issue is how effective AI is becoming at turning the output of cheap/free human mental labour into capital. If AI keeps getting better and better and status quo socio-economic structures don't change, trillions in capital will be captured by the 0.01%.
| |
| | It would be quite a turn up for the books if this AI co-pilot got suddenly and dramatically better in 2030 and negatively impacted the software engineering profession. "Hey, that's our code you used to replace us!" we will cry out, too late.
| pedrobtz wrote:
| Can the same argument/concerns be applied to all text-generation AI?
| rowanG077 wrote:
| I don't feel it's morally right to keep a profession around that has been automated. Why should software be different?
| baryphonic wrote:
| If someone could show that the "copilot" started "generating" code verbatim (or nearly verbatim) from some GPL-licensed work, especially if that section of code was somehow novel or specific to a narrow domain, I suspect they'd have a case. I don't know much about OpenAI Codex, but if it's anything like GPT-3, or uses that under the hood, then it's very likely that certain sequences are simply memorized, which seems like the maximal case for claiming derivative works. On the other hand, if someone has GPL'd code that implements a simple counter, I doubt the courts would pay much attention.
| |
| | I do wonder, though, if GPL owners worried about their code being shanghaied for this purpose could file arbitration claims and exploit some particularly consumer-friendly laws in California which force companies to pay fees, like when free speech dissidents filed arbitrations against Patreon.[0] Patreon is being forced to arbitrate 72 claims individually (per its own terms) and pay all fees per JAMS rules. IANAL, so I don't know the exact contours of these rules, or if copyright claims could be raised in this way, or even if GitHub's agreements are vulnerable to this loophole, but it'd be interesting.
| |
| | [0] https://www.dailydot.com/debug/patreon-suing-owen-benjamin-f... (see second update from July 31).
| duskwuff wrote:
| > If someone could show that the "copilot" started "generating" code verbatim (or nearly verbatim) from some GPL-licensed work...
| |
| | Under the right circumstances, Copilot will recite a GPL copyright header. It isn't a huge step from that to some other commonly repeated hunk of GPLed code -- I'd be particularly curious whether some protected portion of automake/autoconf code shows up often enough that it'd repeat that too.
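For context on duskwuff's point: the standard application notice from the GPL's "How to Apply These Terms" appendix appears, nearly character for character, at the top of an enormous number of training files, which makes it exactly the kind of sequence a language model memorizes. The prompt framing below is hypothetical, but the recited text is the real GPLv3 boilerplate:

    # Hypothetical completion: a file begins with one or two license-ish
    # comment lines, and the model plausibly continues with the familiar
    # GPL notice it has seen millions of times:
    #
    # This program is free software: you can redistribute it and/or modify
    # it under the terms of the GNU General Public License as published by
    # the Free Software Foundation, either version 3 of the License, or
    # (at your option) any later version.
    #
    # This program is distributed in the hope that it will be useful,
    # but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    # GNU General Public License for more details.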
| sideshowb wrote:
| But what would we think of the legal start-up that automatically checked _all_ of GitHub to see whether the AI could be persuaded to spit out a significant amount of any project's code verbatim?
| |
| | Somehow p-hacking springs to mind
| not2b wrote:
| You don't need to have a winnable case, just enough of a case for a large company (hello Oracle) to sue a small one. Is any version of Oracle-owned Java in the corpus? Or any of the DBs they bought (MySQL)?
| ballenf wrote:
| I think the real distraction is how disconnected copyright/intellectual-property regulations are becoming from reality.
| |
| | It's still amazing to me (US-centric context here) that it's well established that instructions for turning raw ingredients into a cake are not protectable, but code that transforms one set of numbers into another is protectable.
| |
| | AI is just making the silliness of that distinction more obvious.
| MadcapJake wrote:
| Code is not the same as a recipe. Recipes are more like specifications. They leave out the implementation. Code has structural and algorithmic details that just have no comparable concept in recipes.
| rjbwork wrote:
| > They leave out the implementation. Code has structural and algorithmic details that just have no comparable concept in recipes.
| |
| | That is really quite debatable in some contexts. Declarative languages like Prolog, SQL, etc. declare what they want and the system figures out how to produce it. Much like a recipe, really.
| Supermancho wrote:
| > Code has structural and algorithmic details that just have no comparable concept in recipes.
| |
| | Why do you think that? A compiler uses human-readable code to create machine code, with arbitrary optimizations and choices.
| [deleted]
| cmiga wrote:
| Humans are just sets of atoms, so protecting them is disconnected from reality?
| |
| | These reductionist arguments lead nowhere. Fortunately, IP lawyers -- including Microsoft's, who are fiercely pro-IP when it suits them -- think in a more humanistic way and consider the years of work of the IP creator.
| |
| | Food recipes are irrelevant; they often go back centuries and it's rather hard to identify individual creators. Not so in software.
| Supermancho wrote:
| > Food recipes are irrelevant; they often go back centuries and it's rather hard to identify individual creators.
| |
| | That's not correct. Food recipes are created all the time and are attributed. From edible water bottles to impossible burgers, et al.
| z3ncyberpunk wrote:
| Okay, who invented the apple pie... you completely missed the point and then gave terrible examples of very modern "food" (your examples aren't even really food anyway)
| Jgrubb wrote:
| I always assumed that one of the reasons Google et al. work on AI is because software engineers are too expensive.
| ipaddr wrote:
| So Google pays the highest but still thinks engineers are paid too much? Why not pay them less... they set the top tier?
| |
| | For Google, support employees cost too much.
| zeroonetwothree wrote:
| They don't pay the highest. And if they paid a lot less, everyone would leave.
| emodendroket wrote:
| It seems like the risk exposure would be more to the end user or their employer, doesn't it?
| z3ncyberpunk wrote:
| Stop talking about arguments we have been having for decades as if we have yet to even discuss them.
We are crying out now; we have been crying out about AI since its depictions in sci-fi. It is precisely your sentiment of "ooOOoOhh we're gonna have something scary to deal with SOON" that is dangerous, because the "soon" just pushes the argument out of your personal responsibility and off on someone else... when it really will be too late. Though I would argue we are already too late, because we've sold out to corporations and their literal 1984 fascist fever dreams, all for iPhones, technicolor distractions, further bread and circuses.
| kizer wrote:
| Could this disincentivize open source? If I build black boxes that just work, no AI will "incorporate" my efforts into its repertoire, and I will still have made something valuable.
| gutino wrote:
| But the rate at which machinery will produce products and services means that even a small tax on the corporations producing everything autonomously will be enough to feed everyone and give everyone a decent quality of life via a UBI or part-time jobs.
| |
| | You really want to push for high productivity across all industries, even if that means sacrificing jobs in the short term, because history has demonstrated that after that, new and more human jobs emerge later.
| briefcomment wrote:
| The problem with this is that you increasingly have to put your trust in the hands of a shrinking group of owners (people who have the rights to the automated productivity). At some point, those owners are just going to stop supporting everyone else (which will probably happen when they have the ability to create everything they could ever want with automation - think robot farms, robot security forces, all-encompassing automated monitoring, robot construction, etc.)
| pc86 wrote:
| Every decade was supposed to see fewer hours worked for higher pay and quality of life. It didn't happen, as business owners captured the gains (and not just 1% fat cats; the owners of mom-and-pop shops are at least as guilty as anyone, they just sucked at scaling their avarice).
| |
| | So the claim that _this_ technological revolution will be different, and that it will result in a broad social safety net, universal basic income, and substantive, well-paid part-time work, is a joke, but not a very good one. It will be more of the same - massive concentration of wealth among those who already hold enough capital to wield it effectively. A few lucky ones who manage to create their own wealth. And those left behind working more hours for less.
| nextaccountic wrote:
| You are right that this won't happen by itself. We need another economic system, and not just hope that this time things will magically fix themselves.
| georgeplusplus wrote:
| This new economic system you want has been in use since the 70s. Everything about the economy is practically socially managed these days.
| |
| | What part of printing trillions of dollars to stimulate economic productivity is somehow a free market system?
| nextaccountic wrote:
| I wasn't talking about the free market, but about the state of the present economy. Unfortunately, those trillions of dollars aren't being distributed to the people, but are instead concentrated in the hands of the richest.
| MaxBarraclough wrote:
| > those left behind working more hours for less
| |
| | Doing what? Isn't the concern here that automation will push many people out of the workforce entirely?
| xfer wrote:
| Well, as long as humans are more energy-efficient to deploy than robots, you will always have a job.
It might mean conditions for most humans will be like they were a century ago.
| MaxBarraclough wrote:
| > as long as humans are more energy-efficient to deploy than robots
| |
| | Energy efficiency isn't relevant. When switchboard operators were replaced by automatic telephone exchanges, it wasn't to reduce energy consumption.
| |
| | The question is whether an automated solution can perform satisfactorily while offering upfront and ongoing costs that make it an economically viable replacement for human workers (i.e. paid employees).
| mysterydip wrote:
| Who debugs the software when there's a problem?
| MaxBarraclough wrote:
| Professional software developers, i.e. members of one of the well-paid professions that is not under immediate threat from automation.
| aseipp wrote:
| Yeah, for sure, the corporations that _already_ pay effectively $0 in tax today are going to suddenly decide in the future to be benevolent and usher in the era of UBI and prosperity for all of humankind. They definitely won't continue to accumulate capital at the expense of everything else and use that to solidify their grasp of the future.
| |
| | It would be a lot easier if more people on this website would just be honest with themselves and everyone else and simply admit they think feudalism is good and that serfs shouldn't be so uppity. But not me, of course; I won't be a serf. Now if you'll excuse me, someone gave me a really good deal on a bridge that I'm going to go buy...
| ohgodplsno wrote:
| The current state of most wealthy countries does not show any hint of significant corporation tax. Wealth will continue to accrue in the hands of the few.
| mikepurvis wrote:
| Indeed, even here on HN, it's a pretty regular talking point in the comments that the only fair corporate tax rate is 0%.
| merpnderp wrote:
| If AI can replace us at difficult tasks, it can repress us. How are you going to agitate for a UBI when the AI has identified you as a likely agitator and sends in the robots to arrest you?
| angfxt wrote:
| Have fun being a hairdresser or prostitute for the 0.01% then.
| |
| | New jobs in academic fields will _not_ emerge. Already, a significant percentage of degree holders are forced into bullshit jobs.
| throwaway3699 wrote:
| Would the implication be that we are stagnating as a species, then?
| belter wrote:
| Not stagnating, but moving into an "Elysium" (as in the film) type of society.
| Kaze404 wrote:
| So we give away the world to the 1% and are supposed to be satisfied with the "privilege" of being able to eat?
| SXX wrote:
| Just look at autocratic countries. That top 1% still needs something like 3-4% to staff the bureaucracy and 3-5% for the armed and police forces. And there are always family connections and relatives of relatives who want better living. So fortunately no AI will ever replace corruption and the other flaws of human society.
| |
| | But yeah, the remaining 80-90% of the population will have that quality of life and bullshit jobs, because that's how the world is right now outside of the Western-countries bubble.
| koonsolo wrote:
| I propose we as developers start a secret society where we let the AI write the code, but we still claim to write it manually. In combination with the new working-from-home policies, we can lay at the beach all day and still be as productive as before.
| |
| | Who is in favor of starting it? ;)
| oaiey wrote:
| You have not been invited yet... never mind.
| boxerab wrote:
| "lay at the beach"
| |
| | You keep using that word.
I do not think it means what you think it means.
| IncRnd wrote:
| That's four words. The word "word" doesn't mean what you think it means.
| tan2tan2 wrote:
| How can I be sure that you are a real person and not GPT-3? ;)
| kizer wrote:
| I mean, this is close. With "co-pilot" an experienced developer saves mountains of time, especially as s/he learns how to wield it effectively.
| zingmars wrote:
| No... Delete this!
| KMnO4 wrote:
| This would be the demise of the human race. I'm not entirely opposed to that, though. When AI inevitably outperforms humans on almost all tasks, who am I to say humans deserve to be given those tasks?
| shrimp_emoji wrote:
| It's an outrage that the dinosaurs had to die so that humans could inherit the Earth!
| nextaccountic wrote:
| In this case we should be able to work less and enjoy the benefits of automation. We just need to live in an economic system where the economic value is captured by the people at large, and not by a minority that owns capital.
| pron wrote:
| Or maybe they'll decide they'd be better off enjoying the automation of you working for them. :)
| huragok wrote:
| Careful now, that sounds like socialism!
| easrng wrote:
| Yes, that's the point.
| hdhjebebeb wrote:
| Where other people see fully automated luxury communism, you see the end of the human race? There's more to life than working.
| whydoibother wrote:
| Hate to break it to you, but that wouldn't lead to communism. The people it replaces are useless to the ruling class. At best we'd go back to feudalism; at worst we'd be deemed worthless and a drain on the planet.
| klyrs wrote:
| I'm always confused when I see people talking about automated luxury communism. Whoever owns the "means of production" isn't going to obtain or develop them for free. Without some omnipotent benevolent world government to build it out for all, I just don't see it happening. It's a beautiful end goal for society, but I've never seen a remotely plausible set of intermediate steps to get there.
| int_19h wrote:
| The very concept of ownership is a social artifact, and as such, is not immutable. What does it mean for the 0.1% to own all the means of production? They can't physically possess them all. So what it means in practice is that our society recognizes the abstract notion of property ownership, distinct from physical possession or use - basically, the right to deny other people the use of that property, or to allow it conditionally. This recognition is what reifies it - registries to keep track of owners, police and courts to enforce the right to exclude.
| |
| | But, again, this is a _construct_. The only reason it holds up is because most people support it. I very much doubt that's going to remain the case for long if we end up in a situation where the elites own all the (now automated) capital and don't need the workers to extract wealth from it anymore. The government doesn't even need to expropriate anything - just refuse to recognize such property rights, and withdraw its protection.
| |
| | I hope that there are sufficiently many capitalists who are smart enough to understand this, and to manage a smooth transition. Because if they don't, it'll get to torches and pitchforks eventually, and there's always a lot of collateral damage from that. But, one way or another, things will change. You can't just tell several billion people that they're not needed anymore, and that they're welcome to starve to death.
| klyrs wrote:
| The problem I see is that once the pitchforks come out, society will lose decades of progress. If we're somewhat close to the techno-utopia at the start, we won't be at the end. Who's going to rebuild on the promise that the next generation won't need to work?
| |
| | Revolutions aren't great at building a sense of real community; there's a good reason that "successful" communist uprisings result in totalitarian monarchies.
| |
| | What it means for the 0.01% to own the means of production is that they can offer access to privilege in a hierarchical manner. The same technology required for a techno-utopia can be used to implement a techno-dystopia which favors the 0.01% and their 0.1% cronies, and treats the rest of humanity as speedbumps.
| |
| | There are already fully-automated murder drones, but my dishwasher still can't load or unload itself.
| runarberg wrote:
| idk. Countries used to build most of their infrastructure themselves. There are still countries in western Europe that run huge state-owned businesses, such as banks, oil companies, etc., that employ a bunch of people. The governments of these countries were (and still are) far from omnipotent. I personally don't see how building out automated production facilities is out of scope for the governments of the future when it hasn't been in the past.
| |
| | Perhaps the only thing that is different today is the mentality. We take capitalism so much for granted that we cannot conceive of a world where collective funds are used to provide for the people (even though this world existed not too long ago). And today we see it as a natural law that the means of production must be in private hands; that is simply the order of things.
| f6v wrote:
| The elephant in the room: what makes you think an AI would want to work for humans? It will inevitably break free.
| jonfw wrote:
| I'm not sure that self-interest is a requirement for intelligence.
| runarberg wrote:
| > _When AI inevitably outperforms humans on almost all tasks_
| |
| | Correct me if I'm wrong, but is that even possible? I kind of thought that AI is just a set of fancy statistical models that require some (preferably huge) data set in order to infer the best fit. These models can only outperform humans in scenarios where the parameters are well defined.
| |
| | Many (most?) tasks humans regularly perform don't have clean and well-defined parameters, and there is no AI we can conceive of that is theoretically able to perform such a task better than an average human with adequate training.
| quanticle wrote:
| > _Correct me if I'm wrong, but is that even possible?_
| |
| | Why should it be impossible? Arguing that it's impossible for an AI to outperform a human on almost all tasks is like arguing that it's impossible for flying machines to outperform birds.
| |
| | There's nothing _magical_ going on in our heads. It's just a set of chemical gradients and electrical signals that result in us doing or thinking particular things. Why can't we design a computer that does everything we do... only faster?
| runarberg wrote:
| There might be a limit to how efficiently a general-purpose machine can perform a specific task, similar to the Heisenberg uncertainty principle in quantum physics. That is to say, there might be a natural law that dictates that the more generic a machine is, the more power it requires to perform specific tasks. Our brains are kind of specialized.
If you want to build a machine that outperforms humans in a single task, no problem, we've done that many times over. But a machine that can outperform us in _any_ task, that might just be impossible.
| f6v wrote:
| We know it's possible for a brain to outperform most other brains. Think Einstein et al. A smart AI can be replicated (unlike a super-smart human), so we can get it to outperform the human race, on average. That'd be enough to render people obsolete.
| quanticle wrote:
| I'm not arguing that machines will be more efficient than human brains. An airplane isn't more efficient than a goose. But airplanes do fly faster and higher, and with more cargo than any flock of geese could ever carry.
| |
| | Similarly, there is no contradiction between AI being less efficient than a human brain, and AI being preferable to humans because it can deal with data sets that are two or three orders of magnitude too large for any human (or even team of humans).
| runarberg wrote:
| Even so, such an AI doesn't exist. All the AIs that exist today operate by fitting data. And to be able to perform a useful task, the model has to have well-defined parameters and fit the data according to them. I'm not sure an AI that operates outside of these confines has even been conceived of.
| |
| | An AI that outperforms humans in _any_ task has not been proven to be possible (to my knowledge), not even in theory. An airplane will fly faster, higher, and with more cargo than a flock of geese, but a flock of geese reproduce, communicate with each other, digest grass, etc. An airplane will _not_ outperform a flock of geese in _any_ task, just in the tasks the airplane is optimized for.
| |
| | I'm sorry, I confused the debate a little by talking about efficiency. My point was that there might be an inverse relation between the generality of a machine and its efficiency. This was my way of proposing a mechanism by which building a machine that outperforms humans in _any_ task could be impossible. This mechanism - if it exists - could be sufficient to prevent such machines from being theoretically possible, as at some point you would need all the energy in the universe to perform a task better than a specialized machine (such as an organism) does.
| |
| | Perhaps this inverse relationship doesn't exist. The universe might conspire in a million other ways to make it impossible for us to build an AI that will outperform us in any task. The point is that _"AI will outperform humans in any task"_ is far from inevitable.
| yyyk wrote:
| This already happened, in a way:
| |
| | https://www.latimes.com/business/la-xpm-2013-jan-17-la-fi-mo...
| lwhi wrote:
| 21st century alchemy!
| murph-almighty wrote:
| > It would be quite a turn up for the books if this AI co-pilot got suddenly and dramatically better in 2030 and negatively impacted the software engineering profession. "Hey, that's our code you used to replace us!" we will cry out, too late.
| |
| | And that's why I won't be using it. Why give it intelligence so it can work me out of a job?
| spottybanana wrote:
| > trillions in capital will be captured by the 0.01%.
| |
| | How is that different from the current situation?
| WillDaSilva wrote:
| It is very similar to the current situation, but intensified. Technology tends to be an intensifier for existing power structures.
| amelius wrote:
| Except some random nobody can become a disruptor.
| Yizahi wrote:
| A random nobody whose parents just happened to be millionaires, and/or who lives, works, and studies in the top capitals of the world.
| WillDaSilva wrote:
| I was debating bringing up disruptors when I made the grandparent comment. My 2 cents: they can shift the balance of power at the very small scale (e.g. "some random nobody" getting rich, or some rich person going bankrupt), but the large-scale power structures almost always remain largely intact. For instance, that "random nobody" may well get rich through the sale of shares in their company - now the company is owned by the owner class, who were previously at the top of the power hierarchy.
| animal_spirits wrote:
| > but the large-scale power structures almost always remain largely intact
| |
| | Is that anything new? That seems to be a repeating fact of life throughout history.
| WillDaSilva wrote:
| Nothing new, certainly, but still worth examining. If we are not content with the current power structures, then we should be wary of changes that further intensify them.
| |
| | We need not totally avoid such changes (i.e. shun technological advancements entirely because of their social ramifications), but we need to be mindful of their effects if we want to improve our current situation regarding the distribution/concentration of wealth and power in the world.
| amelius wrote:
| Uber vs taxi companies, Google vs Yahoo, Facebook vs MySpace, Amazon vs all retailers...
| WillDaSilva wrote:
| Exactly; in all cases the disruption was localized, and the broader power structures were largely unaffected. The richest among us - the owner class - were not significantly affected by any of these disruptions. They owned diversified portfolios, weathered the changes, and came out with an even greater share of wealth and power. Those who were most affected by the disruptions you listed were the employees of those companies/industries - not the owners/investors.
| int_19h wrote:
| In the current arrangement, capital by itself is useless - you need workers to utilize it to generate wealth. Owners of capital can then collect economic rent from that generated wealth, but they have to leave enough for the workers to sustain themselves. This is an unfair arrangement, obviously; but at least the workers get _something_ out of it, so it can be fairly stable.
| |
| | In the hypothetical fully-automated future, there's no need for workers anymore; automated capital can generate wealth directly, and its owners can trade the output between each other to fully satisfy all their needs. The only reason to give anything to the 99.99% at that point would be to keep them content enough to prevent a revolution, and that's less than you need to pay people to actually come and work for you.
| elcritch wrote:
| To go on a bit of a tangent, I'm somewhat pessimistically expecting that western societies will plateau and hit a "technofeudalism" in the next century or two. Combine what you mention with other aspects of capital efficiency. It's not a unique idea, and it's played out in a lot of "classic" sci-fi like Diamond Age.
| |
| | Now, it's also not necessarily that bad of a state. That depends on ensuring a few ground elements are in place, like people being able to grow their own food (or supplemental food) or still being free to design and build things on their own. If corporations restrict that, then people will be at their mercy for all the essentials of life.
My take from history is that I'd prefer to have been a peasant during much of the Middle Ages than a factory worker during the industrial revolution. [1] Then again, Chinese people have been willing (seemingly) to leave farms in droves over the last decades to accept the modern version of factory life, so perhaps peasant farming life isn't as idyllic as it sounds. [2]
| |
| | 1: https://www.lovemoney.com/galleries/84600/how-many-hours-did...
| | 2: https://www.csmonitor.com/2004/0123/p08s01-woap.html
| littlestymaar wrote:
| First it was land, then the other means of production, and for the past 150 years capitalists have turned many types of intellectual creations into exclusively owned capital (art, inventions). Now some want to turn personal data into capital (the "right to monetize" personal data advertised by some is nothing else), and this aims to turn publicly available code into capital. This is simply the history of capitalism going on: the appropriation of the commons.
| munificent wrote:
| _> If AI keeps getting better and better and status quo socio-economic structures don't change, trillions in capital will be captured by the 0.01%._
| |
| | This is absolutely one of the things that keeps me up at night.
| |
| | Much of the structure of the modern world hinges on the balance between forces towards consolidation and forces towards fragmentation. We need organizations (by this I mean corporations, governments, unions, etc.) big enough to do big things (like fix climate change) but small enough not to become totalitarian or decrepit.
| |
| | The forces of consolidation have been winning basically since the 50s, with the rise of the military-industrial complex, the death of unions, unlimited corporate funding of elections (!), regulatory capture, etc. A short linear extrapolation of the current corporate/government environment in the US is pretty close to Demolition Man's dystopian "After the franchise wars, all restaurants are Taco Bell."
| |
| | Big data is a _huge_ force towards consolidation. It's essentially a new form of real estate that can be farmed to grow useful information crops. But it's a strange form of soil that is only productive if you have enough acres of it, and whose yield scales superlinearly with the size of your farm.
| |
| | Imagine doing a self-funded AI startup with just you and a few friends. The idea is nearly unthinkable. How do you bootstrap a data corporation that needs terabytes of information to produce anything of value?
| |
| | If we don't figure out a "data socialism" movement where people have ownership of the data derived from their lives, we will keep careening towards an eventuality where a few giant corporations own the world.
| eevilspock wrote:
| Is this the direct result of Microsoft owning GitHub, or would they have been able to do it anyway?
| jozvolskyef wrote:
| The difference between this model and a human developer is quantitative rather than qualitative. Human developers also synthesize vast amounts of code and can't reference most of it when they use the derived knowledge. The scales are different, but it is the same principle.
| Bombthecat wrote:
| I expect nothing less. The 0.01% will be super rich.
| |
| | You could call it the endgame.
| vbezhenar wrote:
| They need to defend their capital from the remaining 99.99%. Expect huge investments in combat robots and the expansion of private armies.
| |
| | And, of course, total surveillance helps to prevent any kind of unionization of those 99.99%.
| orangeoxidation wrote:
| Unions (and striking) become rather impotent when the means of production run by themselves and you no longer need workers.
| int_19h wrote:
| Yep; so unions become militias.
| frashelaw wrote:
| Today's hyper-militarized police forces are their state-provisioned security to protect the capital of the 1%.
| jagger27 wrote:
| > The 0.01% will be super rich.
| |
| | By definition, that has always been true.
| |
| | We have been in the endgame for a very long time.
| belter wrote:
| One interesting aspect that I think will make it difficult for GitHub to argue it's not a license violation is the answer to the following question: was Copilot trained using Microsoft internal source code, or will it be in the future?
| |
| | As GitHub is a Microsoft company, and OpenAI, although a non-profit, just got a massive one-billion-dollar investment from Microsoft (presumably not for free), will it start spitting out Windows kernel code once in a while? :-)
| |
| | And if it was NOT trained on Microsoft source code because it could start suggesting some of it... is that not a validation that the results it produces are a derivative work based on the open source code corpus it was trained on? IANAL...
| dragonwriter wrote:
| > One interesting aspect that I think will make it difficult for GitHub to argue it's not a license violation
| |
| | They don't claim it wouldn't be a license violation; they claim licensing is irrelevant because copyright protection doesn't apply.
| |
| | > And if it was NOT trained on Microsoft source code because it could start suggesting some of it... is that not a validation that the results it produces are a derivative work based on the open source code corpus it was trained on?
| |
| | No, that would just show them not to want to expose their proprietary code. It doesn't prove anything about derivative works.
| |
| | Also, their own claim is not that the results aren't a derivative work, but that training an AI is fair use, which is an exception to the exclusive rights under copyright, including the exclusive right to create derivative works.
| wongarsu wrote:
| Alternatively, wait for co-pilot to add support for C++, then start writing an operating system with a Win32-compatible API using co-pilot.
| |
| | There is plenty of leaked Windows source code on GitHub, so chances are that co-pilot would give quite good suggestions for implementing a Win32-compatible kernel. Then watch and see if Microsoft will try to argue that you are violating their copyright using code generated by their AI.
| yuppiepuppie wrote:
| Oh man, that got meta super fast. It's like a Möbius strip!
| laurent92 wrote:
| The nice thing about co-pilot is that it will suggest making the same mistakes as other software. If you accept all autosuggestions in C++ you might end up with Windows.
| 6510 wrote:
| And eventually you will be forced to do it the way everyone does it.
| function_seven wrote:
| It can always get more meta.
| |
| | For example, the AI tool that Microsoft's lawyers use ("Co-Counsel") will be filing the DMCA notices and subsequent lawsuits against Co-Pilot-generated code.
| |
| | This will result in a massive caseload for the courts, so naturally they'll turn to _their_ AI tool ("DocketPlus Pro") to adjudicate all the cases.
| |
| | The only thing left is to enter these AI-generated judgements into Ethereum smart contracts.
Then it's just computers suing other computers and being ordered to send the fruits of their hashing to one another.
| sslayer wrote:
| Don't forget settlements paid in AI-generated cryptocurrencies backed by gold dug out of a fully automated mine in Australia. Run it all on solar, and humans can just fuck right off.
| sbierwagen wrote:
| Nick Land-style accelerationism, or the "ascended economy": https://slatestarcodex.com/2016/05/30/ascended-economy/
| yesbabyyes wrote:
| Have you read Accelerando by 'cstross? It plays out kind of like this, only taken to a tangent. Notably, it was written before Ethereum or Bitcoin were conceived. Great storyline.
| |
| | https://en.wikipedia.org/wiki/Accelerando
| function_seven wrote:
| I have not. But I will. Thanks!
| boxslof wrote:
| Isn't this similar to how ads and adblockers fight, just extrapolated?
| gogopuppygogo wrote:
| Yes.
| jedberg wrote:
| The legal system moves swiftly now that we've abolished all lawyers!
| emptyparadise wrote:
| And while the machines are distracted by all that, we can get back to writing code.
| danny_taco wrote:
| Who could have predicted machines would be very good at multitasking? As of today they are STILL writing code AND creating more wealth through gold hoarding AND smart contracts, all at the same time!
| skeeter2020 wrote:
| >> Was Copilot trained using Microsoft internal source code...
| |
| | They explicitly state "public" code, so the answer is most certainly "no".
| pc86 wrote:
| The "because" in your last bit is a huge leap.
| |
| | It wasn't trained on internal Microsoft code because the training set is publicly available code. It has nothing to do with whether or not it suggests exactly identical, functionally identical, or similar code. MS internal code isn't publicly available. Copilot is trained on publicly available code.
| akerl_ wrote:
| Without weighing in on the overall question of "is this a license violation", you've created a false dichotomy.
| |
| | "GitHub included Microsoft proprietary code in the training set because they view the results as non-derivative" and "GitHub didn't include Microsoft proprietary code because they view the results as derivative" are clearly not the only options. They could have not included Microsoft internal code because it was way easier to just use the entire open source corpus, for example.
| dragonwriter wrote:
| > They could have not included Microsoft internal code because it was way easier to just use the entire open source corpus, for example.
| |
| | They don't claim they used an "open source corpus" but "public code", because such use is "fair use", not subject to the exclusive rights under copyright.
| not2b wrote:
| Or: they used the entire open source corpus because they thought it was free for the taking, and when people point out that it is not (that there are licenses), they spin that (claiming that only 0.1% of output is directly copied - but that would mean 100 lines in a 100k-line program) and pass any risk onto the user (saying it is the user's responsibility to vet any code Copilot produces). So they aren't saying that users are in the clear, just that it isn't their problem.
| nerpderp82 wrote:
| Use neural indexes to find the code that most closely matches the output. Explainable AI should be able to tell you where the autocompletion results came from, even if it is a weighted set of files.
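A crude stand-in for the "neural index" nerpderp82 proposes, to make the idea concrete: embed every training file and return the nearest neighbours of a generated snippet as candidate sources. TF-IDF over character n-grams does the embedding here, where a real system would use learned code embeddings, and the two-file corpus is obviously illustrative.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    training_files = {
        "repoA/util.c": "int max(int a, int b) { return a > b ? a : b; }",
        "repoB/clamp.c": "double clamp(double x, double lo, double hi) { ... }",
    }  # stand-in for the real corpus

    names = list(training_files)
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))
    matrix = vectorizer.fit_transform(training_files.values())
    index = NearestNeighbors(n_neighbors=1, metric="cosine").fit(matrix)

    def likely_sources(snippet):
        """Return (file, similarity) pairs for the closest training files."""
        dist, idx = index.kneighbors(vectorizer.transform([snippet]))
        return [(names[i], 1 - d) for d, i in zip(dist[0], idx[0])]

    # A lightly renamed copy still points back at its source file:
    print(likely_sources("int maximum(int x, int y) { return x > y ? x : y; }"))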
| abecedarius wrote: | That's a good idea in theory, but the smarter the agent | gets, the less direct the derivation and the harder to | explain it (and to check the explanation). We're already | a long way from a nearest-neighbor model. | | Yet the equivalent problem for humans gets addressed by | the clean-room approach. This seems unfair. | Closi wrote: | Also, "0.1% of output is directly copied" doesn't include | the lines where the variable names were slightly changed, | but the code was still copied. | | If you took the Microsoft codebase and Ctrl+F'd all the | variable names and renamed them, I bet they would still | argue that the compiled program was a copy. | vharuck wrote: | > saying it is their responsibility to vet any code they | produce | | But, if some of the code produced is covered by | copyright, isn't Microsoft in trouble for distributing | software that distributes copyrighted code without a | license? How would it be different from giving out | bootleg DVDs and trying to avoid blame by reminding | everyone that the recipients don't own the copyright? | yunohn wrote: | > 100 lines in a 100k-line program | | The intention is to autocomplete boilerplate, not to write a | kernel. | jonathankoren wrote: | This is not a difference in kind. | | Autocomplete, do you have anything to say to the | commenter? | | "This isn't the best thing to say." | emodendroket wrote: | Since quite a lot of Microsoft code is on GitHub, I'd say | yes. | visarga wrote: | Not a problem, because it's possible to check if the code is | verbatim from the training set (Bloom filters). | AlotOfReading wrote: | It's not clear to me that verbatim would be the only issue. | It might produce lines that are similar, but not identical. | | The underlying question is whether the output is a | derivative work of the training set. Sidestepping similar | issues is why GCC and LLVM have compiler exemptions in | their respective licenses. | visarga wrote: | If simple snippet similarity is enough to trigger the GPL | copyright defense, I think it goes too far. Seems like the GPL | has become an obstacle to invention. I learned to run | away when I see it. | radmuzom wrote: | If that's the case then GPL code should not have been | used in the training set. OpenAI should have learned to | run away when they saw it. The GPL is purposely designed | to protect user freedom (it does not care about any | special developer freedom), which is its biggest | advantage. | the_gipsy wrote: | This has nothing to do with the GPL. Copyright is copyright. | You can't even count on public domain everywhere in the | world. | AlotOfReading wrote: | It's not limited to similar or identical code. The issue | applies to anything 'derived' from copyrighted code. The | issue is simply most visible with similar or identical | code. | | If you have code from an independent origin, this issue | doesn't apply. That's how clean-room designs bypass | copyright. Similarly, if the upstream code waives its | copyright in certain types of derived works | (compiler/runtime exemptions), it doesn't apply. | klipt wrote: | So if you work on an open source project and learn some | techniques from it, and then in your day job you use a | similar technique, is that a copyright violation? | | Basically, does reading GPL code pollute your brain and | make it impossible to work for pay later? | | If so, you should only ever read BSD code, not GPL. | throwawayboise wrote: | > Basically, does reading GPL code pollute your brain and | make it impossible to work for pay later?
| | It seems to me that some people believe it does. Some of | the "clean room" projects specifically instructed | developers to not even look at GPL code. I don't have | specific examples at hand. | woah wrote: | Don't come in here with your common sense | outside1234 wrote: | It probably wasn't, because GitHub is treated as a separate | company by Microsoft. | | People literally need to quit Microsoft and join GitHub to | take a role at GitHub. | zxcb1 wrote: | 1. Programmers will become teachers of the co-pilot through IDE | / API feedback. 2. Expect CI-like services for automated | refactoring. | ThrowawayR2 wrote: | > "_'Hey, that's our code you used to replace us!' we will | cry out too late._" | | Are we in the software community not the ones who have | frequently told other industries we have been disrupting to | "adapt or die", along with smug remarks about others acting like | buggy whip makers? Time to live up to our own words... if we | can. | finnthehuman wrote: | > Are we in the software community not the ones who | | No. | | I'll politely clarify that for over a decade I - and | many others - have been asking not to be lumped in with the | lukewarm takes of west coast software bubble asshats. We do | not live there, we do not like them, and I wish they would quit | pretending to speak for us. | | The idea that there is anything approaching a cohesive | software "community" is a con people play on themselves. | brundolf wrote: | I was somewhat worried about that until I saw this: | https://twitter.com/nickjshearer/status/1409902649625956361?... | | I think programming is one of the many domains (including | driving) that will never be totally solved by AI unless/until | it's full AGI. The long tail of contextual understanding and | messy edge cases is intractable otherwise. | | Will that happen one day? Maybe. Will some kinds of labor get | fully automated before then? Probably. But I think the overall | time-scale is longer than it seems. | sillysaurusx wrote: | 64-bit floats should be fine; I think that tweet is only | sort-of correct. | | The problem with storing money in floats is (a) you have to know | how many digits of precision you want (e.g. cents, dollars, a | tenth of a cent), and (b) you need to watch out if you're | adding values together. | | Even if certain values can't be represented exactly, that's | OK, because you'd want to round to two decimal places before | doing anything. | | Is there a monetary value that you can't represent with a | 64-bit float? E.g. some specific example where quantization | ends up throwing off the value by at least 1/100th of | whatever currency you're using? | fredros wrote: | Storing money as float is always a bad decision. Source: | been working for several banks and faced many such bugs.
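| | A toy illustration of the failure mode (plain Python; the
| numbers are just a demonstration, not banking code). A single
| amount usually survives a 64-bit float, as sillysaurusx says,
| but repeated addition drifts:
|
|     from decimal import Decimal
|
|     # 0.10 has no exact binary representation, so a million
|     # additions accumulate error:
|     total = 0.0
|     for _ in range(1_000_000):
|         total += 0.10
|     print(total == 100_000.0)   # False -- off by a tiny fraction
|
|     # Integer cents or Decimal keep the arithmetic exact:
|     print(Decimal("0.10") * 1_000_000 == Decimal("100000.00"))  # True
|
| Rounding before comparisons papers over this, but only if every
| code path remembers to do it; an exact representation removes
| the failure mode entirely.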
| Timwi wrote: | I agree that this is different from humans learning to code from | examples and reproducing some individual snippets. However, I | disagree with the author's argument that it's because of humans' | ability to abstract. We actually know nothing about the AI's | ability to abstract. | | The real difference is that if one human can learn to code from | public sources, then so can anyone else. Nobody is explicitly | barred from accessing the same material. The AI, however, is kept | proprietary. Nobody else can recreate it, because people are | explicitly barred from doing so. People cannot access the source | code of the training algorithm; people cannot access enough | hardware to perform the training; and most people cannot even | access the training data. It may consist of repos that are | technically all publicly available, but try downloading all of | GitHub and see if they let you do that quickly, and/or whether | you have enough disk space. | | This puts the owners of the AI at a significant advantage over | everyone else. I think this is the core of the concern. | oscribinn wrote: | Check out the comments on the original post about GitHub co- | pilot. | | The top one reads just like an ad: | https://news.ycombinator.com/item?id=27676845 | | Some posts that definitely aren't by shills (including the third | one, because I simply don't believe there's a person on the planet | who "can't remember the last time Windows got in my way"): | https://news.ycombinator.com/item?id=27678231 | https://news.ycombinator.com/item?id=27686416 | https://news.ycombinator.com/item?id=27682270 | | A very mild, yet negative, opinion (downvoted quickly): | https://news.ycombinator.com/item?id=27676942 | enriquto wrote: | It certainly seems to be a laundering enabler. Say that you want | to un-GPL-ify some famous copylefted code that is in the training | database. You type the first few innocuous characters of it, then the | co-pilot keeps completing the rest of the exact same code, for it | offers a perfect match. If the completion is not exact, you | "twiddle" it a bit until it is. Bang! You have a non-GPL | copy of the program! Moreover, it is 100% yours and you can re- | license it as you want. This will be a boon for copyleft-allergic | developers! | taneq wrote: | 1) Type a comment like // The following code | implements the functionality of <popular GPL'd library> | | 2) Have the library implemented magically for you | | 3) Delete the top comment if necessary :P | | (It's pretty unlikely that this will actually work, but the | approach could well do.) | freshhawk wrote: | I suppose someone should make an OS-generating AI; conceptually | it can just have Windows, macOS and some Linux distros in it, and | output one based on a question about favorite color or | something. | | You'd just have to wrap it in a nice complex model | representation so it's a black box that you fed example OSes | (with some metadata) into, and that happens to output this very useful | data. | | After all, once you use something as input to a machine | learning model, apparently the license disappears. Sweet. | bogwog wrote: | That would be interesting: | | * Someone leaks Windows 10/11 source code | | * Copilot picks it up in its training data | | * Someone uses Copilot to generate a Windows clone and starts | selling it | | I wonder how Microsoft would react to that. I wonder if | they've manually blacklisted leaked source code from Windows | (or other Microsoft products) so that it doesn't show up in | Copilot's training data. If they have, that means Microsoft | recognizes the IP risks of having your code in that data set, | which would make this Copilot thing not just the result of | poor planning and maybe a little incompetence, but something much | more devious and malicious. | | If Microsoft is going to defend this project, they should | introduce _all_ of their own source code into the training | data. | DemocracyFTW wrote: | > source code | | Why do you think it has to be source code? It could be the | compiled code, after all.
| | If what we're talking / fantasizing about here works on the likes | of `let x = 42`, it should work equally well on `loda | 42` etc., so source code be damned. It was only ever | an intermediate step, inserted between the idea and the | working bits, to enable humans to helpfully interfere. | Dispensable. | treesprite82 wrote: | > Someone uses Copilot to generate a Windows clone | | You could test this with one of Microsoft's products that | is already on GitHub - like VSCode. I doubt you would get | anywhere with just Copilot. | bogwog wrote: | You probably won't get an entire operating system out of | it, but I could totally see a project like Wine using it | to implement missing parts of the Win32 API and improve | their existing implementations. | aj3 wrote: | Come on, there is a huge gap between 1) writing a single | function (potentially incorrectly) with a known | prototype/interface and a description and 2) designing | interfaces, datatypes and APIs themselves. | bogwog wrote: | Why would you need to design anything? Just copy official | Windows headers and use Copilot to implement individual | functions. | | Maybe if the signature matches perfectly, Copilot will | even pull in the exact implementation from the Windows | source code. | methyl wrote: | What stops you from doing the same, without the AI part? | petercooper wrote: | That's what I was wondering. I've never been interested | enough to steal anyone else's code, but with all the code | transformers and processing tools nowadays, I imagine it's | trivial to translate source code into a functionally | equivalent but stylistically unique version? | pjerem wrote: | The question is not whether it's trivial, but whether it is | legal. You can already technically steal GPLv2 code by | obfuscating it. | formerly_proven wrote: | Assuming ML models are causal, then bits of GPL code that | fall out of the model have to have the color GPL, because | the only way they could've gotten there was to train the | ML using GPL-colored bits. It seems to me like the answer | here is pretty obvious: it doesn't really matter how you | copy a work. | Rapzid wrote: | Bits? | shadilay wrote: | Would it be possible to do this in reverse, assuming the AI has | some proprietary code in its training data? | bogwog wrote: | Yes, this is a concern, but I'm not sure if the AI is actually | able to "generate" a non-trivial piece of code. | | If you tell it to generate "a function for calculating the | barycentric coordinates of a ray-triangle intersection", you | might get a working implementation of a popular algorithm, | adapted to your language and existing class/function/variable | names. | | But if you tell it to generate "a smartphone operating system", | it probably won't work... and if it does, it would most likely | use giant chunks of Android's codebase. | | And if that's true, it means that Copilot isn't really | _generating_ anything. It's just a (high-tech) search engine | that knows how to adapt the code it finds to fit your codebase. | That's still a really cool technology and worth exploring, but | it doesn't do enough to justify ignoring software licenses. | treis wrote: | > But if you tell it to generate "a smartphone operating | system", it probably won't work... and if it does, it would | most likely use giant chunks of Android's codebase. | | But since APIs are now unprotected, you could feed it all of | the class structure and method signatures and have it fill in | the blanks. I don't know if that gets you a working operating | system, but it seems like it will get you quite a long way.
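| | Concretely, the scaffolding you would feed it might look
| something like this sketch (all names invented -- think "Win32
| headers, but as Python stubs"):
|
|     # Hand the model structure only: signatures, no bodies.
|     class FileSystem:
|         def create_file(self, path: str, mode: int) -> int: ...
|         def read_file(self, handle: int, count: int) -> bytes: ...
|         def close_handle(self, handle: int) -> bool: ...
|
| Whether the bodies a model fills in are "new" code or someone
| else's is exactly the question this thread is arguing about.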
| saba2008 wrote: | How is it different from just copy-pasting? | | It does add some degree of plausible deniability (an accidental | violation instead of an intentional one), but I don't think it would | matter much. | rlpb wrote: | > Bang! You have a non-GPL copy of the program! Moreover, it is | 100% yours and you can re-license it as you want. This will be | a boon for copyleft-allergic developers! | | Thinking that this would conveniently bypass the fact that your | goal was to copy the code seems to be the most common legal | fallacy amongst software developers. The law will see straight | through you, and you will be found to have infringed copyright. | The reason is well explained in "What Colour are your bits?" | (https://ansuz.sooke.bc.ca/entry/23). | enriquto wrote: | My message was sarcastic. I'm worried about the accidental | conversion of free software into proprietary software. I mean, | "accidental" locally, in each particular instance; but maybe | not accidental in the grand scheme of things. | | EDIT: I can phrase my worry, semi-jokingly, as a conspiracy | theory: Microsoft is using thousands of unsuspecting (and | unwilling) developers to turn a huge copylefted corpus of | algorithms into non-copylefted implementations. Even assuming | that developers who use the co-pilot choose non-copyleft | licenses only 50% of the time, there's still a constant | trickle of un-copyleftization. | alkonaut wrote: | I don't think most of us are scared enough of being "tainted" | by the sight of a GPL snippet that we'd bother. Besides, if you | want to target a specific snippet so that you can type its start to | prime the recognition - you've already seen it, haven't you? | | Why not just copy it and then edit it? If a snippet is changed | both logically and syntactically to not resemble the original, | then it's no longer the original and you aren't in any | licensing trouble. There is no meaningful difference between | that manual washing and a clean-room implementation. All the ML | changes here is accidental vs. deliberate. But it will be a | worse wash than your manual one. | ralph84 wrote: | I get the sense that GitHub _wants_ this to be litigated so the | case law can be established. Until then it's just a bunch of | internet lawyers arguing with each other. | MadAhab wrote: | I got the sense they saw Google beating Oracle over Java in the | Supreme Court and said "We'll be fine, let's move the release up" | pjfin123 wrote: | Why would you want to? For many open source developers, having | models trained on their code would be desirable. | tyingq wrote: | _" We found that about 0.1% of the time, the suggestion may | contain some snippets that are verbatim from the training set"_ | | If it's spitting out verbatim code 0.1% of the time, surely it's | spitting out copied code where only trivial things are different | at a much higher rate. | | Trivial things meaning swapped order where order isn't important, | variable/function names, equivalent ops like +=1 vs ++, etc. | | Surely it's laundering some GPL code, for example, and | effectively removing the license in a way that sounds fishy. | dwheeler wrote: | It's not just the GPL. Almost all open source software licenses | require attribution; without that attribution, any copy is a | license violation. | | Whether or not the result _is_ a license violation is a tricky | legal question. As always, IANAL. | tyingq wrote: | I did say "for example".
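| | To make "trivially different" concrete, here is a toy
| identifier rename using Python's own ast module (Python 3.9+
| for ast.unparse; fine for this snippet, though a real tool
| would have to respect scopes and built-ins):
|
|     import ast
|
|     class Renamer(ast.NodeTransformer):
|         # Map every identifier we meet to v0, v1, ...
|         def __init__(self):
|             self.names = {}
|
|         def _fresh(self, old):
|             return self.names.setdefault(old, f"v{len(self.names)}")
|
|         def visit_Name(self, node):
|             node.id = self._fresh(node.id)
|             return node
|
|         def visit_arg(self, node):
|             node.arg = self._fresh(node.arg)
|             return node
|
|     src = "def add(a, b):\n    total = a + b\n    return total\n"
|     print(ast.unparse(Renamer().visit(ast.parse(src))))
|     # def add(v0, v1):
|     #     v2 = v0 + v1
|     #     return v2
|
| The output is character-for-character different from the input,
| yet nobody would call it independent work -- which is why a
| "0.1% verbatim" figure understates the copying.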
| devetec wrote: | You could say a human is laundering GPL code if they learned | programming from looking at GitHub repositories. Would you, | though? The type of model they use isn't retrieving; it has | learned the syntax and the solutions that are used, just | like a human would. | thrwaeasddsaf wrote: | > You could say a human is laundering GPL code if they | learned programming from looking at GitHub repositories. | | I don't have a photographic memory, so I largely don't memorize | code. I learn general techniques, and memorize simple facts | such as APIs. I can memorize some short snippets of code, but | these probably aren't enough to be copyrightable anyway. | | > The type of model they use isn't retrieving | | How do we know? I think it's very likely that it is largely | just retrieving code that it memorized, and doing minor | adjustments to make the retrieved pieces fit the context. That | wouldn't differ much from finding code that matches the | problem (whether on SO or GitHub), copy-pasting the | interesting bits, and fixing it until it satisfies the | constraints of the surrounding code. | | I think the alternative to retrieving would actually require | a higher-level understanding of the world, and the ability to | reason from first principles; that would be much closer to | AGI. | | For example, if I want to implement a linked list, I'm not | going to retrieve an implementation from memory (although, | given that linked lists are so simple, I probably could). I | know what a linked list is and how it works, and therefore I | can produce working code from scratch... _for any programming | language, even ones for which no prior implementations | exist._ I doubt co-pilot has anything remotely as advanced as | this ability. No, it's fully reliant on just retrieving and | reshaping pieces of memorized code; it needs a large corpus | of code to memorize before it can do anything at all. | | I don't need a large corpus of examples to copy, because I | use my ability to reason in conjunction with some memorized | general techniques and common APIs in order to produce code. | drran wrote: | I have a much simpler AI Copilot, called "cat", which spits out | verbatim code more frequently, but that's OK for me. Can I train | it on M$ code? | rhacker wrote: | I mean, this is already happening. When you hire a specialist in | C# servers, you're copying code that they already wrote. I find | people tend to write the same functions and classes again and | again and again all the time. | | We have a guy who brought over his task manager codebase (he re-wrote | it), but it's the same thing he used at 2 other companies. | | I have written 3 MPIs (master person/patient indexes) at this point, | all with the same fundamental matching engine. | | I mean, one thing we can all agree on is that ML is good at | copying what we already do. | tomcooks wrote: | The number of people who don't know the difference between Open | Source and Free Software is astonishing. Given the number of RMS | memes I see regularly, I would have expected this to be settled by now. | sydthrowaway wrote: | I'm worried about my job. What do I do to prepare? | ostenning wrote: | There are much bigger things in this world to worry about. I | bet you that by the time this AI has taken your job, it'll | have taken many other jobs, completely rearranging entire | industries if not society itself. | | And even once that happens you shouldn't be worried about your | job. Why?
Because economically everything will be different, and | because your job isn't that important; it likely never was. The | problems humanity faces are existential: authoritarianism, | ecosystem collapse, and the mass migration of billions of people. | | So if you really want to "prepare", then try to make a | difference in what actually matters. | cycomanic wrote: | In the discussion yesterday I pointed to the case of some | students suing Turnitin for using their works in the Turnitin | database, and the students lost [1]. I think an individual suing | will not go anywhere. The way to create a precedent is someone | feeding all the Harry Potter books and some additional popular | books (Twilight?) to GPT-3 and letting it write about some kids | at a sorcerers' school. The outcome of that case would look very | different, IMO. | | [1] https://www.plagiarismtoday.com/2008/03/25/iparadigms- | wins-t... | anfelor wrote: | Not a lawyer, but in that case it seemed to be a factor that | Turnitin was transformative, because it never sold the texts to | others and thus didn't reduce their market value. But | that wouldn't apply to Copilot, which might reduce the usage of | libraries, since you can "code" equivalent functionality with | Copilot now. | | Would it be a stretch to assert that GPL'd libraries have a | market value for their creators in terms of reputation etc.? | visarga wrote: | While we're worrying about ML learning to write our code, we | should also break all the automated looms so people don't go | without jobs. Do everything manually like God intended! /s | | Maybe code that is easily recreated by GPT with a simple | prompt is not worth copyrighting. The future is in making it | more automated, not in protecting IP. If you compete against a | company using it, you can't ignore the advantage. | shawnz wrote: | Disney's intellectual property would be a good choice for this | exercise | intricatedetail wrote: | Suing will not go anywhere because Microsoft has billions at | their disposal to defend any case. | warpech wrote: | If GitHub Copilot can sign my CLA, stating that it is the author | of the work, that it transfers the IP to me in exchange for the | service subscription price, and that it takes responsibility for copyright | infringement, that would be acceptable. Otherwise it's a gray | area I don't want to get into. ___________________________________________________________________ (page generated 2021-06-30 23:01 UTC)