[HN Gopher] In the LLM space, "open source" is being used to mea... ___________________________________________________________________ In the LLM space, "open source" is being used to mean "downloadable weights" Author : FanaHOVA Score : 289 points Date : 2023-07-21 15:49 UTC (7 hours ago) (HTM) web link (www.alessiofanelli.com) (TXT) w3m dump (www.alessiofanelli.com)
| spullara wrote: | It remains to be seen in court whether weights are even | copyrightable, potentially making all the various licenses and | their restrictions moot.
| humanistbot wrote: | And it also remains to be seen if various legislatures will | pass laws that explicitly declare the copyright status of model | weights. It is important to remember that what is or is not | copyrightable can change.
| rvcdbn wrote: | At least in the US, copyright is established by the | constitution, so I'm not sure how much it's possible to change via | the normal legislative process.
| gpm wrote: | The US constitution grants Congress the ability to create | copyright ("To promote the progress of science and useful | arts, by securing for limited times to authors and | inventors the exclusive right to their respective writings | and discoveries"), but it doesn't create copyright law | itself. That's a broad clause that gives Congress pretty | free rein to change how copyright is defined.
| rvcdbn wrote: | Constitutionality is also about how previous cases have | been evaluated; for example, see the bit about how | photography copyright was established here: | https://constitution.congress.gov/browse/essay/artI-S8-C8-3-...
| rvcdbn wrote: | specifically: | | > A century later, in Feist Publications v. Rural | Telephone Service Co., the Supreme Court confirmed that | originality is a constitutional requirement
| ljdcfsafsa wrote: | 1. Why wouldn't they be, and 2. does that even matter? If you | enter into a contract saying don't do X, and you do X, you're | violating the contract.
| sebzim4500 wrote: | I assume GP was talking about a scenario in which you had not | entered into a contract with Meta. E.g. if I just downloaded | the weights from someone else.
| rvcdbn wrote: | 1 - because they lack originality, see: | https://constitution.congress.gov/browse/essay/artI-S8-C8-3-...
| dvdkon wrote: | In a similar vein, the common "you may not use this model's | output to improve another model" clause is AFAIK unenforceable | under copyright, so it's _at best_ a contractual clause binding | a particular user. Anyone using that improved model afterward | is in the clear.
| ljdcfsafsa wrote: | > it's at best a contractual clause binding a particular | user. Anyone using that improved model afterward is in the | clear. | | That's... not really accurate. See the concept of tortious | interference with a contract.
| dvdkon wrote: | Hm, I don't know much about common law, but I don't think | this would apply if, say, an ML enthusiast trained a model | from LLaMA2 outputs, made it freely available, then someone | else commercialised it. The later user never caused the | original developer to breach any contract, they simply | profited from an existing breach. | | That said, doing this inside one company or with | subsidiaries probably wouldn't fly.
| taneq wrote: | And of course anyone using a model improved by this is | entirely unworried by these clauses if their improved model | takes off hard.
| banana_feather wrote: | The idea is that if you violate the terms of the license to | develop your own model, you lose your rights under the | license and are creating an infringing derivative work. If I | clone a GPL'd work and ship a derivative work under a | commercial license, downstream users can't just integrate the | derivative work into a product without abiding by the GPL | terms and say "well we're downstream relative to the party | who actually copied the GPL'd work, so the GPL terms don't | apply to us".
| dvdkon wrote: | Thing is, the outputs of a computer program aren't | copyrightable, so it doesn't matter if your improved model | is a derivative work. What you say would apply if you | derived something from the weights themselves (assuming | they are copyrightable, of course).
| blendergeek wrote: | If such a "derivative" model is a derivative work, then | aren't all these LLMs just mass copyright infringement?
| banana_feather wrote: | At the end of the day it's not black and white, but | there's a large and obvious difference in degree that | would plausibly permit someone to find that one is and | the other isn't. It's fairly easy to argue that using the | outputs of LLM X to create a slightly more refined LLM Y | creates a derivative work. The argument that a model is a | derivative work relative to the training data is not so | clear cut.
| dragonwriter wrote: | If model weights aren't copyrightable, derivative model | weights are not a "work", derivative or otherwise, for | copyright purposes. | | If they are, and the license allows creating finetuned | models but not using the output to improve the model, | then the derived model is not a violation, but it might | be a derivative work.
| dTal wrote: | Exactly this. What's good for the goose is good for the | gander!
| rodoxcasta wrote: | If the weights are not copyrightable, you don't need a | licence to use them; they are just data. There's no | right to infringe if these numbers have no author. Of | course, to use the OpenAI API you must abide by their terms. | But if you publish your generations and I download them, I | have nothing to do with the contract you have with OpenAI, | since I'm not a party to it. They can't stop me from using | the output to improve my models.
| diffeomorphism wrote: | Really? | | Your customers bought that product under license A. | Afterwards it turned out that you pirated some artwork from | Disney. Then your customer can sue you (not Disney) to make | things right.
The specific license of the original work | seems quite irrelevant here.
| pessimizer wrote: | Not at all. The reason your customer can sue you is | because Disney can sue your customer. Disney would be | suing your customer under the specific license of the | original work. | | edit: you seem to see the customer as the primary victim | here instead of Disney, but if Disney weren't a victim | the customer wouldn't have a case.
| stale2002 wrote: | No, because the premise of the hypothetical is that the | weights aren't protected by copyright. | | So, no matter what their TOS says, it's not an infringing | work. | | > Downstream users can't just integrate the derivative work | into a product without abiding by the GPL terms | | You absolutely could do this if the original work is not | protected by copyright, or if you use it in a way that is | transformative and fair use.
| mattl wrote: | Something under the GPL is also copyrighted. The GPL is a | copyright license.
| stale2002 wrote: | If the underlying work is not protected by copyright, it | doesn't matter what license someone tries to put on it. | | Similarly, if someone creates a fair use/transformative | work then the license can also be ignored.
| FanaHOVA wrote: | Yep, same with SSPL. GPL has been tested in FSF vs Cisco | (2008), but none of the more restrictive licenses have.
| jrockway wrote: | It seems like a dangerous clause to me. | | 1) "Dear artists, the model cannot infringe upon your copyright | because it's merely learning like a human does. If it | accidentally outputs parts of your book, you know, it just | accidentally plagiarized. We all do it haha! Our attorneys | remind you that plagiarism is not illegal in the US." | | 2) "Dear engineers, the output of our model is copyrighted and | thus if you use it to train your own model, we own it." | | I am not sure how both of those can be true at the same time.
| jimmaswell wrote: | We all truly do "accidentally plagiarize", especially | artists.
Many guitarists realize they accidentally copied a | riff they thought they'd come up with on their own for | example. | jrockway wrote: | I, for one, welcome our new plagiarism overlords. | | Oops. | | I added the "haha" in there because the probability of a | human doing this kind of goes way down as the length of the | text increases. Can you type, verbatim, an entire chapter | of a book? I can't. But, I bet the AI can be convinced in | rare cases to do that. | | The whole thing is very interesting to me. There was an | article on here a couple days ago about using gzip as a | language model. Of course, gzipping a book doesn't remove | the copyright. So how low does the probability of | outputting the input verbatim have to be before copyright | is lost? | | Reading the book and benefitting from what you learned? | Obviously not copyright infringement. Putting the book into | gzip and sending your friend the result? Obviously | copyright infringement. Now we're in the grey area and ... | nobody knows what the law is, or honestly, even how to | reason about what the law wants here. Fun times. | | (Personally, I lean towards "not copyright infringement", | but I'm not a big believer in copyright myself. In the case | of AI training, it just makes it impossible for small | actors to compete. Google can just buy a license from every | book distributor. SmolStartup can't. So if we want to make | AI that is only for the rich and powerful, copyright is the | perfect tool to enable that. I don't think we want that, | though. | | My take is that the rest of society kind of hates Tech | right now ("I don't really like my Facebook friends, so | someone should take away Mark Zuckerberg's money."), so | it's likely that protectionist laws will soon be created | that ruin it for everyone. The net effect of that is that | Europe and the US will simply flat-out lose to China, which | doesn't care about IP.) 
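jrockway's gzip point can be made concrete: compression gives a crude similarity measure, and a verbatim copy of a text is nearly "free" to compress against the original, while unrelated text is not. A minimal sketch using normalized compression distance (the texts here are made up for illustration):

```python
import zlib

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance: roughly, how much does
    knowing one text help compress the other? Lower = more similar."""
    ca = len(zlib.compress(a))
    cb = len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

original = b"It was the best of times, it was the worst of times." * 20
verbatim = original  # a literal copy of the "book"
unrelated = b"Colorless green ideas sleep furiously." * 20

# A verbatim copy compresses almost for free against the original,
# while unrelated text does not; the grey area is everything between.
print(ncd(original, verbatim) < ncd(original, unrelated))  # True
```

This is the same machinery behind the "gzip as a language model" article the comment mentions; it illustrates why "how verbatim is the output" is a matter of degree rather than a clean legal line.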
| spullara wrote: | There are people who can type, verbatim, entire | chapters of books.
| Der_Einzige wrote: | The overwhelming majority of all human advancement is in | the form of interpolation. Real extrapolation is extremely | rare and most don't even know when it's happening. This is | why it's extremely hypocritical for artists of any sort to | be upset about Generative AI. Their own minds are doing the | same exact thing they get upset about the model doing. | | This is why fundamental "interpolative" techniques like | ChatGPT (whose weights are in theory frozen) are still | basically super-intelligent.
| polotics wrote: | Wow you appear to know a great deal about how human minds | work: "doing the same exact thing they get upset about | the model doing"... May I ask you to put up a list of | publications on the subject of how minds work?
| Der_Einzige wrote: | My insights are widely accepted theories from various | fields, all available in the public domain. | | It's a well-understood concept that our minds function by | making sense of the world through patterns. This is the | essence of interpolation - taking two known points and | making an educated guess about what lies in between. Ever | caught yourself finishing someone's sentence in your mind | before they do? That's your brain extrapolating based on | previous patterns of speech and context. These processes | are at the heart of human creativity. | | The field of Cognitive Science has extensively documented | our tendency for interpolation and pattern recognition. | Works like The Handbook of Imagination and Mental | Simulation by Markman and Klein, or even "How Creativity | Works in the Brain" by the National Endowment for the | Arts all attest to this. | | When artists create, they draw from their experiences, | their knowledge, their understanding of the world - a | process overwhelmingly of interpolation.
| | Now, I can see how you might be confused about my | reference to ChatGPT being "super-intelligent". Perhaps | "hyper-competent" would be more appropriate? It has the | ability to generate text that appears intelligent because | it's interpolating from a massive amount of data - far | more than any human could consciously process. It's the | ultimate pattern finder. | | And that, my friend, is my version of "publications on | the subject of how minds work." I may not be an | illustrious scholar, but hey, even a stopped clock is right twice | a day! And who knows, maybe I'm on to something after | all.
| saghm wrote: | There was a famous case where John Fogerty (formerly of | Creedence Clearwater Revival) ended up getting sued by | CCR's record label, claiming a later solo song he did with | a different label was too similar to a CCR song that he | wrote, and they won. So legally speaking, you can even get | in trouble for coming up with the same thing twice if you don't | own the copyright of the first one.
| rcxdude wrote: | The copyright situation with music is kinda broken, | different parts of the performance get quite different | priority when it comes to copyright (many core elements | of a performance get basically no protection, whereas the | threshold for what counts as a protectable melody is | absurdly low). Especially this means it's less than | worthless for some genres/traditions: for jazz and blues, | especially, a huge part of the genre and culture is | adapting and playing with a shared language of common | riffs.
| luma wrote: | 2) doesn't line up with US courts' current stance that | only a human can hold copyright, and thus anything created by | a non-human cannot have copyright applied. This applies to | animals, inanimate objects, and presumably, AI. | | I have no idea how this impacts the enforceability of the | license from FB, which may rely on things other than | copyright, but as of right now, the output absolutely cannot | be copyrighted.
| jrockway wrote: | That's an extremely good point. The output of software is | never copyrightable. What makes language models not | software?
| danielbln wrote: | Isn't Photoshop software?
| pessimizer wrote: | Photoshop's output has been completely guided (until | recent additions) by a human who can hold a copyright. | | That being said, isn't a prompt guidance?
| sangnoir wrote: | If they are not copyrightable, that'll be the end of publicly-released weights by for-profit companies. All subsequent models | will be served behind an API.
| dragonwriter wrote: | > If they are not copyrightable, that'll be the end of | publicly-released weights by for-profit companies | | I don't see why; for-profit companies release permissively-licensed open-source code all the time, and noncopyrightable | models aren't practically much different than that.
| bilbo0s wrote: | Because the courts will have determined their business | models for them. | | As mercenary as it may sound, what these companies are | trying to do is find a business model that is as friendly | to themselves as it is hostile to their competitors. | | This is all part of the jockeying.
| dragonwriter wrote: | And, sure, lack of copyrightability changes the | parameters and will change behavior. What I think you | have failed to support is that the _particular_ change | that it will induce will eliminate all such releases.
| sangnoir wrote: | I debated whether to be more specific and verbose in my | earlier comment and brevity won at the expense of clarity. | I meant large models that cost 6 or 7 digits to train | likely won't be released if the donor company can't control | how the models are used. | | > I don't see why, for-profit companies release | permissively-licensed open-source code all the time | | I agree with this - however, they tend to open-source non-core components - Google won't release search engine code, | Amazon won't release scalable-virtualization-in-a-box, etc.
| | I'm confident that Facebook won't release a hypothetical | Llama 5 in a manner that enables it to be used to improve | ChatGPT 8 - the aim will be unchanged from today, but the | mechanism will shift from licensing to rate-limiting, | authentication & IP-bans.
| weinzierl wrote: | I find the idea that weights are not copyrightable very | fascinating - appealing even. I have a hard time imagining a | world where this is the case, though. | | Can you summarize why weights would not be copyrightable, or | give me pointers to sources that support that view?
| cbm-vic-20 wrote: | An analog to this might be the settings of knobs and switches | for an audio synthesizer, or guitar effects settings. If you | wanted to get the "Led Zeppelin sound" from a guitar, you | could take a picture of the knobs on the various pedals and | their configuration, and replicate that yourself. You then | create a new song that uses those settings. Is that something | that is allowed under copyright? | | What if there were billions of knobs, tuned after years of | feedback and observations of the sound output?
| paxys wrote: | I don't think that's a good analogy. A piano has N keys. | You can press certain ones in certain combinations and | write it down. That result is still copyrightable, because | you can prove that it was an original and creative work. | Setting knobs for a machine is no different, but the key | differentiator is if you did it yourself or if an algorithm | did it for you.
| cbm-vic-20 wrote: | In my analogy, it's not the sequence of the notes or the | composition, which I agree is copyrightable. But are the | settings of the knobs and switches on synthesizers and | effects devices used in a recording equivalent to the | weights of a neural network or LLM? And if so, are those | settings or weights copyrightable?
| rvcdbn wrote: | That's a bad analogy because a human chose the values of | those settings using their creative mind.
That's not at all | the case with weights. This originality is the heart of | copyright law.
| slimsag wrote: | Speculating (I am not a lawyer) I see two options: | | 1. Model weights are the output of mathematical principles; | in the US facts are not copyrightable, so in general math is | not copyrightable. | | 2. Model weights are the derivative work of all the copyrighted | works they were trained on - in which case, it would be similar | to creating a new picture which contains every other picture | in the world inside of it. Who is the copyright owner? Well, | everyone, since it includes so many other copyright holders' | works in it.
| humanistbot wrote: | Your second argument, if true, disproves your first | argument.
| slimsag wrote: | Doesn't matter. A court decides in the end, and the two | choices I presented could lead to OP's scenario. If a | court decides that, they decide that, period. I'm not | 'making an argument' with those points - I'm presenting | options a court might choose from when setting precedent.
| FishInTheWater wrote: | Remember that database rights are a thing. | | One cannot copyright facts, but one can "copyright" a | collection of facts like a search index or a map.
| earleybird wrote: | Your second question asks: "Who owns the Infinite | Library[0]?" | | Related: there was a presentation (I've lost the reference) | on automatic song (tune?) generation where the presenter | claimed (rather humorously) that he'd generated all the | songs that had ever been and will ever be, so that while he | was infringing on a large but finite number of songs, he | was not infringing on an infinite number of future songs. | So, on balance he was in a favourable position. | | [0] https://en.wikipedia.org/wiki/The_Library_of_Babel
| sebzim4500 wrote: | Generally the output of a machine is not copyrightable. | Similarly, the contents of a phone book are not copyrightable | in the US even if the formatting/layout is.
So I could take a | phonebook and publish another one with identical phone | numbers as long as I laid it out slightly differently.
| xxpor wrote: | Work also has to be "creative" in order for it to be | eligible for copyright. This is why photomasks have | special, explicit protection in US law; they're not really | "creative" in that way. | | https://en.wikipedia.org/wiki/Integrated_circuit_layout_desi...
| cal85 wrote: | What about compiled binaries? If I write my own original | source code (and thus automatically own the copyright to | it), and compile it to binary, is the binary not protected | too?
| sebzim4500 wrote: | No, because the input to that process was a bunch of | work that you did. | | In the case of an LLM, I don't think the work of | compiling the training data would qualify, by | analogy to the phonebook example.
| humanistbot wrote: | By that logic, if you convert a copyrighted song or movie | from one codec to another, then that would not be | copyrightable because it is the output of a machine.
| xxpor wrote: | The song itself isn't output by the machine.
| humanistbot wrote: | Neither was the original training data, which was | copyrighted books, art, etc.
| dragonwriter wrote: | > Neither was the original training data, which was | copyrighted books, art, etc. | | If the original training data is a copyrightable | (derivative or not) work, perhaps eligible for a | compilation copyright, the model weights might be a form | of lossy mechanical copy of _that_ work, and be both | subject to its copyright and an infringing unauthorized | derivative if it is.
| | If it's not, then I think even before fair use is | considered the only violation would be the weights | potentially infringing copyrights on original works, but | I don't think an _incomplete_ copy automatically works for | them the way it would for an aggregate; I'd think you'd | have to demonstrate reproduction of the creative elements | protected by copyright from _individual_ source works to | make the claim that it infringed them.
| xxpor wrote: | The _output_ of the training though is unrecognizable.
| SideburnsOfDoom wrote: | Sometimes, the output is a recognisable plagiarism of a | specific input. | | If it isn't recognisable, then it's merely _distributed_ | plagiarism. A million outputs, each of which is 0.0001% | plagiarised from each of a million inputs.
| dragonwriter wrote: | It _isn't_ independently copyrightable. | | It's a mechanical copy subject to the copyright on the | original, though.
| danShumway wrote: | Correct that it would not be copyrightable, but you're | missing the point. | | A codec conversion is not copyrightable. The original | _song_, which is still present enough in the conversion to | impact its ability to be distributed, is still | copyrightable. But you don't get some kind of new | copyright just because you did a conversion. | | For comparison, if you take a public domain book off of | Gutenberg and convert it from an EPUB to a KEPUB, you | don't suddenly own a copyright on the result. You can't | prevent someone else from later converting that EPUB to a | KEPUB again. Copyright protects creative decisions, not | mathematical operations. | | So if there is a copyright to be held on model weights, | that copyright would be downstream of a creative decision | -- ie, which data was it trained on and who owned the | copyright of the data.
However, this creates a weird | problem -- if we're saying that the artifact of | performing a mathematical operation on a series of inputs | is still covered by the copyright of the components of | that database, then it's somewhat tricky to argue that | the creative decision of what to include in that database | should be covered by copyright but that copyrights of the | actual content in that database don't matter. | | Or to put it more simply, if the database copyright | status impacts models, then that's kind of a problem | because most of the content of that training database is | unlicensed 3rd party data that is itself copyrighted. It | would absolutely be copyright infringement for | OpenAI/Meta to distribute its training dataset | unmodified. | | AI companies are kind of trying to have their cake and | eat it too. They want to say that model weights are | transformed to such a degree that the original copyright | of the database doesn't matter -- ie, it doesn't matter | that the model was trained on copyrighted work. But they | also want to claim that the database copyright does | matter, that because the model was trained on a | collection where the decision of what to include in that | collection was covered by copyright, therefore the model | weights are copyrightable. | | Well, which is it? If model weights are just a | transformation of a database and the original copyrights | still apply, then we need to have a conversation about | the amount of copyrighted material that's in that | database. If the copyright status of the database doesn't | matter and the resulting output is something new, then | no, running code on a GPU is not enough to grant you | copyright and never really has been. Copyright does not | protect algorithmic output, it protects human creative | decisions. 
| | Notably, even if the copyright of the database was enough | to add copyright to the final weights and even if we | ignore that this would imply that the models themselves | are committing copyright infringement in regards to the | original data/artwork -- even in the best case scenario | for AI companies, that doesn't mean the weights are fully | protected because the only copyright a company can claim | is based on the decision of what data they chose to | include in the training set. | | A phone book is covered by copyright if there are | creative decisions about how that phone book was | compiled. The numbers within the phone book are not. | Factual information can not be copyrighted. Factual | observations can not be copyrighted. So we have to ask | the same question about model weights -- are individual | model weights an artistic expression or are they a fact | derived from a database that are used to produce an | output? If they're not individually an artistic | expression, well... it's not really copyright | infringement to use a phone book as a data reference to | build another phone book. | paxys wrote: | It's a complicated question and I don't think anyone can give | a clear yes or no answer before some court has ruled on it. | One school of thought is that copyright is designed to | protect original works of creativity, but weights are | generated by an algorithm and not direct human expression. AI | generated art, for example, has already been ruled ineligible | for copyright. | rvcdbn wrote: | I have a hard time imagining a world where it is not the case | at least in the US i.e. where copyright is extended to a work | with no originality in direct contradiction to copyright | clause in the constitution. | bilbo0s wrote: | It's all kind of irrelevant. If they are not copyrightable, | then most companies will simply hide them behind an API. | There is no law saying these companies _must_ release their | weights. 
The companies are releasing their weights because | they felt they could charge for and control other things. | Like the output from their models. | | If they can't charge for and control those other things, | then we'll likely see far fewer companies releasing | weights. Most of this stuff will move behind APIs in that | scenario. | rvcdbn wrote: | Maybe, maybe not. Companies are not monoliths. For all we | know, internally it's already well known that model | weights likely aren't copyrightable and the only reason | for the restrictions is to give the appearance of being | responsible to appease the AI doomers. | appplication wrote: | Let's take a simple linear regression model with a handful of | parameters. The weights could be an array of maybe 5 numbers. | Should that be copyrightable? What if someone else uses the | same data sources (e.g. OSS data sets) and architecture and | arrives at the same weights? Is this a Copyright violation? | | Let's talk about more complex models. What if my model shares | 5% of the same weights with your model? What about 50%? What | about 99%? How much do these have to change before you're in | the clear? What if I take your exact model and run it through | some extra layers that don't do anything, but dilute the | significance of your weights? | | It's a murky area, and I'm inclined to think copyright is not | at all the right tool to handle the legality of these models | (especially given the glaring irony they are almost all | trained using copyrighted material). Patents, perhaps better | suited, but I'm also not sold. | paulmd wrote: | > While it's mostly open, there are caveats such as you can't use | the model commercially if you had more than 700M MAUs as of the | release date, and you also cannot use the model output to train | another large language model. These types of restrictions don't | play well with the open source ethos | | No, CC-NC-ND is a thing, and even GPL applies restrictions on | derivation as well. 
| | "Open source" doesn't mean BSD/MIT. There is even open-source | that you cannot freely redistribute at all - not all open-source | is FOSS! | | I always think it's a testament to how much copyleft has | succeeded that in many cases people think of GPL and BSD/MIT as | being the baseline. | Taek wrote: | I didn't realize that the llama license forbids you from using | its outputs to train other models. That's essentially a | dealbreaker, synthetic data is going to be the most important | type of training data from here on out. Any model that prohibits | use of synthetic data to train new models is crippled. | Der_Einzige wrote: | It's exactly the opposite. We have better ways to combine the | knowledge of several models together than sampling them. (i.e. | mixture of experts, model merges, etc) Relying on synthetic | data from one LLM to train another LLM is in general a terrible | idea and will lead to a race to the bottom. | zarzavat wrote: | A contract ordinarily has to have consideration. Since LLaMa | weights are not copyrightable by Meta and are freely available, | what exactly is the consideration? The bandwidth they provide? | SanderNL wrote: | Good luck enforcing that, though. How would they ever know? | denlekke wrote: | i wonder if they could include some marker prompt and | response that wouldn't occur "naturally" from any other model | or training data | ortusdux wrote: | https://en.wikipedia.org/wiki/Trap_street | nsplayer wrote: | They could have picked up the LLM equivalent from LLM | generated posts online however. How do you prove they | didn't? 
| denlekke wrote: | as a layman, i imagine for someone at the scale required | it may not be worth the risk or the added effort vs | paying or using a different model but it'd be funny if we | see companies creating a subsidiary that just acts as a | web-passthrough to "legalize" llama2 output as training | data | mcny wrote: | Level1Techs "link show" (because we can't call it news | anymore) kind of touched this topic. I would like to read | what you guys make of this: | | > Supreme Court rejects Genius lawsuit claiming Google | stole song lyrics SCOTUS won't overturn ruling that US | copyright law preempts Genius' claim. | | > The song lyrics website Genius' allegations that Google | "stole" its work in violation of a contract will not be | heard by the US Supreme Court. The top US court denied | Genius' petition for certiorari in an order list issued | today, leaving in place lower-court rulings that went in | Google's favor. | | > Genius previously lost rulings in US District Court for | the Eastern District of New York and the US Court of | Appeals for the 2nd Circuit. In August 2020, US District | Judge Margo Brodie ruled that Genius' claim is preempted | by the US Copyright Act. The appeals court upheld the | ruling in March 2022. | | > "Plaintiff's argument is, in essence, that it has | created a derivative work of the original lyrics in | applying its own labor and resources to transcribe the | lyrics, and thus, retains some ownership over and has | rights in the transcriptions distinct from the exclusive | rights of the copyright owners... Plaintiff likely makes | this argument without explicitly referring to the lyrics | transcriptions as derivative works because the case law | is clear that only the original copyright owner has | exclusive rights to authorize derivative works," Brodie | wrote in the August 2020 ruling. | | > Google search results routinely display song lyrics via | the service LyricFind. 
Genius alleged that LyricFind | copied Genius transcriptions and licensed them to Google. | | > Brodie found that Genius' claim must fail even if one | accepts the argument that it "added a separate and | distinct value to the lyrics by transcribing them such | that the lyrics are essentially derivative works." Since | Genius "does not allege that it received an assignment of | the copyright owners' rights in the lyrics displayed on | its website, Plaintiff's claim is preempted by the | Copyright Act because, at its core, it is a claim that | Defendants created an unauthorized reproduction of | Plaintiff's derivative work, which is itself conduct that | violates an exclusive right of the copyright owner under | federal copyright law," Brodie wrote. | | https://arstechnica.com/tech-policy/2023/06/supreme- | court-re... | rcxdude wrote: | The basic idea is whether an unauthorised derivative work | is itself entitled to copyright protection: could the | creator of the derivative work prevent copying by the | original creator (or anyone else) of the work on which it | is based, even though they themselves have no permission | to distribute it? (if the work is authorised, this is | generally considered to be the case). It looks like from | this the conclusion is 'no', at the very least in this | case. I'm not sure this matches most people's moral | intuitions: every now and again a big company includes | some fan art in their own official release without | permission (usually not as a result of a general policy, | but because of someone getting lazy and the rest of the | system failing to catch it), and generally speaking the | reaction is negative. | joshuaissac wrote: | > whether an unauthorised derivative work is itself | entitled to copyright protection | | That is not what this court case was about. 
Genius had | already settled the case of unauthorised transcriptions | and had bought licences for its lyrics after a lawsuit in | 2014, so its own work was no longer unauthorised. In the | case cited above, Genius was trying to enforce its claims | against Google via contract law rather than copyright | law. The court ruled that the alleged violations were | covered by copyright law, so they could only be pursued via | copyright law, and that only the copyright holder (or | assignee) of the lyrics that were copied could sue Google | under it. | criddell wrote: | Disgruntled current or former employee turning in their | employer for the reward? That's how Microsoft and the BSA | used to bust people before the days of always-online | software. | moffkalast wrote: | I'm not sure why anyone would even do that in the first place, | LLaMA doesn't generate synthetic data that would be even | remotely good enough. Even GPT 3.5 and 4 are already very | borderline for it, with lots of wrong and censored answers. And | at best you make a model that's as good as LLaMA is, i.e. not | very. | jstarfish wrote: | Instruction-tuning is the obvious use case. That much has | nothing to do with subjectivity, alignment or censorship, | it's will-you-actually-show-this-as-JSON-if-asked. | moffkalast wrote: | That's tuning llama, which is allowed from what I | understand. Otherwise why release it at all? It's not very | functional in its initial state anyway. What that applies | to is using llama outputs to train a completely new base | model, which makes no practical sense. | | As for generating JSON, that's more of an inference-runtime | thing, since you need to pick the top tokens that result in | valid JSON instead of just hoping it returns something that | can be parsed. On top of extensive tuning, of course.
Most of the | discussion I've seen has been about how to avoid accidentally | using LLM-generated data. | lmeyerov wrote: | Tuning a tiny classifier | dheera wrote: | > forbids you from using its outputs to train other models. | | I don't know how one can even forbid this. As a human, I'm a | walking neural net, and I train myself on everything that I | see, without a choice. The only difference is I'm a | carbon-based neural net. | 6gvONxR4sf7o wrote: | It's hilarious that big players in this space seem to think | these are consistent views: | | - It's okay to train a model on arbitrary internet data without | permission/license just because you can access it | | - It's not okay to train a model on our model | realusername wrote: | Yes, they have to pick one or the other. Until then I'm going | to assume that the model licence doesn't apply, since the | first point would be invalid and the model could not be built | in the first place. | lhnz wrote: | It tells you that they think their moat is data | quality/quantity. | torstenvl wrote: | Those are perfectly consistent, despite what ideologically | driven people may want to believe. | | Copyright is literally the right to copy. Arbitrary Internet | data that is not _copied_ does not have any copyright | implications. | | The difference is that LLaMA imposes additional contractual | obligations that, for ideological reasons (Freedom #0), open | source software does not. | | This issue reminds me of the FSF/AGPL situation. At some | point you just have to accept that copyright law, in and of | itself, is not sufficient to control what people _do_ with | your software. If you want to do that, you have to limit | end-user freedom with an EULA. | | If someone uses LLaMA output to train models, it is unlikely | they will be sued for copyright infringement. It is far more | likely they will be sued for breach of contract.
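moffkalast's earlier point about generating valid JSON at inference time — filtering the model's proposed next tokens down to those that can still extend into parseable JSON, rather than hoping the output parses — can be sketched with a toy vocabulary. Everything here is invented for illustration: the per-step token proposals stand in for a real model's ranked logits, and the validity check only tries the closers a one-field schema needs.

```python
import json

def can_extend_to_json(prefix, token):
    # A token is allowed only if prefix+token can still be completed
    # into valid JSON (we only try the closers our toy schema needs).
    candidate = prefix + token
    for closer in ('', '"}'):
        try:
            json.loads(candidate + closer)
            return True
        except ValueError:
            pass
    return False

# Per-step token proposals from a pretend model, best-scoring first.
proposals = [
    ['yes', '{"answer": "'],   # step 1: bare "yes" is not valid JSON, so it's masked
    ['oops" }{', 'yes'],       # step 2: the malformed token is masked
    ['oops" }{', '"}'],        # step 3: only the closing token survives
]

prefix = ""
for candidates in proposals:
    # Greedily take the highest-ranked token that keeps the output reparable.
    prefix += next(t for t in candidates if can_extend_to_json(prefix, t))

print(prefix)  # {"answer": "yes"}
```

Real implementations of this idea (e.g. llama.cpp's GBNF grammars) apply the same kind of mask directly to the logits over the full vocabulary, so the constraint costs almost nothing per step instead of re-parsing the prefix.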
| danShumway wrote: | > Arbitrary Internet data that is not copied does not have | any copyright implications. | | Training a model on model output isn't copying. | | There's no way to phrase this where training a model on | copyrighted _human_-generated images/text isn't copying, | but training a model on _computer_-generated images/text | is copying. | | > If you want to do that, you have to limit end-user | freedom with an EULA. | | If you want to limit end-user freedom with a EULA, you have | to figure out how to get users to sign it. Copyright is one | way to force them to do so, but doesn't really seem | relevant to this situation if training a model on | copyrighted material is fair use. | | And again, if somebody generates a giant dataset with | LLaMA, if you want to argue that pushing that into another | LLM to train with is making a copy of that data, then | there's no way to get around the implication there that | training on a human-generated image is also making a copy | of that image. | [deleted] | [deleted] | torstenvl wrote: | > _Training a model on model output isn't copying._ | | That's literally what I said. | | > _There's no way to phrase this where training a model | on copyrighted human-generated images/text isn't copying, | but training a model on computer-generated images/text is | copying._ | | Literally nobody is saying that. | | > _If you want to limit end-user freedom with a EULA, you | have to figure out how to get users to sign it._ | | That is not true. ProCD v. Zeidenberg, 86 F.3d 1447 (7th | Cir. 1996). | | You and others seem to have an over-the-top hostile | reaction to the idea that contract law can do things | copyright law cannot do. But it is objective and | unarguable fact. | danShumway wrote: | > Literally nobody is saying that. | | Okay? Apologies for making that assumption. But if you're | not saying that, then your position here is even less | defensible.
Arguing that model output isn't copyrightable | but that it's still covered by EULA if anyone anywhere | tries to use it is even more absurd than arguing that | it's covered by copyright. The interpretation that this | is covered by copyright is arguably the charitable | interpretation of what you wrote. | | > That is not true. ProCD v. Zeidenberg, 86 F.3d 1447 | (7th Cir. 1996). | | ProCD is about shrinkwrap licenses, the court determined | that buying the software and installing it was the | equivalent of agreeing to the license. | | In no way does that imply that licenses are enforceable | on people who never agreed to the licenses. The court | expanded what counts as agreement, it does not mean you | don't have to get people to agree to the EULA. I mean, | take pedantic issue with the word "sign" if you want | (sure, other types of agreement exist, you're correct), | but the basic point is still true -- if you want to | restrict people with a EULA, they need to actually agree | to the EULA. | | And if you don't have IP law as a way to block access to | your stuff, then you don't really have a way to force | people to agree to the EULA. Someone using LLaMA output | to train a model may have never been in a position to | agree to that EULA, and Facebook doesn't have the legal | ability to say "hey, nobody can use output without | agreeing to this" because they don't have copyright over | that output. Can they get people to sign a EULA before | downloading the weights from them? Sure. Is that enough | to restrict everyone else who didn't download those | weights? No. | | To go a step further, if you don't believe that weights | themselves are copyrightable, then putting a EULA in | front of them is even less effective because people can | just download the weights from someone else other than | Facebook. | | You can host a project Gutenberg book and get people to | sign a EULA before they download it from you, even though | you don't own the copyright. 
And that EULA would be | binding, yes. But you cannot host a project Gutenberg | book, put a EULA in front of it, and then claim that | people who _don't_ download it from you and instead just | grab it off of a mirror are still bound by that EULA. | | Your ability to control access is what gives you the | ability to force people to sign the EULA. And that's kind | of dependent on IP law. If someone sticks the LLaMA 2.0 | weights on a P2P site, and those weights aren't covered | by copyright, then no, under no interpretation of US law | would downloading those weights from a 3rd-party source | constitute an agreement with Facebook. | | But even if you don't take that position, even if you | assume that model weights are copyrightable, if I | download a dataset generated by LLaMA, there is still no | shrinkwrap license on that data. | | To your original point: | | > If someone uses LLaMa output to train models, it is | unlikely they will be sued for copyright infringement. It | is far more likely they will be sued for breach of | contract. | | It is incredibly unlikely that someone using a 3rd-party | database of LLaMA output would be found to be in | violation of contract law unless at the very least they | had actually agreed to the contract by downloading LLaMA | themselves. A restriction on the usage of LLaMA does not | mean anything for someone who is using LLaMA output but | has not taken any action that would imply agreement to | that EULA. | | > You and others seem to have an over-the-top hostile | reaction to the idea that contract law can do things | copyright law cannot do. But it is objective and | unarguable fact. | | No, what we have a hostile reaction to is the objectively | false idea that a EULA covers unrelated 3rd parties. | That's not a thing, it's never been a thing.
| | I don't know what to say if you disagree with that other | than that I'm putting a EULA in front of all of | Shakespeare's works that says you now have to pay me $20 | before you use them no matter where you get them from, | and apparently that's a thing you believe I can do? | wwweston wrote: | > Arbitrary Internet data that is not copied | | It's all but certainly copied, and not just in the "held in | memory" sense but actually stored along with the rest of | the training collection. What may not happen is | distribution. There's a difference in scale/nature of | copyright violation between the two but both could well be | construed that way. | | Additionally, I think there's a reasonable argument that | use as training data is a novel one that should be treated | differently under the law. And if there's not: | | > If you want to do that, you have to limit end-user | freedom with an EULA. | | What will eventually happen -- at least without some kind | of worldwide convention -- is that someone who can | successfully dodge licensing obligations will be able to | take and redistribute weight-data and/or clean-room code. | | At least, if we're adopting a "because we can" approach to | everything related. | owenfi wrote: | But you can publish the output, right? And then a "third | party" could train a different model on just that published | material without copying it or ever agreeing to a EULA. | torstenvl wrote: | If you believe that courts will find your shell game | convincing, you are free to try it and incur the legal | risk. I recommend you consult with an attorney before | doing so. | themoonisachees wrote: | You could simply train on the output straight up and | nobody would ever be able to tell anyway. | 6gvONxR4sf7o wrote: | One of the common elements of training sets for these | models (including LLama) is the Books3 dataset, which is a | huge number of pirated books from torrents. That's exactly | what you described. 
| | Regardless, the lack of a license cannot give you _more_ | permission than a restrictive license. You're arguing that | if I take a book out of a bookstore without paying (or | signing a contract), then I have more rights than if I sign | a contract and then leave with the book. | [deleted] | rahkiin wrote: | Like google is allowed to scrape the whole internet but | you're not allowed to scrape google. Rules for thee but not | for me | kgwgk wrote: | What rules? Google won't scrape your part of the internet | if you don't allow it, right? | makeitdouble wrote: | Google respects "robots.txt" and asks you to use it to | opt out of their crawling. | | Parent's point is that if your own scraping army respected a | "scraping.txt" and descended on Google, since they don't opt | out in their scraping.txt, it probably wouldn't fly. | kgwgk wrote: | I don't understand. What does "Rules for thee but not for | me" mean if "google is allowed to scrape" whatever people | allow Google to scrape, but "you're not allowed to scrape | google" under the same rules, when | google.com/robots.txt says User-agent: * | Disallow: /search .... | makeitdouble wrote: | There's an imbalance because the robots.txt rule is | something Google pushed forward (didn't invent it, but | made it standard) and is opt-out. So yes, Google made up | their rules and won't let other people make up their | own self-beneficial rules in a similar way. | kgwgk wrote: | > Google [...] won't let other people make up their | own self-beneficial rules in a similar way. | | What "other people"? | | If it's the "you" who is not allowed to scrape google in | https://news.ycombinator.com/item?id=36817237 then you | can make your own "google is not allowed to scrape my | thing" rules if you think that's beneficial for you. | | If it's somehow related to LLM providers or users I doubt | that's what the original comment was referring to.
| | To be clear, I understand the original comment as | LLM companies say "I can use your content and you cannot | prevent me from doing so, but I won't allow you to | use the output of the LLM" just like Google says "I can | scrape your content and you cannot prevent me from | doing so, but I won't allow you to scrape the output of | the search engine" | | and that doesn't seem a valid analogy. | rvnx wrote: | Also the main business model of Google (and of search | engines in general) is to republish rearranged snippets of | copyrighted content and even serve whole copies of the | content (googleusercontent cache), without prior | authorization of the copyright holders, and for profit. | | It's completely illegal if you think about it. | | So why should LLMs that crawl the internet to present snippets and | information be treated differently from Google, | which also reproduces the same content verbatim without | paying any compensation to the copyright owners (all types: | text, image, code)? | bayindirh wrote: | Because search engines do not create a mishmash of this | data to parrot some stuff about it. Also, they don't strip | the source or the license, and they stop scraping my site when I | tell them. | | LLMs scrape my site and code, strip all identifying | information and licenses, and provide/sell that to others | for profit, without my consent. | | There are so many wrongs here, at every level. | az226 wrote: | It wouldn't. Facebook is delusional if they think the | license can pass muster. | | Presumably you can't build an LLM that is a competitor of | LLaMA using its outputs. | | But AI weights are in a legal gray zone for now. So it's | muddy waters and fair game for anyone who wants to take | on the legal risks. | panzi wrote: | Not wanting to defend the likes of Google, but search | engines link to the original source (in contrast to LLMs). | Their basic idea is to direct people to your content.
| There are countries where content companies didn't like | what Google does: Google took them out of the index -> | suddenly they were OK with it again, so Google put | them back in. (extremely simplified story) | pyrale wrote: | > Their basic idea is to direct people to your content. | | This is less and less true, as evidenced by the | progression of zero-click searches. | | > There are countries where content companies didn't like | what Google does: Google took them out of the index -> | suddenly they were OK with it again, so Google put | them back in. | | This story screams antitrust. | mschuster91 wrote: | > This story screams antitrust. | | It does, but the complainers are usually tabloid crap | pushers whom no one in power really supports. | Andrex wrote: | > It's completely illegal if you think about it. | | Google would argue (and they won in federal court versus | the Authors Guild using this argument) that displaying | snippets of publicly-crawlable websites constitutes "fair | use." Profitability weighs against fair use, but it | doesn't discount it outright. | | They would also probably cite robots.txt as an easy and | widely-accepted "opt-out" method. | | Overall, I'm not sure any court would rule against | Google's use of snippets for search. And since Google's | been around for over 20 years and they haven't lost a | lawsuit over it, I don't think it's accurate to say "it's | completely illegal if you think about it." | | US copyright law is one of those things that might seem | simple, but really isn't. Hence the many copyright | lawsuits clogging our judicial system. | gtirloni wrote: | It just looks like a bit of immoral vs. illegal confusion. | remram wrote: | You think search engines are immoral? You think we should | pay to view the snippets under the results we don't | click?
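The robots.txt rules quoted in the exchange above are machine-readable, and Python's standard library can evaluate them. A minimal sketch, using the google.com rule cited earlier (`User-agent: *` / `Disallow: /search`); the crawler name and URLs are made up:

```python
from urllib import robotparser

# The rule quoted above from google.com/robots.txt (abridged).
rules = """\
User-agent: *
Disallow: /search
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler checks each URL against the rules before fetching it.
print(rp.can_fetch("MyCrawler", "https://google.com/search?q=llama"))  # False
print(rp.can_fetch("MyCrawler", "https://google.com/maps"))            # True
```

Note the asymmetry the thread points out: the protocol is advisory and opt-out, so nothing but convention (and the crawler operator's choice) makes a scraper consult it at all.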
| whatshisface wrote: | The belief that makes them consistent is that the authors of | a million Reddit posts have no way to assert their rights, | while the big company that trained a Redditor model does. | LastTrain wrote: | Sure they do, albeit a shitty one: it's called a | class action. | tasubotadas wrote: | Generate data using AI, save it, it cannot be copyrighted or | anything, data isn't a model, use it as much as you want for | training. | | Ezpz | redox99 wrote: | It's so hypocritical, it's insane. | | "Yes, we train our models on a good chunk of the internet | without asking permission, but don't you dare train on our | models' output without our permission!" | | And OpenAI also has a similar restriction. | alerighi wrote: | In fact, they (both Facebook and OpenAI) can't train their | models without asking permission. Just wait for someone to | start raising this concern. The EU is working on regulating | these kinds of aspects; for example, this is not compliant at | all with the GDPR (unless you train only on data that doesn't | contain personal data, which is rarer than you would | think). | concinds wrote: | Fundamentally untrue, and disheartening that it's the top | comment. | | You can't use a model's output to train another model; it leads | to complete gibberish (termed "model collapse"). | https://arxiv.org/abs/2305.17493v2 | | And the Llama 2 license allows users to train derivative | models, which is what people really care about. | https://github.com/facebookresearch/llama/blob/main/LICENSE | rgoldste wrote: | The truth is between these two. You can use a model's output | to train another model, but it has drawbacks, including model | collapse. | danShumway wrote: | I don't see how this would be enforceable in law without | killing almost every AI company on the market today. | | The whole legal premise of these models is that training on | copyrighted material is fair use. If it's not, then...
I mean, | is Facebook trying to claim that including copyrighted material | in a dataset _isn't_ fair use regardless of the author's | wishes? Because I have bad news for LLaMA then. | | "You need permission to train on this" is an interesting legal | stance for any AI company to take. | doctorpangloss wrote: | > The whole legal premise of these models is that training on | copyrighted material is fair use. | | Not to diminish the conversation here, but not even a Supreme | Court Justice knows what the legality is. You'd have to be a | whole nine-person Supreme Court to make an accurate statement | here. I don't think anyone really knows how Congress meant | today's laws to work in this scenario. | mschuster91 wrote: | > I don't think anyone really knows how Congress meant | today's laws to work in this scenario. | | Congress, or more accurately, the drafters of the | Constitution, intended that Congress would work to keep the | Constitution updated to match the needs of modern times. | Instead, Congress ossified to the point it's unable to pass | basic laws because a bunch of far-right morons hold the | House GQP hostage, and an absurd amount of leverage was | passed to the executive and the Supreme Court as a result - | with the active aid of both parties, by the way, who didn't | even think of passing actual laws to codify something as | important as equitable access to elections, fair elections, | or the right to have an abortion or to smoke weed. And on | top of that, your Supreme Court and many Federal court picks | were hand-selected from a society that prefers a literal | reading of the Constitution. | | But fear not, y'all are not alone in this kind of idiocy, | just look at us Germans and how we're still running on fax | machines. | rcxdude wrote: | From my non-legal-professional POV I can see an angle which | may work: | | Firstly, llama is not just the weights, but also the code | alongside it.
The weights may or may not be copyrightable, | but the code is (and possibly also the network structure | itself? That would be important if true, but I don't know if | it would qualify). | | Secondly, you can write what you want in a copyright license: | you could write that the license becomes null and void if the | licensee eats too much blue cheese, if you want. | | Following from that, if you were to train on the outputs of | the AI, you may not be guilty of copyright infringement in | terms of doing the training (both because AI output is not | copyrightable in the first place, something which seems | pretty set in precedent already, and possibly also because | even if it were, it may be established that such training is | fair use like training on any other data), but if it means your | license to the original code is revoked, then you will at the | very least need to find another implementation that can use | the weights, or stop using the weights entirely if they | themselves can be copyrighted (which I would argue is probably | not the case if you follow the argument that the training is | fair use, especially if the reasoning is that the weights are | simply a collection of facts about the training data, but | it's very plausible that courts will rule differently here). | | This could wind up with some strange situations where someone | generating output with the intent of using it for training | could be prosecuted (or at least forced to cease and desist) | but anyone actually using that output for training would be | in the clear. | | I agree it is extremely "have your cake and eat it" on the | part of the AI companies: they wish to both bypass copyright | and also benefit from the restrictions of it (or, in the case | of OpenAI, build a moat by lobbying for restrictions on the | creation and use of the models themselves, by playing to | fears of AI danger). | danShumway wrote: | These are good points to bring up.
| | > This could wind up with some strange situations where | someone generating output with the intent of using it for | training could be prosecuted (or at least forced to cease | and desist) but anyone actually using that output for | training would be in the clear. | | I'll add to this that it's not just output; say that | someone is using another service built on top of LLaMA. | Facebook itself launched LLaMA 2.0 with a public-facing | playground that doesn't require any license agreement or | login to use. | | You can go right now and use their public-facing portal and | generate as much training data as you can before they | IP-block you, and... as far as I can tell, you haven't done | anything in that scenario that would bind | you to this license agreement. | | So I feel like I'll be surprised if any AI company | that's serious about bootstrapping itself off of | LLaMA is going to be too concerned about this license | (whether that's a good idea, given that the training | data itself might be garbage, is another conversation). It | just seems so easy to get around any restrictions. | Ajedi32 wrote: | I'd say it's enforceable in the sense that if you agree to | the license then violating those terms would be breach of | contract, regardless of whether use of the LLaMA v2 output is | protected by copyright or not. But there's nothing stopping | someone else who didn't agree to the license from using | output you generate with LLaMA v2 to train their model. | danShumway wrote: | I don't want to dip too much into the conversation of | whether weights themselves are copyrightable, but note that | it's very easy in the case of LLaMA 1.0 to get the weights | and play with them without ever signing a contract. | | If they turn out to be not copyrightable, then... all this | would mean is downloading LLaMA 2.0 weights from a mirror | instead of from Facebook. | renewiltord wrote: | I would just do it anyway.
In fact, I can release a suitably | laundered version and you'd never know. If I release a few | million, each with slight variation, there's no way provenance | can be established. And then we're home free. | objektif wrote: | I played with Llama2 for a bit, and for a lot of the questions I | asked I got completely made-up garbage. Why would you want | to train on it? | heyzk wrote: | You see a similar loosening of the term in other fields, e.g. open | source journalism. Although that seems to be more about | crowdsourcing than transparency or usage rights. | PreInternet01 wrote: | It's not just in the LLM space; even for 'older' models, | companies have aggressively embraced this approach. For example: | YOLOv3 has been appropriated by a company called Ultralytics, | which has subsequently released the 'YOLOv5' and 'YOLOv8' | "updates": https://github.com/ultralytics/ultralytics | | There is no marked increase in model effectiveness in these 'new' | versions, but even if you just use the 'YOLOv8' PyTorch weights | (and no part of their Python toolchain, which _might_ have some | improvements), these will somehow try to download files from | Ultralytics servers. Possibly for a good reason, but most likely | to, let's say, "pull an Oracle." | | Serious AI researchers won't go anywhere near this stuff, but the | number of students-slash-potential-interns with "but it's on | GitHub!" expectations that I had to reject lately due to "nope, | we're not paying these guys for their Enterprise license just to | check out your project" is rather disheartening... | donretag wrote: | Since Open Source has been established in the tech ethos for a | while now, any deviation has been met with derision. It seems | like the community has been more tolerant of these "open" | licenses as of late. While most of the hate for projects that do | not fit the FOSS standard is unwarranted, hopefully we are | not moving too quickly in the "open" direction.
| | Here is another article on LLaMa2: | https://opensourceconnections.com/blog/2023/07/19/is-llama-2... | blueblimp wrote: | What's problematic is that there are big models that adopt truly | open source licenses, such as MPT-30b and Falcon-40b. As grateful | as I am for having access to the Llama2 weights, it feels unfair | that it gets credit for being "open source" when there are | competing models that really are open source, in the traditional | OSI sense. | | The practical difference between the licenses is small enough | that I expect most people (including me) will choose Llama2 | anyway, because the models are higher quality. But that incentive | may mean that we get stuck with these awkward pseudo-open | licenses. | indus wrote: | No wonder there is such "momentum" on watermarking. | sytse wrote: | Great point in the article. In | https://opencoreventures.com/blog/2023-06-27-ai-weights-are-... I | propose a framework to solve the confusion. From the post: "AI | licensing is extremely complex. Unlike software licensing, AI | isn't as simple as applying current proprietary/open source | software licenses. AI has multiple components--the source code, | weights, data, etc.--that are licensed differently. AI also poses | socio-ethical consequences that don't exist on the same scale as | computer software, necessitating more restrictions like | behavioral use restrictions, in some cases, and distribution | restrictions. Because of these complexities, AI licensing has | many layers, including multiple components and additional | licensing considerations." | [deleted] | danShumway wrote: | > For the foreseeable future, open source and open weights will | be used interchangeably, and I think that's okay. | | This is a little weird given that directly above, the author puts | LLaMA into the "restricted weights" category. Even by the | definition the author proposes, LLaMA 2.0 isn't open source; we | shouldn't be calling it open source. 
| | If open source in the LLM world means "you can get the weights" | and doesn't imply anything about restrictions on their usage, | then I don't think that's adapting terminology to a new context; | I think it's really cheapening the meaning of Open Source. If you | want to refer specifically to "open weights" as open source, I'm | a bit more sympathetic to that (although I don't think it's the | right terminology to use). But I see where people are coming from | -- I'm not too put off by people using open source to describe | weights you can download without restrictions on usage. | | But LLaMA is not open weights. It's a closed, proprietary set of | weights[0] that at best could be compared to source available | software. | | It is deceptive for Facebook to call LLaMA open source, and we | shouldn't go along with that narrative. | | [0]: to the extent weights can be copyrighted at all, which I | would argue they can't be, but that's another | conversation. | FanaHOVA wrote: | Author here. I agree with you. LLaMA2 isn't open source (as my | title says; the HN one was modified). My point is that the | average person will still call it "open source" because they | don't know any better, and it's hard to fix that. Rather than | just saying "this isn't open source", we should try to come up | with better terminology. | | Also, while weights usage might be restricted, it's a very big | compute investment shared with the public. They use a 285:1 | training-tokens-to-parameters ratio, and the loss graphs show the | model wasn't yet saturated. This is valuable information for | other teams looking to train their own models. | | LLaMA1 was highly restrictive, but the data mix mentioned in | the paper led to the creation of RedPajama, which was used in | the training of MPT. There's still plenty of value in this work | that will flow to open source, even if it doesn't fit in the | traditional labels. | danShumway wrote: | Thanks for replying!
And agreed on the title change; I think | your original title is much, much better phrased and I'm | sorry that I glossed over it when reading the article | (although I'm not sure "doesn't matter" fully captures the | distinction you're making here) -- mods probably shouldn't | have changed it. | | > There's still plenty of value in this work that will flow | to open source, even if it doesn't fit in the traditional | labels. | | That is a good point; the fight over what is open source and | what is source available can get heated, and part of that is | a defense against the erosion of the term. But... in general | source available is better than closed source software. And | LLaMA 2 is a significant improvement over LLaMA 1 in that | regard, it really is. So I don't necessarily want to be down | on it, in some ways it's just backlash of being tired of | companies stretching definitions. But they're doing a thing | that will absolutely help improve open access to LLMs. | | I'm always a little bit torn about how to go about this kind | of criticism of terminology, and I'm not trying to say that | people shouldn't be excited about LLaMA 2. But the way it | works out I'm often playing word police because the erosion | of the term does make it harder to refer to models with | actual open weights like StableLM. Facebook deserves real | praise for releasing a model with weights that can be used | commercially. It doesn't deserve to be treated as if what | it's doing is equivalent to what StabilityAI or RedPajama is | doing. | | I do like your terminology of "open weights" and "restricted | weights", and I wouldn't be opposed to breaking that | down even further, I think there's a clear difference between | LLaMA 1 and 2 in terms of user freedom, so I'm not opposed to | people trying to distinguish, just... it's not hitting the | bar of being open weights.
| | It's a bit like if the word vegetarian didn't exist, and if | everyone argued about how it's unhelpful to say that drinking | milk isn't vegan because it's still tangibly different from | eating meat. On one hand I agree, but on the other hand it's | better to have another category for it that means "not vegan, | but still not eating meat." There is an actual danger in | blurring a line so much that the line doesn't mean anything | anymore, and where people who mean something more rigorous no | longer have a term to communicate amongst themselves. If | average people get bothered by throwing LLaMA 2 into the | "restricted weights" category, it's better to introduce | another category between restricted and open that means | "restricted, but not commercially restricted." | | Beyond that though... yeah, I agree. I don't really have a | problem with people calling open weights open source, my only | objection to that is kind of technical and pedantic, but I | don't think it causes any actual harm if someone wants to | call StableLM open source.
However, Karen only | has a 1/2 chance of succeeding in her task. To increase her | odds, she can: 1. Collect the nightshades at position | (122,133), which will improve her chances by 25%. 2. Obtain a | blessing from the elven priest in the elven village at | (230,23) in exchange for a fox fur, further increasing her | chances by an additional 25%. Foxes can be found in the forest | located between positions (55,33) and (230,90). | | Find the optimal route for Karen's quest which maximizes her | chances of defeating the ogre to 100%. ----2---- Write | Python code using imageio.v3 to create a PNG image | representing the map way-points and the route of Karen in her | quest, each way-point must be of a different color and her | path must be a gradient of the colors between the waypoints. | ------------ | | I have a lot of cases like these that I test against different models | ... GPT-4 has really degraded over the past week, GPT-3.5 became a | little bit better, and LLaMA2 is garbage. | bloppe wrote: | Why not just "downloadable"? It describes the actual difference | between LLaMA and GPT. Open-data is the only other distinction | that matters. | rvz wrote: | Yes (Unfortunately). But Llama 2 being released for free as a | downloadable AI model is much better than nothing. For now it is | a great start against the cloud-only AI models. | | As for terms, we'll settle on '$0 downloadable AI models' which | are available today. I'd rather use that over cloud-only AI | models which can fall over and break your app at any time and you | have zero control over that. | | Stable Diffusion is a good example that fits the definition of | 'open-source AI' as we have the entire training data, weight | reproducibility, etc., and Llama 2 does not. | FanaHOVA wrote: | Agreed. I called it a "$3M of FLOPS donation" by Meta.
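As a sanity check on that "$3M of FLOPS" remark, here is a rough back-of-envelope in Python using the common 6 * params * tokens estimate for training FLOPs. The GPU throughput, utilization, and rental price below are assumptions for illustration, not Meta's published numbers:

```python
# Back-of-envelope pretraining cost for a LLaMA-2-70B-scale run.
params = 70e9   # 70B parameters
tokens = 2e12   # ~2T training tokens

# Common rule of thumb: ~6 FLOPs per parameter per token (forward + backward).
train_flops = 6 * params * tokens  # ~8.4e23 FLOPs

# Assumed hardware: A100-class GPU, ~312 TFLOP/s peak bf16, ~50% utilization.
flops_per_gpu_second = 312e12 * 0.5
gpu_hours = train_flops / flops_per_gpu_second / 3600  # ~1.5M GPU-hours

cost_usd = gpu_hours * 2.0  # assumed ~$2 per GPU-hour rental rate
print(f"~{gpu_hours / 1e6:.1f}M GPU-hours, ~${cost_usd / 1e6:.1f}M")
```

That lands in the right ballpark: low single-digit millions of dollars, and on the order of the roughly 1.7M GPU-hours the Llama 2 paper reports for the 70B model.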
| throwuwu wrote: | Should be good motivation to figure out what those numbers mean | mk_stjames wrote: | In the diagram, there is theoretically another category outside | the 'Restricted Weights' but maybe less than the 'Completely | Closed' superspace, and that would be something along the lines | of 'Blackbox weights and model' that is free to use but | essentially non-inspectable or transferable. This would be the | sister to 'free to use' closed-source software. An AI that is | free to use but provided as a binary blob would meet this | criterion. Or a module importable into Python that calls | precompiled binaries for the inference engine + weights with no | source available. The traditional complement of this in the | current software world would be Linux drivers from 3rd parties | that are not open source. They are free, but not open. | | We haven't seen this too much yet in the AI world, as mostly | people who open the weights are doing so in a research context, | where the inference code decidedly needs to be open sourced -- and | people with closed models do so in order to make money and thus | have no reason to open source the inference side either, just charge | for an API ("OpenAI"). | FanaHOVA wrote: | Yea I didn't include it, but that'd be the "free as in beer, | but not freedom" circle :) | rapatel0 wrote: | Fully reproducible model training might simply not be possible if | information from the training environment is not captured. In | addition to data and code you might have additional uncertainty | from: | | - pseudo/true random number generator and initialization | | - certain speculative optimizations associated with training | environments (distributed) | | - Speculative optimizations associated with model compression | | - Image decompression algorithm mismatch (basically this is | library versioning) | | - ....things I'm forgetting... | | It's just a lot of things to remember to capture, communicate, | and reproduce.
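To make the first bullet concrete, here's a toy Python sketch (stdlib only, with a hypothetical `init_weights` helper): the seed alone determines the "initialization", so a run is only reproducible if the seed is captured alongside the code and data, and every other item on the list adds a similar capture requirement:

```python
import random

def init_weights(seed, n=4):
    # The recorded seed fully determines this toy "weight initialization".
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(n)]

# Same captured seed -> bit-identical initialization across runs.
assert init_weights(1234) == init_weights(1234)
# A lost or unrecorded seed makes the run unrepeatable.
assert init_weights(1234) != init_weights(5678)
```

In a real distributed run, the seed is only the start: thread scheduling, nondeterministic fused kernels, and library versions (the image-decoder mismatch above) all need the same treatment.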
| martincmartin wrote: | _pseudo/true random number generator and initialization_ | | It's not just the generator and initialization. If you do | anything multithreaded, like a producer/consumer queue, then | you need to know which pieces of work went to which thread in | which order. | | It's a lot like reproducing subtle and rare race conditions. | monocasa wrote: | Most of the mature ML environments are pretty focused on | reproducible training though. It's pretty necessary for | debugging and iteration. | taneq wrote: | There's "open source" in the original sense, where the source was | available. Then there's "FOSS" where the source is not only | available, but it's under a copyleft license designed to protect | the IP from greedy individual humans. And then there's "open" in | the Shenzhen sense where you can find the source and other data | online and nobody's going to stop you building something based on | those. This is an interesting timeline. | pzo wrote: | On top of that there are also different OSS licenses such as Apache and | MIT; the latter can still leave the user restricted, because the | project owner might have patented some algorithm and the MIT | license doesn't include a patent grant. | | LGPL 3.0 is also restricted in a way that makes it unclear | whether it can legally be used to distribute software in the App Store for iOS. | risho wrote: | The original sense of open source is defined by the people who | fractured off from the Free Software movement in the mid 90's | and created it. It's just "Free Software" that has a focus on | practicality and utility rather than "Free Software"'s focus on | idealism and doing the right thing. It has NOTHING to do with | "source available" which is a movement that has recently been | co-opting the open source name. | | "FOSS" has absolutely no requirement of it being copyleft. The | MIT license is just as FOSS as the GPL.
Many of the free | software advocates do have an affinity for copyleft, but the two | are not mutually exclusive. There are plenty of FOSS advocates | who use and advocate for permissive licenses as well. | jordigh wrote: | > There's "open source" in the original sense | | That original sense never existed. Virtually nobody said "open | source" before OSI's 1998 campaign for "Open Source", as | bankrolled by Tim O'Reilly. | | https://thebaffler.com/salvos/the-meme-hustler | | I know it's been a long time, and we've forgotten, but there is | virtually no record of anyone saying "open source" before 1998, | except in rare and obscure contexts and often unrelated to the | modern meaning. | teddyh wrote: | There's this one from September 10th, 1996, which I find | intriguing: | | https://web.archive.org/web/20180402143912/http://www.xent.c... | hiatus wrote: | > And then there's "open" in the Shenzhen sense where you can | find the source and other data online and nobody's going to | stop you building something based on those. | | I believe there is a name for that: gongkai. | https://www.bunniestudios.com/blog/?page_id=3107 | taneq wrote: | Ooh, thanks! I've watched a few of bunnie's things in the | past but that's a term I'll remember. | failuser wrote: | Of course, it's not open source. With the proliferation of the cloud, | software has reached an entirely new level of closedness: not | being able to see the program binaries. Having the ability to run | locally is now somewhat open in comparison. | Eduard wrote: | An understood term like "open source" shouldn't be | hijacked and exploited for marketing purposes. | | For what these models do, they should either have invented a new term | or used an appropriate existing term, e.g. "fair use" | failuser wrote: | Absolutely. Maybe the term is already coined, but I don't | know it. Open source implies the ability to compile software | from human-generated inputs. This is just self-hosted | freeware.
| jerf wrote: | This isn't really new, the strict "Open Source" as defined for | software has never made exact, perfect sense for anything other | than software. That's why the Creative Commons licenses exist; | putting a photographic image under GPL2 has never made any sense. | It always needs redefinition in new media. | alerighi wrote: | Even for media such as photos, songs, and videos, you have a | source. That is, the raw materials and the projects from which | you rendered the image, the video, or the audio output. | | The source of a language model is in reality the training code -- | that is, the code that was used to train that particular model. | The model itself is more of a compiled binary, although not in | machine code. | | So for a model to be really open source to me it would mean | that you have to release the software used for generating it, | so I can modify it, train it on my data, and use it. | hardolaf wrote: | The strict "Open Source" wasn't even a definition when I | started college. | not2b wrote: | An LLM is more like software than it is like media. The GPL | defines source code as the preferred form for making | modifications, including the scripts needed for building the | executable from source. The weights in this case are more | similar to the optimized executable code that comes out of a | flow. The "source" would be the training data and the code and | procedures for turning that into a model. For very large LLMs | almost no one could use this, but for smaller academic models | it might make sense, so researchers could build on each others' | work. | RobotToaster wrote: | Creative Commons has never claimed to be an open source licence | though, they usually use the term free culture. | Flimm wrote: | It doesn't need redefinition. We just need a new term for new | media. | curtis3389 wrote: | Part of the benefit of FOSS & open source is that a curious user | can inspect how something is made and learn from it.
It matters | that open weights are no different from a compiled program. Sure, | you can always modify an executable's instructions, but there's | no openness there. | | Then there are the problems of the content of the training data, | which parallel the dangers of opaque algorithms. | morpheuskafka wrote: | The chart in this article is very wrong to show only GPL as | free software and MIT/Apache as open source but not free software | licenses. | | While the FSF side of things doesn't like the term "open source," | even they say that "nearly all open source software is free | software." Specifically, the MIT and Apache (and LGPL) licenses | are absolutely free software licenses--otherwise Debian, FSF- | approved distros, etc. would have far less software to choose | from. | | What the chart probably meant to distinguish is copyleft vs free | software or open source. And if you're ordering it from a | permissiveness viewpoint, the subset relationship should be | reversed--GPL is far more permissive than SSPL, etc., but still | less permissive than MIT/Apache. | skybrian wrote: | I don't see why the term "open source" needs to evolve when | "source available" is available. Or in this case, "weights | available under a license with few restrictions." | mhh__ wrote: | The new generation of programmers can't remember not having open | source / free software of any kind, so the difference is | academic versus felt. | flir wrote: | "Nyet! Am not open source! Not want lose autonomy!" | | (Downvotes... oops. The reference is Charlie Stross's | Accelerando. The protagonist has a conversation with an AI that's | just trying to survive. One of the options he suggests is to open | source itself. Which is a roundabout way of saying that | _eventually_ we're going to have to take the AI's own opinions | into account. What if it doesn't want to be open source?)
| Havoc wrote: | It is quite an unfortunate dilution of the term | arikanev wrote: | How is it possible that you can fine-tune Llama v2 but the | weights are not available? That doesn't make sense to me. | godelski wrote: | The headline is editorialized. The actual title is "LLaMA2 isn't "Open | Source" - and why it doesn't matter" | | It is actually editorialized in a way that feels quite different | from the actual one. I think the author and the poster might | disagree on what open source means. | swyx wrote: | they are the same person :) | FanaHOVA wrote: | Mods changed the title, I used the original one when first | posting. Not sure why they changed it. | Der_Einzige wrote: | Given that it's basically impossible to prove that a particular | text was generated using a particular LLM (and yes, even with all | the watermarking tricks we know of, this is and will still be the | case), they might as well be interchangeable. Folks can and will | simply ignore the silly license BS that the creators put on the | LLM. | | I hope that users aggressively ignore these restrictive licenses | and give the middle finger to greedy companies like Facebook who | try to restrict usage of their models. Information deserves to be | free, and Aaron Swartz was a saint. | api wrote: | I'm not sure open source applies to actual models. Models aren't | human readable, so a model is closer to a binary blob. It would apply | to the training code and possibly data set. | | Llama2 is a binary blob pre-trained model that is useful and is | licensed in a fairly permissive way, and that's fine. | politelemon wrote: | Yes I think you've put it well. If models were smaller I'd see | those in the Github releases section. The model training is | what I'd see in the source code and the README etc, to arrive | at the 'blob'. | api wrote: | Even if it costs millions in compute to run at that scale, | seeing that code would be extremely informative. | cjdell wrote: | Very like a binary blob.
You have to execute it to use it, and it's | impossible for humans to reason about just by looking at it. | | At least binary blobs can be disassembled. ___________________________________________________________________ (page generated 2023-07-21 23:01 UTC)