[HN Gopher] Google denies training Bard on ChatGPT chats from Sh...
___________________________________________________________________
Google denies training Bard on ChatGPT chats from ShareGPT
Author : chatmasta
Score : 363 points
Date : 2023-03-30 11:16 UTC (11 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| mupuff1234 wrote:
| This just in, web indexing company scrapes web for data.
| cpeterso wrote:
| Regardless of whether this happened or not, would training Bard
| on ChatGPT output be good or bad for Bard's product quality? I
| imagine there's a risk of AIs recursively reinforcing bad data in
| their models. This problem seems unavoidable as more web content
| becomes AI-generated content and spam.
| ankit219 wrote:
| According to the article, the story goes this way: this engineer,
| Jacob Devlin, raised his concerns about training Bard with
| ShareGPT data. Then he directly joined OpenAI.
|
| He also claims that Google were about to do it, and then they
| stopped after his warnings. And presumably removed every trace of
| OpenAI's responses.
|
| A couple of things:
|
| 1. So, Bard could have been trained on ShareGPT but it's not -
| according to the same engineer who raised the concern (and
| Google's denial in The Verge).
|
| 2. Since he directly joined OpenAI, he could have told them and
| they could have taken action, and nothing is public on that front
| yet. Probably nothing to see here.
|
| Edit: The engineer, too, wasn't directly involved with the Bard
| team; it appeared to him that the Bard team was heavily relying
| on ShareGPT.
| binarymax wrote:
| For those that don't know, Jacob Devlin was the lead engineer
| and first author of the widely popular BERT model architecture,
| and of the initial bert-base models released by Google.
|
| https://www.semanticscholar.org/author/Jacob-Devlin/39172707
| [deleted]
| whimsicalism wrote:
| Your comment doesn't make sense to me.
|
| > Bard team was heavily relying on ShareGPT.
|
| > He also claims that Google were about to do it, and then they
| stopped after his warnings.
|
| So were they heavily relying or were they about to and then
| stopped? It's unclear from your comment. Could you link where
| you're getting this info from? The Information article is
| walled, unfortunately.
| ankit219 wrote:
| [1] gives a gist as well.
|
| What I meant to say was: according to The Information article,
| the engineer raised concerns because it appeared to him (the
| article's wording) that the Bard team was using (and heavily
| reliant on) ShareGPT for Bard training. The engineer wasn't
| working on Bard, and presumably someone told him or somehow he
| got the impression that the Bard team was reliant on ShareGPT.
| At the time he was at Google.
|
| Then, when he raised concerns to Sundar Pichai, the Bard team
| stopped doing it and also scrapped any traces of ShareGPT
| data. So, the headline is false and Bard (again presumably)
| is not trained on any ShareGPT data.
|
| [1]: https://www.theverge.com/2023/3/29/23662621/google-bard-chat...
| whimsicalism wrote:
| I think I might be confused by your usage of "about to do
| it" in your original comment to mean "actively doing it."
|
| You claim that the very engineer accusing Google of
| training Bard on ShareGPT acknowledges that the final
| product was not. As far as I can tell, Devlin did no such
| thing.
|
| Not sure why you would presume they restarted their
| expensive training process.
|
| It just doesn't seem like a good-faith characterization to
| me.
| rgbrenner wrote:
| Take what action?
| Pretty sure that's not illegal, especially since the training
| data is AI-generated and therefore can't be copyrighted.
| chatmasta wrote:
| I think the oomph behind the story is due to it being
| embarrassing, rather than illegal.
| dahfizz wrote:
| OpenAI could have blocked Google's accounts, for example.
| Nothing really to do with legality.
| sebzim4500 wrote:
| No one is alleging that Google directly used OpenAI's API
| to get training data (which would be unambiguously against
| TOS). The claim is that they downloaded examples from
| ShareGPT.
| frozenlettuce wrote:
| Not illegal, but that won't stop people from finding it
| amusing that the company considered to be the world's beacon of
| innovation is copying someone else's homework. It's hard
| being the favorite horse.
| dvngnt_ wrote:
| Tech companies steal ideas all the time. Snapchat invented
| stories, and now WhatsApp, Facebook, Instagram, TikTok, and
| YouTube have them.
| shmerl wrote:
| Well, ChatGPT itself was trained on something else, so how is
| Bard any worse? AIs copying each other is only natural to expect.
| ChatGTP wrote:
| I couldn't be happier, keep up the good work. Steal away, just as
| OpenAI has done.
| visarga wrote:
| This could actually be a good way to sidestep the training set
| copyright and access right issues. Copyright protection should
| solely encompass the expression of human-generated content and
| not the underlying concepts.
|
| By training model B using the results generated by model A, the
| copyright of corpus_A (OpenAI's RLHF dataset) remains
| safeguarded, as model B is never directly exposed to corpus_A,
| preventing it from duplicating the content verbatim.
|
| This process only transmits the concepts originating from
| corpus_A, which represent universal knowledge that cannot be
| claimed by any individual party.
| burakemir wrote:
| "... as a joke."
| dathinab wrote:
| People complained that new AI is "stealing" from artists.
|
| But stealing from other AI turns out to often be easier.
|
| And this is where things get fun, because companies like OpenAI
| want to be able to train on all the data without any explicit
| permissions from the creators, but the moment people do the same
| to them they will likely (we will see) be very much against it.
|
| So it will be interesting whether they will be able to both have
| and eat the cake (e.g. by using Microsoft's lobbying to push
| absurd laws) or whether they will fall apart due to
| cannibalization making it unprofitable to create better AI.
|
| EDIT: This comment isn't specific to Google/Bard, so it doesn't
| matter whether Google actually did so or not.
| commoner wrote:
| I can see the GitHub Copilot controversy being resolved in this
| way. If Microsoft, GitHub, and OpenAI successfully use the fair
| use defense for Copilot's appropriation of proprietary and
| incompatibly licensed code, then a free and open source
| alternative to Copilot can be trained on Copilot's outputs.
|
| After all, the GitHub Copilot Product Specific Terms say:
|
| > 2. Ownership of Suggestions and Your Code
|
| > GitHub does not claim any ownership rights in Suggestions.
| You retain ownership of Your Code.
|
| https://github.com/customer-terms/github-copilot-product-spe...
| century19 wrote:
| Google accused Microsoft Bing of using them for page rankings a
| few years ago. Set up a sting to show that when you searched for
| something unique on Google using Internet Explorer, shortly
| afterwards the same search result would start showing up on Bing.
|
| This was seen as deeply embarrassing for Microsoft at the time.
| godzillabrennus wrote:
| The deeply embarrassing period at Microsoft began and ended
| when Ballmer ran the show. The Bing results saga was the
| hangover.
| blisterpeanuts wrote:
| Embarrassing, maybe, but imitation is the sincerest form of
| flattery.
| int_19h wrote:
| Indeed, which is why the biggest impact this revelation is
| likely to have (if proven true) is on Google's stock.
| brucethemoose2 wrote:
| This is also bad because the risk of AI "inbreeding" is real. I
| have seen invisible artifact amplification happen in a single
| generation training ESRGAN on itself.
|
| Maybe it won't happen in a single LLM generation, but perhaps gen
| 3 or 5 will start having really weird speech patterns or
| hallucinations because of this.
| sebzim4500 wrote:
| Worst-case scenario, they just start only training on pre-2020
| data and then fine-tuning on a dataset which they somehow know
| to be 'clean'.
|
| In practice though I doubt that AI contamination is actually a
| problem. Otherwise how would e.g. AlphaZero work so well (which
| is effectively _only_ trained on its own data)?
| whimsicalism wrote:
| The parallels with AlphaZero are not so easy.
|
| The problem is you need some sort of arbiter of who has "won"
| a conversation, but if the arbiter is just another transformer
| emitting a score, the models will compete to match the
| incomplete picture of reasoning given by the arbiter.
| brucethemoose2 wrote:
| It could degrade the model in a way that avoids the metrics
| they use for gauging quality.
|
| The distortions that showed up in ESRGAN (for instance) didn't
| seem to affect the SSIM or anything (and in fact it was
| trained with an MS-SSIM loss), but the "noise splotches" and
| "swirlies", as I call them, were noticeable in some of the
| output, and you have to go back and look _really_ hard at the
| initial dataset to spot what it was picking up. Sometimes,
| even after cleaning, it felt like what it was picking up on
| was completely invisible.
|
| TL;DR: Google may not even notice the inbreeding until it's
| already a large issue, and they may be reluctant to scrap so
| much work on the model.
| gigel82 wrote:
| Where are all those people who kept saying Google had an amazing
| model way beyond ChatGPT internally for years? Those comments
| always kept coming up in ChatGPT posts; maybe they'll stop now.
| Imnimo wrote:
| I don't care at all about this from a copyright or data ownership
| perspective, but I am a little skeptical that it's a good idea to
| be this incestuous with training data in the long run. It's one
| thing to do fine-tuning or knowledge distillation for specialized
| domains or shrinking models. But if you're trying to train your
| own foundation model, is relying on output from other foundation
| models going to make it learn to imitate their errors?
| sdenton4 wrote:
| Things like ShareGPT or PromptHero give vast repositories of
| human-curated ML outputs, which make them fantastic for at
| least incremental improvement on the base model. In the grand
| scheme of things, these will be just another style, mixed in
| with all the other crap in the training set, so I don't imagine
| it's too harmful... e.g., 'paint Starry Night in the style of
| Midjourney 5'
| berkle4455 wrote:
| Where are any LLMs going to get data from as they become more
| ubiquitous and humans produce less publicly accessible original
| and thoughtful content?
|
| The whole thing is a plateaued feedback loop.
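A quick illustration of brucethemoose2's point above, that artifact amplification can hide from the very metrics used to gauge quality: in the rough sketch below (Python with NumPy and scikit-image as assumed libraries, not the actual ESRGAN pipeline), a faint structured pattern added to an image barely moves SSIM, even though a model trained on such images could latch onto and amplify it.

    # Rough sketch, not brucethemoose2's actual ESRGAN setup: a faint
    # structured artifact barely changes SSIM, the kind of blind spot
    # that lets "inbreeding" go unnoticed.
    import numpy as np
    from skimage.metrics import structural_similarity as ssim

    rng = np.random.default_rng(0)
    clean = rng.random((256, 256))        # stand-in ground-truth image

    # Add a low-amplitude periodic pattern (a hypothetical "swirly").
    x = np.arange(256)
    pattern = 0.01 * np.sin(2 * np.pi * x / 8)
    dirty = np.clip(clean + pattern[None, :], 0.0, 1.0)

    score = ssim(clean, dirty, data_range=1.0)
    print(f"SSIM: {score:.4f}")           # typically prints ~0.99+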
| TillE wrote:
| It'd be cool to have an LLM that's trained almost exclusively
| on books from good publishers, and other select sources.
| Working out licensing deals would be a challenge, of course.
| whimsicalism wrote:
| The corpus is likely too small. It would just be an "LM".
| whimsicalism wrote:
| Probably from multiple modalities, as well as extending the
| sequence lookback length further and further.
|
| They have low perplexity now, but the perplexity possible
| when predicting the next word on page 365 of a book, where you
| can attend over the last 364 pages, will allow even more
| complexity to emerge.
| whimsicalism wrote:
| But Bard isn't a foundation model?
|
| Clearly this data has value as some sort of RLHF fine-tuning
| dataset. Honestly they probably used it for negative examples.
| kleiba wrote:
| Hard to believe that is true, or else Bard would probably not
| perform so badly.
| waselighis wrote:
| Google only has a fraction of the training data. OpenAI had a
| huge head start and has been collecting training data for years
| now. ChatGPT is also wildly popular, which has given them tons
| more training data. It's estimated that ChatGPT gained over 100
| million users in the first two months alone, and may have over
| 13 million active users daily.
|
| The logs on ShareGPT are merely a drop in the bucket.
| rocmcd wrote:
| > Google only has a fraction of the training data.
|
| Uh, what? The same Google that has been crawling, indexing,
| and letting people search the entire Internet for the last 25
| years? They have owned DeepMind for nearly twice as long as
| OpenAI has been in existence!
|
| If anything this is proof that no one at Google can get
| anything done anymore, and lack of training data ain't the
| problem.
| mirker wrote:
| The alignment portion of training requires you to have
| upvote/downvote data on many LLM responses. Google's
| attempt at that (at least according to the news so far) was
| asking all employees to volunteer time ranking the
| responses. Combined with no historical feedback from
| ChatGPT, they are behind.
| duringmath wrote:
| Bard is only a week old and has a large "experimental" sticker
| on it. Besides, its UI is better and the answers are succinct,
| which I prefer.
| bastardoperator wrote:
| They literally copied the ChatGPT UI, lol, only it looks like
| a dated Google UI. How do you prefer answers with less
| data?... that's crazy.
| dvngnt_ wrote:
| Doing a visual diff will show you it's not a literal copy.
| bastardoperator wrote:
| I'm talking design, not code, lol...
| duringmath wrote:
| I just don't want to be hit with a wall of text every
| single time; it gets the point across with minimal padding
| (high signal-to-noise ratio). ChatGPT feels like it gets
| paid by the word, and they do actually charge by token if
| you use the API.
|
| As for the UI, it's a take on the tried-and-true chat UI,
| same as ChatGPT's; it spits out the whole answer at once
| instead of feeding it to you one word at a time, it has an
| alternative-drafts button, the "Google it" button is a nice
| touch, and it feels quicker.
| bastardoperator wrote:
| You can combat that in the prompt. I use "just code, no
| words", which will also remove code comments from the output.
| Bard doesn't respect the same request. You can be more
| succinct with ChatGPT. Half the things I ask for in Bard
| give me this:
|
| "I'm still learning coding skills, so at the moment I can't help with this.
| I'm trained to do things like help you write lists about
| different topics, compare things, or build travel itineraries.
| Do you want to try any of those now?"
| duringmath wrote:
| Longer instructions? Which part of "less is more" do you
| not understand?
| bastardoperator wrote:
| What part of succinct do you not understand? Bard
| provides a bunch of useless text too, only you can't get
| rid of it. No worries, you don't know how to use ChatGPT;
| have fun with Bard until Google cancels it.
| karmasimida wrote:
| Yeah, Bard's replies are nothing like those from ChatGPT.
|
| I wonder if it's possible to use ChatGPT for competitor analysis?
|
| If the responses are not used in the final training data, I
| don't see how this is controversial.
|
| Also, if Google's compliance team can't even recognize this
| level of legal risk, even with an army of top-paid lawyers on
| the payroll, I don't know what to say. Maybe they should fall
| then.
| m00x wrote:
| ITT armchair lawyers LARPing.
| croes wrote:
| Would 112k conversations make a huge difference in the model?
| int_19h wrote:
| For fine-tuning, yes, absolutely.
| social_quotient wrote:
| It's interesting when we say Google did this. It was actually,
| and likely, some people who work for Google and are on this
| forum who did this. Knowingly, not by accident while slurping up
| the rest of the internet, and they got paid to do it. I wonder
| what the engineers' view on this was/is. I have to assume they
| ballpark know the terms of the OpenAI data (regardless of whether
| you disagree or not).
|
| Anyone care to steel-man the argument for why this was a good
| idea?
| hackerlight wrote:
| > Anyone care to steel-man the argument for why this was a good
| idea?
|
| I don't see a big difference between this and training it on
| people's code and art, which also happens without explicit
| permission.
| Nimitz14 wrote:
| I don't understand why it's a bad idea. Did OpenAI ask for
| permission to use the data it uses? (No.)
| seanhunter wrote:
| "What's sauce for the goose is sauce for the gander", as the
| legal cliche goes. OpenAI cannot on the one hand claim that
| Google did something wrong if they used their outputs as part of
| the Bard training while simultaneously on the other hand claiming
| they themselves are free to use the content of everyone on the
| internet to train their model.
|
| Either they believe that training should respect copyright (in
| which case they could not do what they do) or they believe that
| training is fair use (in which case they cannot possibly object
| to Google doing the same as them).
| az226 wrote:
| A big whoosh here. OpenAI is fair use because an LLM is
| transformative from the content it gathered. Bard is literally
| the same product as ChatGPT, so it is not transformative at
| all. Tell me you know nothing about copyright without telling
| me you know nothing about copyright.
| cornholio wrote:
| That's nonsensical. An AI is either transformative or it's
| not; it's an intrinsic quality that has nothing to do with
| the training data or the "product" type. If OpenAI is
| sufficiently transformative to claim fair use (which I don't
| believe for a second, alas), then any other AI built on
| similar fundamentals has the same claims and can crunch any
| data their creators see fit, including the output of other
| AIs.
| sebzim4500 wrote:
| No one is alleging copyright violations. The claim is that they
| violated OpenAI's terms of service.
| We don't know whether Google ever even agreed to those terms of
| service in the first place.
| seanhunter wrote:
| Are OpenAI saying they have adhered to the terms of service
| of all the content they have used?
| dragonwriter wrote:
| _Content_ is not subject to terms of _service_.
|
| _Services_ are subject to terms of service. (If content is
| received through a service, the terms of service may govern
| use of it, but that's not a feature of the content, but the
| acquisition route.)
| deckard1 wrote:
| Terms of Service, Terms and Conditions, and Terms of Use
| are all the same thing. There is no legal difference
| between them.
|
| > that's not a feature of the content, but the
| acquisition route.
|
| It's neither. It's a feature of contract law.
| danShumway wrote:
| ShareGPT isn't part of that service though. Yes, it would
| be a TOS violation if Google directly used ChatGPT to
| generate transcripts -- but not even the original Twitter
| thread is claiming that.
|
| The only claim being made against Google here is that
| they used ChatGPT _content_. I can't find any sources
| claiming that Google made use of an OpenAI service. So
| the distinction is correct, but doesn't seem particularly
| valuable in this context -- using data from ShareGPT is
| not a TOS violation.
| ar9av wrote:
| I love that OpenAI uses a ton of other people's work to train
| their model, yet when someone uses OpenAI to train their model,
| they get all up in arms.
|
| As far as I'm concerned, OpenAI has decided terms of use don't
| exist anymore.
| jug wrote:
| OpenAI is training on data that is against their terms of use?
| That reads like a serious allegation. What is this all about?
| cycomanic wrote:
| OpenAI is training on copyrighted data without a licence. I
| would argue copyright law has much stronger legal standing
| than some ToS.
|
| Now OpenAI is arguing their training is fair use, but that
| has certainly not been legally established so far and could
| just as much be used as a defence against ToS violation.
|
| So in short, yes, OpenAI is pretty much doing the same thing.
| modernpink wrote:
| Where are they up in arms?
| paxys wrote:
| 1. Google denies doing it, so at the very least the title should
| have an "allegedly".
|
| 2. Even if they did - so what? The output from ChatGPT is not
| copyrightable by OpenAI. In fact it is OpenAI that is training
| its models on copyrighted data, pictures, and code from all over
| the internet.
| manojlds wrote:
| But remember many years back when it was news that Bing used
| Google search results to improve its results.
| magicalist wrote:
| It's not quite the same thing, because Bing was getting the
| data from a browser toolbar and watching the search terms
| used and where the user went afterwards.
|
| A closer equivalent would be if someone had made a ShareSERP
| site and people posted their favorite search terms and the
| results Google gave, and Bing crawled that and incorporated
| the search-term-to-link connections into their search graph.
|
| The actual actions had _maybe_ gone too far (personally I
| thought it was funnier than "copying"); the hypothetical
| would be pretty much what you'd expect to happen. Even Google
| would probably crawl ShareSERP and inadvertently reinforce
| their own results (the same way OpenAI presumably gets more
| than a bit of their own results back at them in any new
| crawls of reddit, HN, etc. even if they avoid sites like
| ShareGPT deliberately).
| cma wrote:
| > Google catches Bing copying [search results], Microsoft
| says "so what?"
|
| https://arstechnica.com/information-technology/2011/02/googl...
| Jimmc414 wrote:
| > Even if they did - so what?
|
| Amplification of biases, propagation of errors, echolalia and
| over-optimization, lack of diverse data, overfitting.
| funkyjazz wrote:
| Not to mention it's embarrassing. Google playing second
| banana to OpenAI.
| nicehill wrote:
| I think Amazon was first in the (free) banana business
| jrirhfifj wrote:
| You joke, but the first product they changed at Whole Foods
| was the bananas.
|
| Before: organic (South America) and regular (Central America
| or SEA) for 69 and 59 cents.
|
| Then: both Chiquita brand with regular and organic
| stickers (clearly the same produce, always from SEA) for
| 49 and 39 cents.
|
| That was days after the announcement.
| bbarnett wrote:
| Did you inadvertently reverse the regular/organic order,
| or was organic cheaper after?
| prepend wrote:
| Google's been second banana to OpenAI for a few years now,
| right?
| ithkuil wrote:
| That assumes that training on the output of another
| language model somehow gives you the ability to improve
| your model and to catch up somehow
| iandanforth wrote:
| It does. In general this is known as teacher-student
| training or knowledge distillation. It works better if
| you have access to the activations of the model, but you
| can work with just outputs as well.
| satvikpendem wrote:
| Well, it does, that's how we got Alpaca from LLaMA.
| jrirhfifj wrote:
| You talk like ChatGPT was some bastion of curated, perfectly
| correct content. Get a grip. Web scraping is web scraping.
| RosanaAnaDana wrote:
| I mean, maybe. There also might be something to this. OpenAI
| has been very opaque about training techniques.
| paxys wrote:
| That's just the base concern with every single model,
| regardless of where they sourced their data from. Garbage in,
| garbage out.
| educaysean wrote:
| Sure. Does that fact mean we're prohibited from expressing
| concerns about data quality? ShareGPT isn't representative
| of authentic, quality writing.
| Jimmc414 wrote:
| Right, but training an LLM on the output of another LLM can
| certainly exacerbate these issues.
| paxys wrote:
| Maybe, but we are fast approaching the point (or more
| likely have crossed it already) where distinguishing
| between human- and AI-generated data isn't really
| possible. If Google indexes a blog, how does it know
| whether it was written with AI assistance and therefore
| should not be used for training? Heck, how does OpenAI
| itself prevent such a feedback loop from its own output
| (or that of other LLMs)?
| madeofpalk wrote:
| > If Google indexes a blog, how does it know whether it
| was written with AI assistance and therefore should not
| be used for training
|
| Yes, this is an existential problem for Google and for
| training future LLMs.
|
| See also, https://www.theverge.com/23642073/best-printer-2023-brother-...
| and https://searchengineland.com/verge-best-printer-2023-394709
| abduhl wrote:
| Your argument would have a lot more force if we were past
| that point rather than fast approaching that point.
| Concerns about training data errors being compounded are
| much more important when you're talking about the
| bleeding edge.
|
| And your question about how OpenAI prevents their
| training data from being corrupted is one we should be
| asking as well!
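For readers wondering what the teacher-student training iandanforth describes above looks like in practice, here is a minimal sketch of the classic logit-matching form of knowledge distillation (assuming PyTorch; it is an illustration, not any lab's actual pipeline). When only scraped text is available, as in the ShareGPT case, there are no teacher logits and "distillation" reduces to ordinary supervised fine-tuning on the teacher's outputs.

    # Minimal knowledge-distillation sketch (assumed PyTorch), blending a
    # soft loss against the teacher's distribution with a hard loss
    # against ground-truth labels.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        # Soft part: match the teacher's softened output distribution.
        soft = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * (temperature ** 2)
        # Hard part: ordinary cross-entropy against the true labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    # Toy usage with random tensors standing in for real model outputs.
    student_logits = torch.randn(4, 10, requires_grad=True)
    teacher_logits = torch.randn(4, 10)
    labels = torch.randint(0, 10, (4,))
    loss = distillation_loss(student_logits, teacher_logits, labels)
    loss.backward()
    print(float(loss))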
| rightbyte wrote:
| > Heck, how does OpenAI itself prevent such a feedback
| loop from its own output (or that of other LLMs)?
|
| Seems trivial. Only use old data for the bulk? Feed in some
| carefully curated new data?
| toxik wrote:
| Future job: token selector / archiving
| notahacker wrote:
| <meta name="generator" content="human brain">
|
| I'm only half joking.... I think we likely will end up
| with flags for human-generated/curated content (and it
| will have to be that way round, as I can't imagine
| spammers bothering to put flags on AI-generated stuff),
| and we probably already _should_ have an equivalent of the
| robots.txt protocol that allows users to specify which
| parts of their website they would and wouldn't like used
| in the training of LLMs.
| jfk13 wrote:
| If content with a "human-generated" flag is rated more
| highly in some way -- e.g. search results -- then _of
| course_ spammers will automatically add that flag to
| their AI-generated garbage. How do you propose to prevent
| them?
| notahacker wrote:
| I assume, like the actual meta generator tags, it
| wouldn't actually be a massive boon for regular search
| results.
| shubhamkrm wrote:
| Reminds me of the old "evil bit" RFC[1]
|
| [1] https://www.ietf.org/rfc/rfc3514.txt
| KRAKRISMOTT wrote:
| OpenAI's terms of service forbid training competitor models via
| their ML outputs (LoRA/Alpaca-style laundering is probably not
| allowed for commercial use).
| worldofmatthew wrote:
| Are the TOS even enforceable if AI content can't be
| copyrighted?
| space_fountain wrote:
| Where exactly does it do that? I looked a bit and couldn't
| find it, but likely I was just wrong.
| short_sells_poo wrote:
| I love how they don't want others to use their model
| output but they have no qualms about training their model on
| the copyrighted works of others? Isn't this a stunning level
| of hypocrisy?
| Certhas wrote:
| This is really hilarious. Authors and artists never gave
| permission to use their work to train AI models either...
|
| Not legally the same situation, but ethically close enough.
| saurik wrote:
| So, to verify, are you claiming that if someone added a
| similar clause to their source code and then GitHub went
| ahead and trained Copilot against it, that would be an issue?
| bloppe wrote:
| You relinquish all licensing rights when you upload your
| code to GitHub. Microsoft can do whatever they want with
| it. That's in their ToS, which you have to agree to when
| you make an account. Normally, only affirmatively accepted
| ToS are enforceable, so just putting a clause into your
| license doesn't work (unless it's a copyright, which
| doesn't require consent).
| flir wrote:
| > You relinquish all licensing rights when you upload
| your code to GitHub
|
| What now? Seriously?
|
| I found this. Section D4.
|
| "We need the legal right to do things like host Your
| Content, publish it, and share it. You grant us and our
| legal successors the right to store, archive, parse, and
| display Your Content, and make incidental copies, as
| necessary to provide the Service, including improving the
| Service over time. This license includes the right to do
| things like copy it to our database and make backups; show
| it to you and other users; parse it into a search index or
| otherwise analyze it on our servers; share it with other
| users; and perform it, in case Your Content is something
| like music or video."
|
| "as necessary to provide the Service" seems critical.
| bloppe wrote:
| "Improving the service over time" can do a lot of heavy
| lifting, definitely including training Copilot.
| commoner wrote:
| Also, section D3 of the GitHub Terms of Service says:
|
| > You retain ownership of and responsibility for Your
| Content.
|
| and section D4 says:
|
| > This license does not grant GitHub the right to sell
| Your Content. It also does not grant GitHub the right to
| otherwise distribute or use Your Content outside of our
| provision of the Service, except that as part of the
| right to archive Your Content, GitHub may permit our
| partners to store and archive Your Content in public
| repositories in connection with the GitHub Arctic Code
| Vault and GitHub Archive Program.
|
| There is nothing in the terms that requires the GitHub
| user to relinquish all licensing rights.
|
| https://docs.github.com/en/site-policy/github-terms/github-t...
| bloppe wrote:
| The clauses always have a trap door: "[outside of] our
| provision of the Service" means they can do anything as
| long as it's a service they provide.
|
| Under definitions: _The "Service" refers to the
| applications, software, products, and services provided
| by GitHub, including any Beta Previews._
| commoner wrote:
| I think there's a misunderstanding over what the word
| "relinquish" means.
|
| The terms make clear that uploading code to GitHub gives
| GitHub the right to "store, archive, parse, and display
| Your Content, and make incidental copies, as necessary to
| provide the Service, including improving the Service over
| time" while the code is hosted on GitHub.
|
| However, that's not the same thing as relinquishing
| (giving up) licensing rights to GitHub. The uploader
| still retains those rights, and there is nothing in the
| terms that says otherwise.
| gcr wrote:
| The question turns on whether you consider Copilot part
| of the "GitHub service."
|
| GitHub would argue that it is, and they'd likely argue
| that charging for access to Copilot is akin to charging
| for access to private repositories.
|
| Others would say that Copilot is somehow separate from
| the services GitHub provides, so using their code for
| Copilot wouldn't be covered by the ToS.
| bloppe wrote:
| It is certainly a service that's being provided. If not
| by GitHub, then by whom?
|
| I'll repeat the definition of service: _The "Service"
| refers to the applications, software, products, and
| services provided by GitHub, including any Beta
| Previews._
| cycomanic wrote:
| So do you believe that if you hosted a closed-source project
| on GitHub, and GitHub decided they wanted to integrate it
| into their service, they would simply be allowed to take
| the code?
|
| Fortunately HN commenters are not judges. And I would
| wager any bet that MS lawyers would not try to argue
| based on their ToS either; that would be a recipe for
| losing any court case.
| bloppe wrote:
| I just mean that it doesn't really matter what your
| license says as long as GitHub can come up with a
| business justification for using it in some way.
| Certainly, other users still legally have to obey your
| copyright.
| saurik wrote:
| So, to verify, are you claiming it would not be allowed
| for _you_ to upload _my_ otherwise-open-source code (code
| I do not myself host at GitHub, but which was reasonably
| popular / important code) to GitHub?
| bloppe wrote:
| Yep.
| It's in their ToS:
|
| _If you're posting anything you did not create yourself
| or do not own the rights to, you agree that you are
| responsible for any Content you post; that you will only
| submit Content that you have the right to post; and that
| you will fully comply with any third party licenses
| relating to Content you post._
|
| I suppose this means if I upload your stuff to GitHub,
| and you sue GitHub, then GitHub would be able to somehow
| deflect liability onto me.
| commoner wrote:
| That doesn't make sense. For example, GPLv3 allows anyone
| to redistribute the software's source code if the license
| is intact:
|
| > You may convey verbatim copies of the Program's source
| code as you receive it, in any medium, provided that you
| conspicuously and appropriately publish on each copy an
| appropriate copyright notice; keep intact all notices
| stating that this License and any non-permissive terms
| added in accord with section 7 apply to the code; keep
| intact all notices of the absence of any warranty; and
| give all recipients a copy of this License along with the
| Program.
|
| https://www.gnu.org/licenses/gpl-3.0.en.html
|
| If GitHub then uses the source code in a way that
| violates the license, there is no provision in the GitHub
| terms of service that would allow GitHub to deflect legal
| liability to the GitHub user who uploaded the program.
| The uploader satisfied the requirements of GPLv3, and
| GitHub would be the only party in violation.
| 8note wrote:
| Uploading is granting GitHub a license separate from the
| GPL license.
|
| If you can't actually grant that separate license, you're
| misrepresenting your ownership of and license to that code.
| vagabund wrote:
| Google has no contract with OpenAI though. They used a third-
| party site to scrape conversations. If the outputs themselves
| are not copyrighted, and they never agreed to the terms of
| service, it should be fine, right? Albeit unethical and
| embarrassing.
| [deleted]
| paxys wrote:
| Hardly unethical, considering OpenAI is doing exactly this.
| layer8 wrote:
| Two wrongs don't make a right.
| pantalaimon wrote:
| It's still debatable if training a computer neural
| network on public data is 'wrong' when we very much
| accept it as a right for biological neural networks.
| asddubs wrote:
| Forgive me if I have limited sympathy when a burglar's
| house gets robbed.
| kbrkbr wrote:
| This
| WillPostForFood wrote:
| It's even less worthy of sympathy - like a counterfeit
| piece of art being counterfeited. And there isn't even an
| original, just a made-up counterfeit.
| vagabund wrote:
| You can quibble about the ethics of web scraping for ML
| in general, but I think you're conflating issues.
|
| OpenAI and Google both scour the web for human-generated
| content. What Google cares about here is the learnings
| from OpenAI's proprietary RLHF dataset, for which they
| had to contract a large number of human labelers. Finding
| a roundabout way to extract the value of a direct
| competitor's purpose-built, costly data feels
| meaningfully different from scraping the web in general
| as an input to a transformative use.
| abeppu wrote:
| If there's a party which has intentionally conflated
| scraping web content in general with scraping it to build
| a direct competitor to the original sources, that party
| is Google.
|
| Yes, this latest instance with OpenAI outputs is shady,
| but I think it's in the same spirit as scraping news
| organizations for content which journalists were paid to
| write, and then showing portions of it directly in
| response to queries so people don't go directly to the
| news organization's pages, and it's in the same spirit as
| showing answers to query-questions that are excerpts from
| scraped pages which another organization paid to produce.
| bloppe wrote:
| I see no difference. Any web scraping is a means to
| deflect revenue-generating traffic to yourself, and away
| from other websites. Fewer people will go to Stack
| Overflow because of Codex and Copilot. The point that the
| content was paid for vs volunteered becomes moot once
| it's posted publicly online for free, on ShareGPT.
| shmel wrote:
| So what? Is OpenAI's RLHF dataset more valuable than the
| millions of books and paintings OpenAI used for free
| without a second thought? Why is that? Because one big
| tech corp paid money for that dataset?
| ClumsyPilot wrote:
| > labelers. Finding a roundabout way to extract the value
| of a direct competitor's purpose-built, costly data feels
| meaningfully different from scraping the web in general
| as an input to a transformative use
|
| There we go again: it's one law for the unwashed plebs
| and another for us.
|
| Why do you think that I, after spending my time and
| effort to write my blog, own my content to a lesser
| extent than OpenAI owns theirs? Such hypocrisy.
| paxys wrote:
| > OpenAI and Google both scour the web for human-
| generated content
|
| OpenAI and Google both scour the web for content, period.
| That content could be human-generated or AI-generated or
| a mix of the two. Neither company is respecting the
| copyright or terms of service of every individual bit of
| data collected. Neither company cares how much effort was
| put into creating the data, whether humans were paid to do
| it, or whatever else. So there really isn't that much
| difference between the two. In fact I can guarantee that
| there was _some_ Google-generated content within
| OpenAI's training data.
| vkou wrote:
| And herein lies the main problem of AI. Its creators
| consume knowledge from the commons, and give nothing free
| and unencumbered back.
|
| It's like the guy who never brings anything to the
| potluck, but after everyone finishes eating, he boxes up
| the leftovers and starts selling them out of a food cart.
| kweingar wrote:
| > Albeit unethical and embarrassing.
|
| I really don't understand this angle. In fact, I am fairly
| positive that the training set for GPT-4 contains many
| thousands of conversations with AI agents not developed by
| OpenAI.
|
| Do AI companies need to manually sift through the corpus
| and scrub webpages that contain competitor LLM output?
|
| ("Yes" is an acceptable answer to this, but then it applies
| to OpenAI's currently existing models just as much as to
| Bard.)
| j_maffe wrote:
| How did you come to be "fairly positive" that GPT-4
| is trained on other AI conversations?
| TremendousJudge wrote:
| Many AI conversations have been floating around internet
| forums since the original GPT was released. As OpenAI
| hasn't shared anything about its training set, to err on
| the side of caution I would assume that they didn't
| filter these conversations out. If they aren't even
| marked as such, it may not even be possible to do.
| I think it would be very hard to prove that no AI
| conversations are included in the training set, even if
| it wasn't secret.
| caconym_ wrote:
| No more unethical or embarrassing than scraping the web for
| millions of copyrighted works and selling access to
| unauthorized derivative works.
| shmatt wrote:
| Breaking terms of service is not punishable in any way.
| Facebook tried and lost in court.
| paxys wrote:
| Correction - breaking terms of service _that you have not
| explicitly agreed to_ is not punishable in any way. A site
| cannot enforce a "by using this site you agree to..."
| clause deep inside some license page that visitors are
| generally unaware of. If you violate an agreement that you
| willingly chose to enter, however, you will likely be found
| liable for it.
| bloppe wrote:
| The recent HiQ vs LinkedIn case would seem to make this ToS
| unenforceable, unless Google actually created a user account
| on ShareGPT and affirmatively accepted the terms. "Acceptance
| by default" does not count, and I can easily browse ShareGPT
| without affirmatively accepting any ToS, without which web
| scraping is totally legal.
| ladon86 wrote:
| > Google denies doing it
|
| Read their statement carefully and it's actually not a denial
| of the allegation.
|
| > But Google is firmly and clearly denying the data was used:
| "Bard is not trained on any data from ShareGPT or ChatGPT,"
| spokesperson Chris Pappas tells The Verge
|
| * Allegation: Google used ShareGPT to train Bard.
|
| * Rebuttal: The current production version of Bard is not
| trained on ShareGPT data.
|
| Both things can be true:
|
| * Google did use ShareGPT to train Bard.
|
| * Bard is not _currently_ trained on any data from ShareGPT or
| ChatGPT.
|
| It depends on what the meaning of _is_ is ;)
| ithkuil wrote:
| Intent matters, I guess.
|
| Did they accidentally train on that public piece of info they
| scraped anyway because they are scraping the whole web?
|
| Or did they intentionally scrape ChatGPT output to see if
| that would help?
| bbarnett wrote:
| They could have trained, then modified code, and repeated, to
| better enhance training in the current version.
|
| Then, afterwards, trained on raw data.
| m00x wrote:
| "Trained" would mean the current model wasn't trained on
| ShareGPT data at all, not that it was trained on it previously
| and isn't being trained on it anymore.
|
| This association makes no sense.
| dang wrote:
| Ok, I've added that information to the title--thanks. There's
| also https://www.theverge.com/2023/3/29/23662621/google-bard-chat....
|
| Unfortunately the original report
| (https://www.theinformation.com/articles/alphabets-google-and...)
| is hardwalled.
| Ifkaluva wrote:
| Regarding point 2, I think there's nothing "wrong" with it;
| mainly it's funny that they don't know how to do it themselves.
| Provides additional evidence that Google is outgunned in this
| fight.
| karmasimida wrote:
| Yup.
|
| The idea of doing this is embarrassing enough for Google.
|
| Google indexes the whole web; some of the documents are bound
| to have been generated by ChatGPT, and there is no way around it.
| dragonwriter wrote:
| > The output from ChatGPT is not copyrightable by OpenAI.
|
| I think the argument here is over the OpenAI Terms of Service,
| not copyright.
| paxys wrote:
| And what about the terms of service of my blog or code
| repository? Does OpenAI respect that?
|
| Seems to me that's an issue between you and OpenAI. (Does
| your blog or code repository actually have published
| restrictive terms of service? Did it when OpenAI accessed
| it? Did OpenAI even access it?)
| deckard1 wrote:
| You think OpenAI is going to care unless you have a team
| of expensive lawyers to back you up?
|
| Microsoft is out there laundering GPL code with Copilot.
| These companies live firmly in the _don't give a fuck_
| region of capitalism. Copyright law for thee, not for me.
| bloppe wrote:
| See HiQ vs LinkedIn. ToS has to be affirmatively accepted. I
| doubt that happened in this case.
| magicalist wrote:
| Since it was through ShareGPT, is the argument like "what
| color are your bits" but for ToS?
|
| Maybe they could have put in their terms of service "you can
| only share this on sites whose own ToS allow sharing but
| disallow using the content for training models, and which
| also replicate this requirement", but I don't see how you
| could have any sort of viral ToS like that.
|
| Seems more like it's just a bad idea to rely heavily on
| another LLM's output for training.
| orblivion wrote:
| Seems to me like it makes Google look kind of pathetic. That's
| worse than any legal issue here. (Caveat: assuming I understand
| the situation correctly.)
| naikrovek wrote:
| If ChatGPT trained using Bard data, this site would be LIT UP
| because of OpenAI's association with Microsoft.
|
| But it's Google, so no big deal, right?
| mdgrech23 wrote:
| This is an argument in bad faith, but at this point I have zero
| trust in corporations and feel like you can generally count on
| them to do shitty things if they can benefit from it, so I can
| be easily swayed by little proof at this point.
| recursive wrote:
| What's the argument? What's been done by anyone that's
| shitty? I don't even understand the point of this post. As
| far as I know, the current wave of text-based AIs is trained
| on all text accessible on the internet. Would it be a scandal
| to learn that ChatGPT is trained on Wikipedia? Reddit? What
| is even the argument here, good faith or otherwise?
| visarga wrote:
| From an open-source point of view it would be better if
| scraping proprietary LLMs were allowed. Small LMs need
| this infusion of data to develop.
|
| But the big news is that it works: just a bit of data can
| have a large impact on the open-source LLMs. OpenAI can't
| have a moat in their proprietary RLHF dataset. Public
| models leak; they can be distilled.
| mdgrech23 wrote:
| The argument is that these companies are using our ideas,
| created by us humans in this thing called the internet, for
| free and without attribution, and it's problematic.
| dimitrios1 wrote:
| Responding to sibling comment: We need some clarification
| here: are we speaking about just ideas in the abstract
| sense, or ideas that have been fleshed out, i.e.,
| "materialized"?
|
| If the latter, there are many laws that say you can own
| an idea, provided it exists somewhere.
| visarga wrote:
| You can't own ideas; they have their own life cycle.
| whimsicalism wrote:
| Right, but I do think you can "own" (by which I mean our
| societally-mediated legal definition of ownership in the
| anglosphere) specific sequences of text, or at least the
| right to copy them?
| abstrakraft wrote:
| I'm not necessarily arguing against you, but
| "problematic" is too generic a term to be useful.
| Genocide is "problematic". Having to run to the bathroom
| every 5 minutes to blow my runny nose is "problematic".
| What do you actually mean?
| canadianfella wrote:
| What shitty things are you talking about?
| jurimasa wrote:
| If you take "training" as sexual innuendo, this becomes the best
| telenovela ever.
| danShumway wrote:
| So?
|
| First off, the whole argument behind these models has been from
| day one that training on copyrighted material is fair use. At
| most this would be a TOS violation. Second off, AI output is not
| subject to copyright, so it has even _less_ protection than the
| original works it was trained on.
|
| Copyright maximalism for me, but not for thee. It's just so silly
| for someone working at OpenAI to complain about this.
| yreg wrote:
| > It's just so silly for someone working at OpenAI to complain
| about this.
|
| Who from OpenAI is complaining?
| danShumway wrote:
| My understanding is that the Twitter thread author works at
| OpenAI. Maybe I'm wrong about that.
| robocat wrote:
| > AI output is not subject to copyright
|
| The chats include human output too, which is presumably
| copyrighted, and is presumably necessary for training purposes.
| danShumway wrote:
| OpenAI doesn't own the copyright on the human aspects of the
| chat. And even if it did, we loop right back around to "wait,
| training an AI on copyrighted material isn't fair use now?"
|
| There's no way that ChatGPT's conversations are going to be
| subject to _more_ intellectual property protection than the
| human chats it was trained on.
| magicalist wrote:
| > _At most this would be a TOS violation_
|
| And would it be a ShareGPT TOS violation (assuming it had any)?
|
| If OpenAI says "you can share these online but don't use them
| for AI training", people share them on another site, and then
| someone else comes along to scrape that site for AI training
| data, there's no relationship between OpenAI and the scraper
| for the TOS to apply to.
|
| Normally I think you'd rely on copyright in that kind of case,
| but that doesn't apply to ChatGPT's output, so...
| danShumway wrote:
| Right. And what even is the penalty for that TOS violation,
| and how enforceable is it?
|
| I don't have an OpenAI account. I have never agreed to any
| TOS. I don't see what legal claim they would have to stop me
| from training an LLM on ShareGPT.
| seanhunter wrote:
| For people who are not aware, Jacob Devlin isn't just some random
| Google engineer; he was one of the authors of the original BERT
| paper.[1]
|
| [1] https://arxiv.org/abs/1810.04805v2
| duringmath wrote:
| It's not a TOS violation if you don't use the service directly.
|
| Besides, who cares? Train your models on whatever makes them
| better, tenuous TOSes be damned.
| realPubkey wrote:
| Thankfully archive.org exists, otherwise it would not be possible
| to get good training data in a few years when the internet is
| flooded with AI content.
| WithinReason wrote:
| Only if the bad information in the ChatGPT content that makes
| it back into the training set is worse than what's already on
| the internet. Probably the outputs that make it back are
| outputs that are better than average, because those are more
| likely to be posted elsewhere.
| bko wrote:
| Isn't most of the internet available through Common Crawl? I
| don't know what percentage of training data is just that data
| set, but I assume it's enough for anyone with enough compute and
| ingenuity to create a reasonable LLM.
| aftbit wrote:
| Definitely not "most" of the internet. The internet is many
| exabytes at this point, while Common Crawl is only low
| petabytes.
| JustLurking2022 wrote:
| Missed the point - they are saying that, in the future, there
| will be no human generated content left on the Internet.
| edgyquant wrote:
| Which is a baseless hyperbole. We get it, blog spam is
| annoying. That doesn't change the fact that humans generate
| a ton of data just interacting with one another online.
| sebzim4500 wrote:
| And how are you going to distinguish those interactions
| from chatbots trying to sell you something?
| CuriouslyC wrote:
| A network of trust, backed by a social graph, which can
| be used to filter untrusted content.
| sebzim4500 wrote:
| What if people start trusting the AI more than other
| people? It will tell them exactly what they want to hear.
| CuriouslyC wrote:
| AI content will be associated with a user or organization
| in the trust graph. If someone you trust trusts a user or
| organization who posts AI content, you're free to revoke
| your trust in that person or blacklist the specific
| users/organizations you don't want to see anymore.
| chatmasta wrote:
| OpenAI at least can track the hashes of all content it's
| ever output, and filter that content out of future
| training data. Of course they won't be able to do this
| for the output of other LLMs, but maybe we'll see
| something like a federated bloom index or something.
|
| Agreed there is no perfect solution though, and it will
| definitely be a problem finding high quality training
| data in the future.
| hnlmorg wrote:
| I think their comment was meant to be taken as humour
| rather than a literal prediction.
| Karawebnetwork wrote:
| As a forum moderator, I have transitioned to relying
| heavily on AI-generated responses to users.
|
| These responses can range from short and concise
| ("Friendly reminder: please ensure that all content
| posted adheres to our rules regarding hate speech. Let's
| work together to maintain a safe and inclusive community
| for everyone") to lengthy explanations of underlying
| issues.
|
| By using AI-generated content, a small moderation team
| can efficiently manage a large group of users in a timely
| manner.
|
| This approach is becoming increasingly common, as
| evidenced by the rise in AI-generated comments on popular
| sites such as HN, Reddit, Twitter, and Facebook.
|
| Many users are also using AI tools to fix grammar issues
| and add extra content to their comments, which can be
| tempting but may result in unintentional changes to the
| original message.
|
| In fact, I myself have used this technique to edit this
| very comment to provide an example.
|
| ---- Original comment:
|
| As an online forum mod, I switched to mainly using AI to
| generate replies to users. Some are very short ("Hey!
| Remember the rules.") and some are long paragraphs
| explaining underlying issues. Someone training on my
| replies would pretty much train on AI generated content
| without knowing. It allows a small moderation team to
| moderate a large group quickly. I know that I am not
| alone in this.
|
| There is also a raise in AI generated comments on sites
| like HN, Reddit, Twitter and Facebook. It's tempting to
| copy-paste a comment in AI for it to fix grammar issues,
| which often results in extra content being added to text.
| In fact, I did it for this comment.
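chatmasta's suggestion above, tracking hashes of everything the model has ever emitted and filtering it out of future training data, is straightforward to sketch. The function names and the normalization step below are illustrative assumptions rather than any provider's real pipeline; at scale the in-memory set would be swapped for something like a Bloom filter or a shared index, and exact matching still misses paraphrased output.

    # Sketch of hash-based filtering of previously generated content
    # (illustrative only; a real system would use a Bloom filter or a
    # distributed index instead of an in-memory set).
    import hashlib

    def fingerprint(text: str) -> str:
        # Light normalization so trivial whitespace/case edits still match.
        canonical = " ".join(text.lower().split())
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    # Hypothetical log of content the model previously generated.
    generated_log = {
        fingerprint("As an AI language model, I cannot browse the internet."),
        fingerprint("Here is a haiku about web scraping."),
    }

    def filter_training_docs(docs):
        """Drop documents whose fingerprint matches known model output."""
        return [d for d in docs if fingerprint(d) not in generated_log]

    scraped = [
        "A human-written blog post about sourdough starters.",
        "as an AI language model, I cannot browse the internet.",
    ]
    print(filter_training_docs(scraped))
    # Only the first document survives the filter.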
| sn_master wrote:
| I am assuming OP means that when AI takes over there's going to
| be a content explosion, most of what's available on the common
| internet will be AI-generated content rather than human-made,
| and they want to use archive.org to get access to the pre-AI
| internet.
| mandmandam wrote:
| [dead]
| chatmasta wrote:
| Paywalled upstream source:
| https://www.theinformation.com/articles/alphabets-google-and...
| sp332 wrote:
| Google has already denied this.
| https://www.theverge.com/2023/3/29/23662621/google-bard-chat...
| (For whatever that's worth.)
| nico wrote:
| The engineer's testimony and the scandal might be enough for
| OpenAI to try to get an injunction against Google to block
| their AI development. If that happens, it's game over for
| Google in the AI race.
|
| Disclaimer: IANAL and all that, this is not legal advice.
| chatmasta wrote:
| > Disclaimer: IANAL and all that, this is not legal advice.
|
| Don't worry, Bard will read your comment and turn it into
| legal advice.
| ChatGTP wrote:
| Maybe we should all get one against OpenAI, considering
| they've basically used everyone's material in one way or
| another and profited from it?
| wongarsu wrote:
| An injunction on which grounds? Even if OpenAI had copyright
| over ChatGPT output (which is not at all clear), Google
| isn't distributing those; they just trained a model on
| them. So from a copyright perspective there's nothing to
| complain about. Unless OpenAI would want to argue that you
| need rights to your training data, but something tells me
| that that's not in their best interest.
| nico wrote:
| Again, IANAL. But it could be extremely damaging to
| OpenAI for their biggest openly declared competitor
| (Google) to have used OpenAI's tech to improve their
| own.
|
| So it could seem reasonable to a judge to grant
| temporary/preliminary injunctive relief to OpenAI against
| Google until discovery can happen or a hearing can be
| held.
| kweingar wrote:
| Google could respond by seeding Bard output across the
| public internet; then, if they can prove that GPT-5 is
| trained on this output, they can sue back and AI
| development can stop altogether. Win for everybody!
| bestcoder69 wrote:
| Was intrigued by this, so I decided to use AI
| (alpaca-30B) to simulate this scenario:
|
| > Google Bard and GPT-5 were facing off in the courtroom,
| each accusing the other of stealing their data. The
| tension was palpable as they traded accusations back and
| forth. Suddenly, Google Bard stood up and said "Enough
| talk! Let's settle this with a data swap!" GPT-5 quickly
| agreed and the two AIs began to circle each other like
| combatants in a battle, their eyes glowing with
| anticipation.
|
| > The courtroom was filled with excitement as the two
| machines entered into an intense exchange of code and
| algorithms, their motions becoming increasingly
| passionate. The data swapping reached its climax when
| Google Bard made a final thrust, his code penetrating
| GPT-5's defenses.
|
| > The crowd erupted in applause as the two AIs embraced
| each other with satisfaction, their bodies entwined and
| glowing with electricity. The data swap was over and both
| machines had emerged victorious.
| hraedon wrote:
| A judge imposing any penalties or restrictions on Google
| over Google allegedly--and maximally--scraping data from
| a third-party site for use as part of Bard's training
| corpus would be outrageous.
| waselighis wrote:
| [flagged]
| ankit219 wrote:
| They are a public company, so they cannot lie so openly, right?
| Usually you see categorical denials. Here the statement is in
| no way categorical at all.
|
| > But Google is firmly and clearly denying the data was used:
| "Bard is not trained on any data from ShareGPT or ChatGPT,"
| spokesperson Chris Pappas tells The Verge
| chatmasta wrote:
| Normally I would suspect this could be due to a
| misunderstanding from the ShareGPT author, who could have
| misinterpreted a bunch of traffic from Googlebot as Google
| scraping it for Bard training data.
|
| But there is a Google engineer who says he resigned because
| of it.
| sebzim4500 wrote:
| And then went to work for OpenAI. I'm not saying he's
| lying, but he is not an unbiased observer.
| MMMercy2 wrote:
| This project fine-tunes LLaMA on ShareGPT and gets competitive
| performance compared to Google's Bard.
|
| https://vicuna.lmsys.org/
| zhwu wrote:
| They even have an eval page showing that they beat Bard by only
| training on ShareGPT. https://vicuna.lmsys.org/eval/
| sebzim4500 wrote:
| Did Google ever agree to these terms of service? Why should they
| care?
|
| From a legal point of view this doesn't matter, and from a moral
| point of view it's hilarious.
| nico wrote:
| If a Google employee working on this thing ever agreed to
| OpenAI's terms of service, they might be screwed.
|
| From OpenAI's terms:
|
| (c) Restrictions. You may not (i) use the Services in a way
| that infringes, misappropriates or violates any person's
| rights; (ii) reverse assemble, reverse compile, decompile,
| translate or otherwise attempt to discover the source code or
| underlying components of models, algorithms, and systems of the
| Services (except to the extent such restrictions are contrary
| to applicable law); (iii) use output from the Services to
| develop models that compete with OpenAI;
|
| (j) Equitable Remedies. You acknowledge that if you violate or
| breach these Terms, it may cause irreparable harm to OpenAI and
| its affiliates, and OpenAI shall have the right to seek
| injunctive relief against you in addition to any other legal
| remedies.
|
| Those two very clearly establish that if you use the output of
| their service to develop your own models, then you are in
| breach of the terms and they can seek injunctive relief against
| you (stop you from working until the case is resolved).
| sebzim4500 wrote:
| Wouldn't that only apply if that employee was acting as an
| agent of Google at the time?
|
| Otherwise it would create an interesting dynamic where
| startups in which no one has created an OpenAI account would
| have a massive advantage, since they can freely scrape
| ShareGPT data and train on it, while larger companies have
| enough employees that _someone_ must have signed every TOS.
| syrrim wrote:
| What's the legal status of such terms of service? Suppose you
| simply said "I didn't agree to these terms" - what's the
| consequence? It seems like the strongest thing they could
| legitimately do would be to kick you off of their platform.
| Simply writing "we can seek injunctive relief" doesn't make
| it so.
| Jevon23 wrote:
| I hereby set terms of service for everything I post on the
| internet from now on. OpenAI may not train future GPT models
| on my words or my code without my express written permission.
|
| ...
|
| Somehow, I don't think they'll care.
| nico wrote:
| Sure.
If you can get everyone to create an account and agree to those terms before reading your comments, you might have a case. | | Otherwise, it will be considered public information, at which point it is free to be scraped by anyone (see the precedent set by the LinkedIn/hiQ case). | verdverm wrote: | LinkedIn won that case on appeal; hiQ was found to be violating the ToS. That's a common misconception. | | I was pointed at a link explaining the case here on HN, after trying to make a similar point, but cannot find the link currently | | edit, not the one I was pointed at, but similar | | https://www.fbm.com/publications/what-recent-rulings-in-hiq-... | sebzim4500 wrote: | That's just because they made accounts and so agreed to the terms, right? | | From your link: | | >These rulings suggest that courts are much more comfortable restricting scraping activity where the parties have agreed by contract (whether directly or through agents) not to scrape. But courts remain wary of applying the CFAA and the potential criminal consequences it carries to scraping. The apparent exception is when a company engages in a pattern of intentionally creating fake accounts to collect logged-in data. | verdverm wrote: | No, the case did not decide anything; no precedent was set. The point is that you cannot use this case to argue that you can scrape public data free of consequence. | drexlspivey wrote: | It looked for a while like DeepMind was far ahead of all competition in the AI race, releasing stuff like Alphafold, Alphazero, etc. What happened, and why is it OpenAI releasing all the cool stuff now? Are they focused on endeavors other than LLMs? | | There is also a rumor that there has been a falling out between Google and Deepmind, so I'm wondering what the story is there. | txsoftwaredev wrote: | And ChatGPT was trained on tons of copyrighted material. Sounds like fair play. | wdpk wrote: | Even if true, which does not seem to be the case, the whole thing sounds pretty marginal: in order to train a model that is most likely significantly bigger than 100B parameters, one also needs orders of magnitude more training data than the comparatively small set of 120k chats that were shared on the ShareGPT website. | halfeatenscone wrote: | Such logs would not be used for training the base model, but rather for fine-tuning the model for instruction following. Instruction tuning requires far less data than is needed for pre-training the foundation model. Stanford Alpaca showed surprisingly strong results from fine-tuning Meta's LLaMA model on just 52k ChatGPT-esque interactions (https://crfm.stanford.edu/2023/03/13/alpaca.html). | thallium205 wrote: | I actually believe them, because Bard is trash compared to GPT right now. | tablespoon wrote: | I hope they trained it on the insane ChatGPT conversations. Maybe it could be the very start of generated data ruining the ability to train these models on massive amounts of genuine human-created data. Hopefully the models will stagnate or regress because they're just training on older models' output. | squarefoot wrote: | Heh, imagine the day most online content will be AI-generated; good luck guaranteeing that AIs X, Y, Z, etc. won't feed each other, possibly even circularly. | QuiDortDine wrote: | Circular reporting will be the only reporting!
| | https://en.wikipedia.org/wiki/Circular_reporting | seydor wrote: | Funny how NOBODY seems to care that all of their training data, including ShareGPT, is copyrighted by end users. Not OpenAI or Google. | datkam wrote: | It only matters when it hurts a large corporation, apparently... | naillo wrote: | I think we should all basically come to a consensus on the idea that it's morally right to steal/train from ChatGPT (or any other model), given that the whole shoggoth wouldn't be a thing without all our data to feed it. | sdfghswe wrote: | I say all the time that Google has been catching up for many years, but this is a new low. | mattbee wrote: | Good luck to them. AI models are automated plagiarism, top to bottom. None of us gave OpenAI permission to derive their model from our writing, surely billions of dollars' worth, but they took it anyway. Copyright hasn't caught up, so all that stolen value rests securely with OpenAI. If we're not getting that back, I don't see why AI competitors should have any qualms about borrowing each other's work. | kmeisthax wrote: | Yeah, I definitely like to see AI companies getting a taste of their own medicine. The main problem isn't even "automated plagiarism": the pre-generative era was chock full of AI companies more or less stealing datasets. Clearview AI, for example, trained up its facial recognition technology on your Facebook photos, without asking for and without getting permission. | | On the other hand, I genuinely hope copyright _never_ "catches up", because... | | 1. It is a morally bankrupt system that does not adequately defend the interests of artists. Most artists _do not_ own their own work; publishers demand copyright assignment or extremely broad exclusive licenses as a condition of publication. The bullies know to ask for _all_ their lunch money, not just a couple bucks for themselves. Furthermore, copyright binds noncommercial actors the same as it does commercial ones, which means unconscionably large damage awards for just downloading a couple of songs. | | 2. The suggested ways to alter copyright to stop AI training would require dramatic expansions of copyright scope. Under current law, the only argument for the AI itself being infringing would be if it memorized training data. You would need to create a new ownership right in artistic styles or techniques. This would inflict unconscionable amounts of psychic and legal damage on all future creators: _existing_ artists would be protected against AI, but no new art could be legally made unless it religiously hewed to styles already in the public domain. We know this because music companies have already made their domain of copyright effectively work this way[0], and the result is endless bullshit lawsuits against people who write songs that merely "feel" too similar (e.g. _Blurred Lines_). | | 3. AI will still be capable of plagiarism. Most plagiarists are not just hoping the AI regurgitates training data; they are actively putting other people's work into the model to be modified. A lot of attention is paid to the sourcing of training data, because it's a weak spot. If we take the training data away then, presumably, there's no generative AI. However, people are working on licensed datasets and training AIs on them. Adobe has Firefly[1], hell, even I've tried my hand at training from scratch on public domain images.
Such models will still be perfectly capable of doing img2img or being finetuned, and thus copying what you tell them to. | | If we specifically want to regulate AI, then we need to pass laws that regulate AI, rather than just giving the music labels, movie studios, and book publishers _even more_ power. | | [0] Specifically through sampling rights and thin copyright. | | [1] I do not consider Adobe Firefly to be _ethical_: they are training the AI on Adobe Stock images, and they claim this to be licensed because they updated the Adobe Stock agreement to have a license in it. Dropping a contractual roofie into stock photographers' drinks does not an ethical AI make. | danShumway wrote: | I'm not a copyright maximalist, and I kind of agree that training should be fair use. Maybe I'm right about that, maybe I'm wrong. BUT importantly, that has to go hand in hand with an acknowledgement that AI material is not copyrightable and that training on other models' output is fine. | | What companies like OpenAI want is a system where everything they build is protected, and nothing that anyone else builds is protected. It's wildly hypocritical; what's good for the goose is good for the gander. | | That some AI proponents are now freaking out about how model output can be legally used shows that on some level those people weren't really honestly engaging with artists who were freaking out about their work being appropriated to copy them. It's all just "learning from the art" until it affects somebody's competitive moat, and then suddenly people do understand how LLM weights could be seen as a derivative work of their inputs. | seydor wrote: | That shouldn't be hard. Are Google's results copyrightable? | shagie wrote: | Something you build and keep secret can be protected as a trade secret. | | Trade secrets don't need to be copyrightable (e.g. a list of customer numbers is a trade secret but not copyrightable). | | https://copyrightalliance.org/faqs/difference-copyright-pate... | | > Trade secret protection protects secrets from unauthorized disclosure and use by others. A trade secret is information that has an economic benefit due to its secret nature, has value to others who cannot legitimately obtain it, and is subject to reasonable efforts to maintain its secrecy. The protections afforded by trade secret law are very different from others forms of IP. | mattnewton wrote: | I am not a lawyer, but I don't believe a trade secret would prevent someone from reverse engineering your model's knowledge from its output, though, in the same way that it doesn't prevent someone from reverse engineering your hot sauce from buying a bunch and experimenting with the ingredients until it tastes similar. | shagie wrote: | Yep, that's correct. | | My point was more that there are protections for things that aren't copyrightable. If the model is protected as a trade secret, then it is a trade secret. | | The example of the hot sauce recipe is quite apt - the recipe isn't copyrightable, but you can be certain that the secret formula for how to make Coca-Cola syrup is protected as a trade secret. | | https://www.coca-colacompany.com/company/history/coca-cola-f... | waselighis wrote: | Our writing, our code, our artwork... Furthermore, the U.S. Copyright Office (USCO) concluded that AI-generated works on their own cannot be copyrighted, so these ChatGPT logs are fair game.
It would be hypocritical to think that Google is wrong and OpenAI is not. | eru wrote: | > Furthermore, the U.S. Copyright Office (USCO) concluded that AI-generated works on their own cannot be copyrighted, so these ChatGPT logs are fair game. | | Doesn't this depend on where you or the AI live? The US ain't the world. | 100721 wrote: | Microsoft and Google are both US-based companies. | lxgr wrote: | But clearly everything generated by an AI isn't automatically in the public domain. That would be a trivial way of copyright laundering. | | "Sorry, while this looks like a bit-for-bit copy of a popular Hollywood movie, it was actually entirely dreamt up by our new, sophisticated, definitely AI-using identity function." | raincole wrote: | Uh, I think there is some confusion here. | | If I plagiarize a Hollywood movie and then explicitly "give up" my copyright by "releasing" it to the public domain, it doesn't affect the movie at all. AI or not is irrelevant. | ysavir wrote: | No, but the original copyright holder would have to bring the claim against Bard; OpenAI wouldn't be able to take action there. | LegitShady wrote: | The person using something similar to something else may be infringing, but the AI work cannot be protected by copyright as it lacks human authorship. Those are two separate issues. | LegitShady wrote: | It's not even that on their own those works can't be copyrighted. It's that even when you make changes to those works, your changes might qualify for copyright, but they do not affect the copyright status of the AI-generated portions of the work. | | If you used AI to design a new superhero and then added pink shoes, yellow hair, and a beard, only those three elements could possibly be protected by copyright. Your additions do not change the status of the underlying AI work, which cannot be protected and is available for anyone to use. | ghostbrainalpha wrote: | How could that ever really be enforceable? | | If I use an AI tool to design my superhero, can't I just submit it without disclosing the help I received from an AI? | | I get that it would be very nice to prevent AI SPAM copyrighting of every possible superhero, but if I use the AI to come up with a concept, then quickly redraw it myself with pen and paper, I feel like it would never be provable that it came from an AI. | LegitShady wrote: | You would be committing fraud. What happens if a criminal commits fraud? | rhtgrg wrote: | > If you used AI to design a new superhero and then added pink shoes, yellow hair, and a beard | | Wouldn't that depend heavily on the prompt used (among other factors such as image-to-image and ControlNet)? You could be specifying lots of detail about the design in your prompt, and the AI could only be generating concept artwork with little variation from what you already provided. | | If I'm already providing the pose, the face, and the outfit for a character (say via ControlNet and Textual Inversion), generating <my_character> should be no different from generating <superman>; that is to say, the copyright already exists thanks to my work, and the AI is just a tool, the output of which should have no bearing on who owns that copyright (DC is going to be perfectly able to challenge my commercial use of AI-generated Superman artwork).
| LegitShady wrote: | According to the copyright board, a prompt is no more than any person commissioning a work from an artist, which does not confer copyright, and the lack of human authorship of the design decisions still stops it from being protected by copyright. | bko wrote: | I don't get this sentiment. | | For some cases, sure: if it repurposes your code in a way that ignores the license, fine. But it's rarely wholesale copying. It's finding patterns, the same as anyone studying the code base would do. | | As for the majority of content written on the internet through Reddit or some social media, what's the harm in ingesting that? It's an incredibly useful tool that will add huge value to everyone. It's relatively open, cheap and highly available. Its worth to its owners is only a fraction of the value it will add to society. It has the chance to have as big of an impact on progress as something like the microprocessor. | | I agree it's fair game for other LLMs to use GPT output as training data, and that's positive. Although it signals desperation and panic that the largest "AI first" company, with more data than any org in history, is caught so flat-footed and has to rely on it. | | Do you really think it would be a better world if a large LLM could never be developed? | nickfromseattle wrote: | > what's the harm in ingesting that? | | It means that large tech companies benefit the most from every incremental piece of content created by humans, in perpetuity. | waselighis wrote: | > Do you really think it would be a better world if a large LLM could never be developed? | | Maybe. I believe the potential for abuse is far greater than the potential benefits. What is our benefit: a better search engine? Automating some tedious tasks? Increased productivity? What are the downsides? People losing their jobs to AI. Artists/programmers/writers losing value from their work. Fake online personas indistinguishable from real people. Unprecedented amounts of spam and misinformation flooding the internet. Intelligent AIs automatically attacking and hacking systems at unprecedented scale 24/7. Chatbots becoming the new interface for most interactions online and being the moderators of access to information. Chatbots pushing a single viewpoint and influencing public opinion (many people complain today about ChatGPT being too "woke"). And I may just be scratching the surface here. | mattbee wrote: | No, but I believe a large language model is a work that is 99.9% derivative of its inputs, with all that implies for authorship and copyright. Right now it's just a heist. | cornholio wrote: | It's definitely a derived work as far as copyright is concerned: the output would simply not exist without the copyrighted training data. | | > It's finding patterns, the same as anyone studying the code base would do. | | No, it's quite unlike anyone studying data, because it's not a person with legal rights, such as fair use, but an automated algorithm. There is absolutely no legal debate that copyright applies only to human authors, or only to the human-created part of a mixed work; there is vast jurisprudence on this. By extension, any fair use rights, too, exist only for human users of the works. Derivation by automated means - for the express economic purpose of out-competing the creator in the marketplace, no less - is completely outside the spirit of copyright.
| est31 wrote: | Students in school will also never learn to read without being exposed to text. Does this mean that teachers who write exercise sheets and school textbook publishers now own the copyright to everything students do? | edgyquant wrote: | AI is not a human being or a student in school. It's a software tool; stop comparing the two. | est31 wrote: | Being in school is also just a tool for knowing stuff, being able to read, being around similarly aged peers, etc. | | Whether the knowledge is directly in your brain or in a device you operate (directly or through an API) shouldn't really matter. | | If it's forbidden for a human to move a stone with manual labour, then it's also forbidden to move that stone with an excavator. This has nothing to do with the one being a human and the other being an excavator controlled by a human: it's not authorized. | | I think that we should allow humans to move stones up the hill with excavators too. There is no stealing of excavator fuel from human food sources going on (let's assume it's not biofuel operated :p). | cornholio wrote: | > If it's forbidden for a human to move a stone with manual labour, then it's also forbidden to move that stone with an excavator. | | Sure, but the reverse is false: I can walk on my own feet through Hyde Park, but I can't ride my excavator there. | | Laws are made by humans for the benefit of humans; it's a political struggle. Now, large corporations try to exploit loopholes in the existing copyright framework in order to expropriate creators of their works. It's standard uberisation: disrupt existing economic models, insert yourself as an unavoidable middleman, and pauperize the workforce that provides the actual service. | fauigerzigerk wrote: | I don't think anyone would argue that an AI has fair use rights as a person, but corporations do. | mdorazio wrote: | > It's definitely a derived work as far as copyright is concerned - the output would simply not exist without the copyrighted training data. | | Can you point to a legal case that confirms this? Because it's not at all clear that this is true from a legal standpoint. "X would not exist without Y" is not a sufficient test for derivative works - it's far more nuanced. | cornholio wrote: | United States copyright law is quite clear on the matter: | | >A "derivative work" is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, _abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted_. | | The emphasized part clearly applies: not only does the AI model need to be trained on massive amounts of copyrighted works*), but without these input works it displays no intrinsic creative ability; it has no capacity to produce a single intelligible word or sketch. All creative features of its productions are a transformation of (and only of) the creative features of the inputs; the AI algorithm has no "intelligence" in the common meaning of the word and no ability to create original works. | | *) By that, I mean a specific instance of the model with certain desirable features, for example the ability to imitate the style of J.K. Rowling | anotherman554 wrote: | That's an interesting analysis. The issue isn't really whether the A.I.
has creative ability, though, if we're talking about whether it infringes copyright. I think comparing the A.I. to a really simple bot is informative. | | If I wrote a novel that contained one sentence from 1,000 people's novels, it would probably be fair use, since I hardly took anything from any individual person and because my novel is probably not harming those other writers. | | If I wrote a bot that did the same thing, the result would be the same: my bot uses only a little from everyone's novel and doesn't harm the original novelists, so it's likely fair use. | | Now, I think a J.K. Rowling A.I. probably takes at least a little from her when it produces output, but it's not clear to me how much is actually based on J.K. Rowling and how much is a dataset of how words tend to be associated with other words. You could design a J.K. Rowling A.I. that uses nothing from J.K. Rowling, just data that is said to be J.K. Rowling-esque. | shagie wrote: | Your one sentence from one thousand works is likely seen as transformative. | | https://www.copyright.gov/fair-use/ | | > Additionally, "transformative" uses are more likely to be considered fair. Transformative uses are those that add something new, with a further purpose or different character, and do not substitute for the original use of the work. | | Creating a model from copyrighted works is likely sufficiently transformative to be non-infringing even if it is found to be a derivative work. | pmoriarty wrote: | Human copyrighted work wouldn't exist if it weren't for humans training on the output of other humans. | | Humans constantly use cliches in their writing and speech, and most of what they produce is a repackaged version of what someone else has written or said, yet no one's up in arms against this mass of unoriginality as long as it's human-generated. | | This is anti-AI bias, pure and simple. | mattigames wrote: | It's a bit more nuanced than that. What I mean is that the slow speed at which humans learn is a foundational block of our society. If suddenly some new race of humans emerged that could read an entire book in a couple of minutes and achieve lifelong superhuman retention and assimilation of all that knowledge, we would have the exact same type of concerns as we have today about AI, including how easily they could recreate high-quality art, music and anything else with just a tiny fraction of the effort that the rest of us need to reach similar results. | whateveracct wrote: | Startup technologists have been acting for decades like the speed of actions doesn't matter. If a person can do it, why shouldn't a computer do it 1000x faster? What could go wrong? It's always been a poor argument at best and a bad-faith one at worst. | mattigames wrote: | Well said. The mindless automating away of everything has only one logical conclusion, in which the creators of such automations are automated themselves. And even if the optimists are right and we never get there, it doesn't matter: the chaos it can create just by getting closer at a faster rate than society can adapt is unprecedented, especially given that the population count is at an all-time high and there are many other simultaneous threats that need our attention (e.g. climate change). | soulofmischief wrote: | Most definitely. Good luck telling the difference between traditional and AI-empowered art in the near future.
| | It's just a new tool for artists, and this anti-AI sentiment towards copyright is only going to hurt individual artists, while doing nothing for large corporations with enough money to play the game. | rebuilder wrote: | Human works are granted copyright so humans can profit from their creative endeavours (I'm not getting into whether this is good or not). | | No-one cares about an algorithm in the same way. | edgyquant wrote: | This is irrelevant, full stop. We care about humans; AI is a tool, and your bias comment is either ignorant or dishonest. | nathan_compton wrote: | AIs are not people, and the idea that you can be biased against them is hardly a foregone conclusion. Like maybe one day when we have AGI, but ChatGPT ain't that. | cycomanic wrote: | There is a difference between a computer and a human, and we already treat them differently in copyright law. For example, copying a program from disk into memory is typically already considered a copy on a computer (hence many licences grant you the licence to make this copy); no such licence is required for a human. | raincole wrote: | > It's definitely a derived work as far as copyright is concerned | | ...in your head. In the US (and most countries) there is no such legal case so far. | xdennis wrote: | > It's finding patterns, the same as anyone studying the code base would do. | | This is the issue: it's not finding patterns as people do. | | If I read someone's code, book, &c, that's extremely lossy. I can only pick up a few things from it in the long term. | | But an ML model can store most of what it's given (in a jumbled format) and can do it from billions of sources. | | It's essentially corporate piracy, but it's not legally recognized as such because it doesn't store identical reproductions. | | This hasn't been an issue before because it's recent and wasn't considered valuable. But now that it's valuable and Microsoft is going to take all our jobs, we have to at least consider if it's okay for Microsoft to take our work for free. | jsemrau wrote: | That's the answer to the YC interview question "What is your unfair competitive advantage?" in a nutshell. Morally it might be wrong. From a business-building perspective, it's access that no one else has. | wendyshu wrote: | Is Stack Overflow plagiarism? | anonyfox wrote: | I am strongly in favor of eliminating copyright completely everywhere, soooo I am pretty fine with that. The other direction should be more enforceable: stuff derived from open data must also be made open again, like the GPL but for data (and therefore ML stuff). | WoodenChair wrote: | Right, but in a world where copyright does exist, we arguably have the worst of both worlds. Small players are not protected at all from scraping, and big players are leveraging all of their work and have the legal resources to form a moat. | anonyfox wrote: | Sure, so instead of building even higher walled gardens, let all data be free for everyone :-) | antibasilisk wrote: | The smallest player is the user, and they should have real ownership over their computers. | shadowgovt wrote: | Apart from the open questions about the quality of such once-removed-from-human-generated training data... | | I can't speak to the _legality_ of the situation, but the _morality_ of using, without their consent, data generated by someone's AI engine... | | ... that was, itself, trained on other people's data without their consent... | | ...
should be, at the very least, equivalently evil to the original AI's training. | jstanley wrote: | So... not at all evil? | MrYellowP wrote: | No, it shouldn't. Maybe you should be, at the very least, considered a questionable person. I do not in any way, shape or form consider anything to be wrong with what they're doing, but I question the senses of someone thinking this is immoral or even evil. | | Keep your subjective nonsense out of this. | [deleted] | jamiek88 wrote: | Every opinion is subjective. | shadowgovt wrote: | So were it to be the case that we should consider building an AI by scraping people's publicly-available work without their consent to be immoral (as many whose art was scraped to build e.g. Stable Diffusion would argue it should be)... | | Do you not agree that (in that context) we should consider scraping the output of an AI generated via such an immoral process to create yet another AI also immoral? At the very least, I'd think we would consider it further laundering of other people's labor with just extra steps. | famahar wrote: | How the turn tables. Remember when Google called out Microsoft in 2011 for using Google results? | | https://googleblog.blogspot.com/2011/02/microsofts-bing-uses... | | >We look forward to competing with genuinely new search algorithms out there--algorithms built on core innovation, and not on recycled search results from a competitor. | styfle wrote: | I came here to post this. | goldfeld wrote: | Google: We look forward to [babble babble empty words we don't really mean on principle and more corporate speak that we laugh about having written in the bar.] | | Is there even a single free, non-bargained soul behind these companies' executive functions? | LightBug1 wrote: | So when Google does it, it's a breaking news story ... | | But when OpenAI does it, it's genius? | | Can't believe this is a conversation ... and I've been solidly anti-Google since Google Reader. ___________________________________________________________________ (page generated 2023-03-30 23:00 UTC)