hngopher.com

       [HN Gopher] Google denies training Bard on ChatGPT chats from Sh...
       ___________________________________________________________________
        
       Google denies training Bard on ChatGPT chats from ShareGPT
        
       Author : chatmasta
       Score  : 363 points
       Date   : 2023-03-30 11:16 UTC (11 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | mupuff1234 wrote:
       | This just in, web indexing company scrapes web for data.
        
       | cpeterso wrote:
       | Regardless of whether this happened or not, would training Bard
       | on ChatGPT output be good or bad for Bard's product quality? I
       | imagine there's a risk of AIs recursively reinforcing bad data in
       | their models. This problem seems unavoidable as more web content
       | becomes AI-generated content and spam.
        
       | ankit219 wrote:
       | According to the article, the story goes this way: This engineer
       | Jacob Devlin raised his concerns on training Bard with ShareGPT
       | data. Then he directly joined OpenAI.
       | 
       | He also claims that Google were about to do it, and then they
       | stopped after his warnings. And presumably removed every trace of
       | openai's responses.
       | 
       | A couple of things:
       | 
       | 1. So, Bard could have been trained on ShareGPT but it's not -
       | according to the same engineer who raised the concern (and google
       | denial in the verge).
       | 
       | 2. Since he directly joined OpenAI, he could have told them and
       | they could have taken action, and nothing is public on that front
       | yet. Probably nothing to see here.
       | 
       | Edit: The engineer too wasnt directly involved with the Bard
       | team, it appeared to him that Bard team was heavily relying on
       | ShareGPT.
        
         | binarymax wrote:
         | For those that don't know, Jacob Devlin was the lead engineer
         | and first publisher of the widely popular BERT model
         | architecture, and initial bert-base models released by Google.
         | 
         | https://www.semanticscholar.org/author/Jacob-Devlin/39172707
        
         | [deleted]
        
         | whimsicalism wrote:
         | Your comment doesn't make sense to me.
         | 
         | > Bard team was heavily relying on ShareGPT.
         | 
         | > He also claims that Google were about to do it, and then they
         | stopped after his warnings.
         | 
         | So were they heavily relying or were they about to and then
         | stopped? It's unclear from your comment. Could you link where
         | you're getting this info from? The Information article is
         | walled, unfortunately.
        
           | ankit219 wrote:
           | [1] gives a jist as well.
           | 
           | What I meant to say was that: Acc to The Information article
           | the engineer raised concerns because it appeared to him
           | (article wording) Bard team was using (and heavily reliant
           | on) ShareGPT for Bard training. The engineer wasnt working on
           | Bard and presumably someone told him or somehow he got the
           | impression that Bard team was reliant on ShareGPT. At the
           | time he was at Google.
           | 
           | Then, when he raised concerns to Sundar Pichai, Bard team
           | stopped doing it and also scrapped any traces of ShareGPT
           | data. So, the headline is false and Bard (again presumably)
           | is not trained on any of ShareGPT data.
           | 
           | [1]: https://www.theverge.com/2023/3/29/23662621/google-bard-
           | chat...
        
             | whimsicalism wrote:
             | I think I might be confused by your usage of "about to do
             | it" in your original comment to mean "actively doing it."
             | 
             | You claim that the very engineer accusing Google of
             | training Bard on ShareGPT acknowledges that the final
             | product was not. As far as I can tell, Devlin did no such
             | thing.
             | 
             | Not sure why you would presume they restarted their
             | expensive training process.
             | 
             | It just doesn't seem like a good faith characterization to
             | me.
        
         | rgbrenner wrote:
         | Take what action? Pretty sure that's not illegal, especially
         | since the training data is ai generated and therefore can't be
         | copyrighted.
        
           | chatmasta wrote:
           | I think the oomph behind the story is due to it being
           | embarrassing, rather than illegal.
        
           | dahfizz wrote:
           | OpenAI could have blocked Google's accounts, for example.
           | Nothing really to do with legality.
        
             | sebzim4500 wrote:
             | No one is alleging that Google directly used OpenAI's API
             | to get training data (which would be unambiguously against
             | TOS). The claim is that they downloaded examples from
             | ShareGPT.
        
           | frozenlettuce wrote:
           | Not illegal, but that won't stop people from finding it
           | amusing that the company considered the world's beacon of
           | innovation is copying someone else's homework. It's hard
           | being the favorite horse.
        
             | dvngnt_ wrote:
             | tech companies steal ideas all the time. snapchat invented
             | stories and now whatsapp, facebook, instagram, tiktok,
             | youtube have them
        
       | shmerl wrote:
       | Well, ChatGPT itself was trained on something else, so how is
       | Bard any worse. AIs copying each other is only natural to expect.
        
       | ChatGTP wrote:
       | I couldn't be happier, keep up the good work. Steal away just as
       | OpenAI I have done.
        
       | visarga wrote:
       | This could actually be a good way to sidestep the training set
       | copyright and access right issues. Copyright protection should
       | solely encompass the expression of human generated content and
       | not the underlying concepts.
       | 
       | By training model B using the results generated by model A, the
       | copyright of corpus_A (OpenAI RLHF dataset) remains safeguarded,
       | as model B is never directly exposed to corpus_A, preventing it
       | from duplicating the content verbatim.
       | 
       | This process only transmits the concepts originating from
       | corpus_A, which represents universal knowledge that cannot be
       | claimed by any individual party.
        
       | burakemir wrote:
       | "... as a joke."
        
       | dathinab wrote:
       | People complained that new AI is "stealing" from artists.
       | 
       | But stealing from other AI turns out to often be easier.
       | 
       | And this is where things get fun, because companies like OpenAI
       | want to be able to train on all the data without any explicit
       | permissions from the creators, but the moment people do the same
       | to them they likely (we will see) be very much against it.
       | 
       | So it will be interesting if they will be able to both have and
       | eat the cake (e.g. by using Microsoft lobby to push absurd law)
       | or will they fall apart due to cannibalization making it non
       | profitable to create better AI.
       | 
       | EDIT: This comment isn't specific to Google/Bert, so it doesn't
       | matter weather Google actually did so or weather not.
        
         | commoner wrote:
         | I can see the GitHub Copilot controversy being resolved in this
         | way. If Microsoft, GitHub, and OpenAI successfully use the fair
         | use defense for Copilot's appropriation of proprietary and
         | incompatibly licensed code, then a free and open source
         | alternative to Copilot can be trained on Copilot's outputs.
         | 
         | After all, the GitHub Copilot Product Specific Terms say:
         | 
         | > 2. Ownership of Suggestions and Your Code
         | 
         | > GitHub does not claim any ownership rights in Suggestions.
         | You retain ownership of Your Code.
         | 
         | https://github.com/customer-terms/github-copilot-product-spe...
        
       | century19 wrote:
       | Google accused Microsoft Bing of using them for page rankings a
       | few years ago. Setup a sting to show that when you searched for
       | something unique on Google using MS Explorer, shortly afterwards
       | the same search result would start showing up on Bing.
       | 
       | This was seen as deeply embarrassing for Microsoft at the time.
        
         | godzillabrennus wrote:
         | The deeply embarrassing period at Microsoft began and ended
         | when Ballmer ran the show. The Bing results saga was the
         | hangover.
        
         | blisterpeanuts wrote:
         | Embarrassing, maybe, but imitation is the sincerest form of
         | flattery.
        
           | int_19h wrote:
           | Indeed, which is why the biggest impact this revelation is
           | likely to have (if proven true) is on Google's stock.
        
       | brucethemoose2 wrote:
       | This is also bad because the risk of AI "inbreeding" is real. I
       | have seen invisible artifact amplification happen in a single
       | generation training ESRGAN on itself.
       | 
       | Maybe it wont happen in a single LLM generation, but perhaps gen
       | 3 or 5 will start having really weird speech patterns or
       | hallucinations because of this.
        
         | sebzim4500 wrote:
         | Worst case scenario they just start only training on pre-2020
         | data and then finetuning on a dataset which they somehow know
         | to be 'clean'.
         | 
         | In practice though I doubt that AI contamination is actually a
         | problem. Otherwise how would e.g. AlphaZero work so well (which
         | is effectively _only_ trained on its own data).
        
           | whimsicalism wrote:
           | The parallels with AlphaZero are not so easy.
           | 
           | The problem is you need some sort of arbiter of who has "won"
           | a conversation but if the arbiter is just another transformer
           | emitting a score, the models will compete to match the
           | incomplete picture of reasoning given by the arbiter.
        
           | brucethemoose2 wrote:
           | It could degrade the model in a way that avoids the metrics
           | they use for gauging quality.
           | 
           | The distortions that showed up in ESRGAN (for instance) didnt
           | seem to effect the SSIM or anything (and in fact it was
           | training with MS SSIM loss), but the "noise splotches" and
           | "swirlies" as I call them were noticable in some of the
           | output, but you have to go back and look _really_ hard at the
           | initial dataset to spot what it was picking up. Sometimes,
           | even after cleaning, it felt like what it was picking up on
           | was completely invisible.
           | 
           | TLDR Google may not even notice the inbreeding until its
           | already a large issue, and they may be reluctant to scrap so
           | much work on the model.
        
       | gigel82 wrote:
       | Where are all those people that kept saying Google had an amazing
       | model way beyond ChatGPT internally for years? Those comments
       | always kept coming up in ChatGPT posts; maybe they'll stop now.
        
       | Imnimo wrote:
       | I don't care at all about this from a copyright or data ownership
       | perspective, but I am a little skeptical that it's a good idea to
       | be this incestuous with training data in the long run. It's one
       | thing to do fine tuning or knowledge distillation for specialized
       | domains or shrinking models. But if you're trying to train your
       | own foundation model, is relying on output from other foundation
       | models going to make them learn to imitate their own errors?
        
         | sdenton4 wrote:
         | Things like ShareGPT or PromptHero give vast repositories of
         | human-curated ML outputs, which make them fantastic for at
         | least incremental improvement on the base model. In the grand
         | scheme of things, these will be just another style, mixed in
         | with all the other crap in the training set, so I don't imagine
         | it's too harmful... eg, 'paint starry night in the style of
         | midjourney 5'
        
         | berkle4455 wrote:
         | Where are any LLMs going to get data from as they become more
         | ubiquitous and humans produce less publicly accessible original
         | and thoughtful content?
         | 
         | The whole thing is a plateaued feedback loop.
        
           | TillE wrote:
           | It'd be cool to have an LLM that's trained almost exclusively
           | on books from good publishers, and other select sources.
           | Working out licensing deals would be a challenge, of course.
        
             | whimsicalism wrote:
             | Corpora is likely too small. It would just be an "LM"
        
           | whimsicalism wrote:
           | Probably from multiple modalities as well as extending the
           | sequence lookback length further and further.
           | 
           | They have low perplexity now, but the perplexity possible
           | when predicting the next word on page 365 of a book where you
           | can attend over the last 364 pages will allow even more
           | complexity to emerge.
        
         | whimsicalism wrote:
         | But Bard isn't a foundation model?
         | 
         | Clearly this data has value as some sort of RLHF finetuning
         | dataset. Honestly they probably used it for negative examples.
        
       | kleiba wrote:
       | Hard to believe that is true, or else Bard would probably not
       | perform so bad.
        
         | waselighis wrote:
         | Google only has a fraction of the training data. OpenAI had a
         | huge head start and has been collecting training data for years
         | now. ChatGPT is also wildly popular which has given them tons
         | more training data. It's estimated that ChatGPT gained over 100
         | million users in the first two months alone, and may have over
         | 13 million active users daily.
         | 
         | The logs on ShareGPT are merely a drop in the bucket.
        
           | rocmcd wrote:
           | > Google only has a fraction of the training data.
           | 
           | Uh, what? The same Google that has been crawling, indexing,
           | and letting people search the entire Internet for the last 25
           | years? They have owned DeepMind for nearly twice as long as
           | OpenAI has been in existence!
           | 
           | If anything this is proof that no one at Google can get
           | anything done anymore, and lack of training data ain't the
           | problem.
        
             | mirker wrote:
             | The alignment portion of training requires you to have
             | upvote/downvote data on many LLM responses. Google's
             | attempt at that (at least according to the news so far) was
             | asking all employees to volunteer time ranking the
             | responses. Combined with no historical feedback from
             | ChatGPT, they are behind.
        
         | duringmath wrote:
         | Bard is only a week old and has a large "experimental" sticker
         | on it. Besides its UI is better and the answers are succinct
         | which I prefer.
        
           | bastardoperator wrote:
           | They literally copied the Chatgpt UI, lol, only it looks like
           | a dated Google UI. How do you prefer answers with less
           | data?... that's crazy.
        
             | dvngnt_ wrote:
             | doing a visual diff will show you it's not a literal copy
        
               | bastardoperator wrote:
               | I'm talking design, not code, lol...
        
             | duringmath wrote:
             | I just don't want to be hit with a wall of text every
             | single time, it gets the point across with minimal padding
             | (high signal to noise ratio), ChatGPT feels like it gets
             | paid by the word and they do actually charge by token if
             | you use the API.
             | 
             | As for the UI it's a take on the tried and true chat UI
             | same as ChatGPT's, it spits the whole answer at once
             | instead of feeding it to you one word at a time, it has an
             | alternative drafts button, the Google it button is a nice
             | touch and it feels quicker.
        
               | bastardoperator wrote:
               | You can combat that in the prompt, I use "just code, no
               | words" which will also remove code comments from output.
               | Bard doesn't respect the same request. You can be more
               | succinct with chatgpt. Half the things I ask for in Bard
               | give me this:
               | 
               | "I'm still learning coding skills, so at the moment I
               | can't help with this. I'm trained to do things like help
               | you write lists about different topics, compare things,
               | or build travel itineraries. Do you want to try any of
               | those now?"
        
               | duringmath wrote:
               | Longer instructions? Which part of "less is more" do you
               | not understand?
        
               | bastardoperator wrote:
               | What part of succinct do you not understand? Bard
               | provides a bunch of useless text too, only you can't get
               | rid of it. No worries, you don't know how to use chatgpt,
               | have fun with Bard until Google cancels it.
        
         | karmasimida wrote:
         | Yeah, Bard's replies aren't nothing like that from ChatGPT.
         | 
         | I wonder is it possible to use ChatGPT for competitor analysis?
         | 
         | If the responses are not used in the final training data I
         | don't see how this is being something controversial
         | 
         | Also if Google's compliance team can't even do, as recognizing
         | this level of legal risk, even if there are probably an army of
         | top paid lawyers they hired, I don't know what to say. Maybe
         | they should fall then.
        
       | m00x wrote:
       | ITT armchair lawyers LARPing.
        
       | croes wrote:
       | Would 112k conversations make a huge difference in the model?
        
         | int_19h wrote:
         | For fine-tuning, yes, absolutely.
        
       | social_quotient wrote:
       | It's interesting when we say Google did this. It's actually and
       | likely some people that work for Google and are on this forum did
       | this. Knowingly, not by accident while slurping up the rest of
       | the internet, and got paid to do it. I wonder what the engineer
       | view on this was/is. I have to assume they ballpark know the
       | terms of the openai data (regardless if you disagree or not).
       | 
       | Anyone care to steel man the argument for why this was a good
       | idea?
        
         | hackerlight wrote:
         | > Anyone care to steel man the argument for why this was a good
         | idea?
         | 
         | I don't see a big difference between this and training it on
         | people's code and art which also happens without explicit
         | permission.
        
         | Nimitz14 wrote:
         | I don't understand why it's a bad idea? Did openai ask for
         | permission for using the data it uses (no)?
        
       | seanhunter wrote:
       | "What's sauce for the goose is sauce for the gander" as the legal
       | cliche goes. OpenAI cannot on the one hand claim that google did
       | something wrong if they used their outputs as part of the bard
       | training while simultaneously on the other hand claiming they
       | themselves are free to use everyone on the internets content to
       | train their model.
       | 
       | Either they believe that training should respect copyright (in
       | which case they could not do what they do) or they believe that
       | training is fair use (in which case they cannot possibly object
       | to Google doing the same as them).
        
         | az226 wrote:
         | A big whoosh here. OpenAI is fair use because an LLM is
         | transformative from the content it gathered. Bard is literally
         | the same product as ChatGPT, so it is not transformative at
         | all. Tell me you know nothing about copyright without telling
         | me you know nothing about copyright.
        
           | cornholio wrote:
           | That's nonsensical. An AI is either transformative or it's
           | not, it's an intrinsic quality that has nothing to do with
           | the training data or the "product" type. If OpenAI is
           | sufficiently transformative to claim fair use (which I don't
           | believe for a second, alas), then any other AI built on
           | similar fundamentals has the same claims and can crunch any
           | data their creators see fit, including the output of other
           | AIs.
        
         | sebzim4500 wrote:
         | No one is alleging copyright violations. The claim is that they
         | violated OpenAI's terms of service. We don't know whether
         | Google ever even agreed to those terms of service in the first
         | place.
        
           | seanhunter wrote:
           | Are OpenAI saying they have adhered to the terms of service
           | of all the content they have used?
        
             | dragonwriter wrote:
             | _Content_ is not subject to terms of _service_.
             | 
             |  _Services_ are subject to terms of service. (If content is
             | received through a service, the terms of service may govern
             | use of it, but that's not a feature of the content, but the
             | acquisition route.)
        
               | deckard1 wrote:
               | Terms of Service, Terms and Conditions, and Terms of Use
               | are all the same thing. There is no legal difference
               | between them.
               | 
               | > that's not a feature of the content, but the
               | acquisition route.
               | 
               | It's neither. It's a feature of contract law.
        
               | danShumway wrote:
               | ShareGPT isn't part of that service though. Yes, it would
               | be a TOS violation if Google directly used ChatGPT to
               | generate transcripts -- but not even the original Twitter
               | thread is claiming that.
               | 
               | The only claim being made against Google here is that
               | they used ChatGPT _content_. I can 't find any sources
               | claiming that Google made use of an OpenAI service. So
               | the distinction is correct, but doesn't seem particularly
               | valuable in this context -- using data from ShareGPT is
               | not a TOS violation.
        
       | ar9av wrote:
       | I love that OpenAI uses a ton of other peoples work to train
       | their model, yet when someone uses OpenAI to train their model,
       | they get all up in arms.
       | 
       | As far as I'm concerned, OpenAI has decided terms of use don't
       | exist anymore.
        
         | jug wrote:
         | OpenAI is training on data that is against their terms of use?
         | That reads like a serious allegation. What is this all about?
        
           | cycomanic wrote:
           | OpenAI is training on copyrighted data without a licence. I
           | would argue copyright law has much stronger legal standing
           | than some ToS.
           | 
           | Now OpenAI is arguing their training is fair use, but that
           | has certainly not been legally established so far and could
           | just as much be used as a defence against ToS violation.
           | 
           | So in short yes OpenAI is pretty much doing the same thing.
        
         | modernpink wrote:
         | Where are they up in arms?
        
       | paxys wrote:
       | 1. Google denies doing it, so at the very least the title should
       | have an "allegedly".
       | 
       | 2. Even if they did - so what? The output from ChatGPT is not
       | copyrightable by OpenAI. In fact it is OpenAI that is training
       | its models on copyrighted data, pictures, code from all over the
       | internet.
        
         | manojlds wrote:
         | But remember many years back when it was news that Bing used
         | Google search results to improve its results.
        
           | magicalist wrote:
           | It's not quite the same thing, because Bing was getting the
           | data from a browser toolbar and watching the search terms
           | used and where the user went afterwards.
           | 
           | A closer equivalent would be if someone had made a ShareSERP
           | site and people posted their favorite search terms and the
           | results Google gave and Bing crawled that and incorporated
           | the search terms to links connections into their search
           | graph.
           | 
           | The actual actions had _maybe_ gone too far (personally I
           | thought it was more funny than  "copying"), the hypothetical
           | would be pretty much what you'd expect to happen. Even google
           | would probably crawl ShareSERP and inadvertently reinforce
           | their own results (the same way OpenAI presumably gets more
           | than a bit of their own results back at them in any new
           | crawls of reddit, hn, etc even if they avoid sites like
           | ShareGPT deliberately).
        
           | cma wrote:
           | > Google catches Bing copying [search results], Microsoft
           | says "so what?"
           | 
           | https://arstechnica.com/information-
           | technology/2011/02/googl...
        
         | Jimmc414 wrote:
         | >Even if they did - so what?
         | 
         | Amplification of biases, propagation of errors, echolalia and
         | over-optimization, lack of diverse data, overfitting
        
           | funkyjazz wrote:
           | Not to mention it's embarrassing. Google playing second
           | banana to OpenAI.
        
             | nicehill wrote:
             | I think Amazon was first in the (free) banana business
        
               | jrirhfifj wrote:
               | you joke, but first producy they changed on whole foods
               | were the bananas.
               | 
               | before: organic (south america) and regular (central ou
               | SEA) for 69, 59.
               | 
               | then: both chikita's brand with regular and organic
               | stickers (clearly the same produce, always from SEA) for
               | 49 and 39 cents.
               | 
               | thats was days after the announcement
        
               | bbarnett wrote:
               | Did you inadvertently reverse to regular/organic order,
               | or was organic cheaper after?
        
             | prepend wrote:
             | Google's been second banana to openai for a few years now,
             | right?
        
             | ithkuil wrote:
             | That assumes that training on the output of another
             | language model somehow gives you the ability to improve
             | your model and to catch up somehow
        
               | iandanforth wrote:
               | It does. In general this is known as teacher-student
               | training or knowledge distillation. It works better if
               | you have access to the activations of the model but you
               | can work with just outputs as well.
        
               | satvikpendem wrote:
               | Well, it does, that's how we got Alpaca from LLaMA.
        
           | jrirhfifj wrote:
           | you talk like chatgpt was some bastion of curated perfectly
           | correct content. get a grip. web scraping is web scraping.
        
           | RosanaAnaDana wrote:
           | I mean maybe. There also might be something to this. OpenAI
           | has been very opaque about training techniques.
        
           | paxys wrote:
           | That's just the base concern with every single model
           | regardless of where they sourced their data from. Garbage in,
           | garbage out.
        
             | educaysean wrote:
             | Sure. Does that fact mean we're prohibited from expressing
             | concerns about data quality? ShareGPT isn't representative
             | of authentic, quality writing.
        
             | Jimmc414 wrote:
             | Right, but training an LLM on the output of another LLM can
             | certainly exacerbate these issues
        
               | paxys wrote:
               | Maybe, but we are fast approaching the point (or more
               | likely have crossed it already) where distinguishing
               | between human and AI generated data isn't really
               | possible. If Google indexes a blog, how does it know
               | whether it was written with AI assistance and therefore
               | should not be used for training? Heck, how does OpenAI
               | itself prevent such a feedback loop from its own output
               | (or that of other LLMs)?
        
               | madeofpalk wrote:
               | > If Google indexes a blog, how does it know whether it
               | was written with AI assistance and therefore should not
               | be used for training
               | 
               | Yes, this is an existential problem for Google and
               | training future LLMs.
               | 
               | See also, https://www.theverge.com/23642073/best-
               | printer-2023-brother-... and
               | https://searchengineland.com/verge-best-
               | printer-2023-394709
        
               | abduhl wrote:
               | Your argument would have a lot more force if we were past
               | that point rather than fast approaching that point.
               | Concerns about training data errors being compounded are
               | much more important when you're talking about the
               | bleeding edge.
               | 
               | And your question about how OpenAI prevents their
               | training data from being corrupted is one we should be
               | asking as well!
        
               | rightbyte wrote:
               | > Heck, how does OpenAI itself prevent such a feedback
               | loop from its own output (or that or other LLMs)?
               | 
               | Seems trivial. Only use old data for the bulk? Feed some
               | new data carefully curated?
        
               | toxik wrote:
               | Future job: token selector / archiving
        
               | notahacker wrote:
               | <meta name="generator" content="human brain">
               | 
               | I'm only half joking.... I think we likely will end up
               | with flags for human generated/curated content (and it
               | will have to be that way round, as I can't imagine
               | spammers bothering to put flags on AI-generated stuff),
               | and we probably already _should_ have an equivalent of
               | robots.txt protocol that allows users to specify which
               | parts of their website they would and wouldn 't like used
               | in the training of LLMs.
        
               | jfk13 wrote:
               | If content with a "human-generated" flag is rated more
               | highly in some way -- e.g. search results -- then _of
               | course_ spammers will automatically add that flag to
               | their AI-generated garbage. How do you propose to prevent
               | them?
        
               | notahacker wrote:
               | I assume, like the actual meta generator tags, it
               | wouldn't actually be a massive boon for regular search
               | results
        
               | shubhamkrm wrote:
               | Reminds me of the old "evil bit" RFC[1]
               | 
               | [1] https://www.ietf.org/rfc/rfc3514.txt
        
         | KRAKRISMOTT wrote:
         | OpenAI Terms of service forbid training competitor models via
         | their ML outputs (LoRa alpaca laundering is probably not
         | allowed for commercial use).
        
           | worldofmatthew wrote:
           | Are the TOS even enforceable is AI content can't be
           | copyrighted?
        
           | space_fountain wrote:
           | Where exactly does it do that? I looked a bit and could t
           | find it, but likely I was just wrong
        
           | short_sells_poo wrote:
           | I love it how they don't want others to use their model
           | output but they have no qualms about training their model on
           | the copyrighted works of others? Isn't this a stunning level
           | of hypocrisy?
        
           | Certhas wrote:
           | This is really hilarious. Authors and artists never gave
           | permission to use their work to train AI models either...
           | 
           | Not legally the same situation, but ethically close enough.
        
           | saurik wrote:
           | So, to verify, are you claiming that if someone added a
           | similar clause to their source code and then GitHub went
           | ahead and trained Copilot against it, that would be an issue?
        
             | bloppe wrote:
             | You relinquish all licensing rights when you upload your
             | code to GitHub. Microsoft can do whatever they want with
             | it. That's in their ToS, which you have to agree to when
             | you make an account. Normally, only affirmatively accepted
             | ToS are enforceable, so just putting a clause into your
             | license doesn't work (unless it's a copyright, which
             | doesn't require consent).
        
               | flir wrote:
               | > You relinquish all licensing rights when you upload
               | your code to GitHub
               | 
               | What now? Seriously?
               | 
               | I found this. Section D4.
               | 
               | "We need the legal right to do things like host Your
               | Content, publish it, and share it. You grant us and our
               | legal successors the right to store, archive, parse, and
               | display Your Content, and make incidental copies, as
               | necessary to provide the Service, including improving the
               | Service over time. This license includes the right to do
               | things like copy it to our database and make backups;
               | show it to you and other users; parse it into a search
               | index or otherwise analyze it on our servers; share it
               | with other users; and perform it, in case Your Content is
               | something like music or video."
               | 
               | "as necessary to provide the Service" seems critical.
        
               | bloppe wrote:
               | "Improving the service over time" can do a lot of heavy
               | lifting, definitely including training Copilot.
        
               | commoner wrote:
               | Also, section D3 of the GitHub Terms of Service says:
               | 
               | > You retain ownership of and responsibility for Your
               | Content.
               | 
               | and section D4 says:
               | 
               | > This license does not grant GitHub the right to sell
               | Your Content. It also does not grant GitHub the right to
               | otherwise distribute or use Your Content outside of our
               | provision of the Service, except that as part of the
               | right to archive Your Content, GitHub may permit our
               | partners to store and archive Your Content in public
               | repositories in connection with the GitHub Arctic Code
               | Vault and GitHub Archive Program.
               | 
               | There is nothing in the terms that requires the GitHub
               | user to relinquish all licensing rights.
               | 
               | https://docs.github.com/en/site-policy/github-
               | terms/github-t...
        
               | bloppe wrote:
               | The clauses always have a trap door: "[outside of] our
               | provision of the Service" means they can do anything as
               | long as it's a service they provide.
               | 
               | Under definitions: _The "Service" refers to the
               | applications, software, products, and services provided
               | by GitHub, including any Beta Previews._
        
               | commoner wrote:
               | I think there's a misunderstanding over what the word
               | "relinquish" means.
               | 
               | The terms make clear that uploading code to GitHub gives
               | GitHub the right to "store, archive, parse, and display
               | Your Content, and make incidental copies, as necessary to
               | provide the Service, including improving the Service over
               | time" while the code is hosted on GitHub.
               | 
               | However, that's not the same thing as relinquishing
               | (giving up) licensing rights to GitHub. The uploader
               | still retains those rights, and there is nothing in the
               | terms that says otherwise.
        
               | gcr wrote:
               | The question turns on whether you consider copilot part
               | of the "GitHub service."
               | 
               | GitHub would argue that it is, and they'd likely argue
               | that charging for access to copilot is akin to charging
               | for access to private repositories.
               | 
               | Others would say that copilot is somehow separate from
               | the services Github provides, so using their code for
               | CoPilot wouldn't be covered by the ToS.
        
               | bloppe wrote:
               | It is certainly a service that's being provided. If not
               | by GitHub, then by whom?
               | 
               | I'll repeat the definition of service: _The "Service"
               | refers to the applications, software, products, and
               | services provided by GitHub, including any Beta
               | Previews._
        
               | cycomanic wrote:
               | So do you believe if you hosted a closed source project
               | on GitHub, and GitHub decided they want to integrate this
               | into their service they would simply be allowed to take
               | the code?
               | 
               | Fortunately HN commenters are not judges. And I would
               | wager any bet that MS lawyers would not try to argue
               | based on their ToS either, that would be a recipe for
               | loosing any court case.
        
               | bloppe wrote:
               | I just mean that it doesn't really matter what your
               | license says as long as GitHub can come up with a
               | business justification for using it in some way.
               | Certainly, other users still legally have to obey your
               | copyright.
        
               | saurik wrote:
               | So, to verify, are you claiming it would not be allowed
               | for _you_ to upload _my_ otherwise-open-source code (code
               | I do not myself host at GitHub, but which was reasonably
               | popular  / important code) to GitHub?
        
               | bloppe wrote:
               | Yep. It's in their ToS:
               | 
               |  _If you 're posting anything you did not create yourself
               | or do not own the rights to, you agree that you are
               | responsible for any Content you post; that you will only
               | submit Content that you have the right to post; and that
               | you will fully comply with any third party licenses
               | relating to Content you post._
               | 
               | I suppose this means if I upload your stuff to GitHub,
               | and you sue GitHub, then GitHub would be able to somehow
               | deflect liability onto me.
        
               | commoner wrote:
               | That doesn't make sense. For example, GPLv3 allows anyone
               | to redistribute the software's source code if the license
               | is intact:
               | 
               | > You may convey verbatim copies of the Program's source
               | code as you receive it, in any medium, provided that you
               | conspicuously and appropriately publish on each copy an
               | appropriate copyright notice; keep intact all notices
               | stating that this License and any non-permissive terms
               | added in accord with section 7 apply to the code; keep
               | intact all notices of the absence of any warranty; and
               | give all recipients a copy of this License along with the
               | Program.
               | 
               | https://www.gnu.org/licenses/gpl-3.0.en.html
               | 
               | If GitHub then uses the source code in a way that
               | violates the license, there is no provision in the GitHub
               | terms of service that would allow GitHub to deflect legal
               | liability to the GitHub user who uploaded the program.
               | The uploader satisfied the requirements of GPLv3, and
               | GitHub would be the only party in violation.
        
               | 8note wrote:
               | Uploading is granting GitHub a license separate from the
               | gpl license.
               | 
               | If you can't actually grant that separate license, you're
               | misrepresenting your ownership and license to that code
        
           | vagabund wrote:
           | Google has no contract with OpenAI though. They used a third
           | party site to scrape conversations. If the outputs themselves
           | are not copyrighted, and they never agreed to the terms of
           | service, it should be fine, right? Albeit unethical and
           | embarrassing.
        
             | [deleted]
        
             | paxys wrote:
             | Hardly unethical, considering OpenAI is doing exactly this.
        
               | layer8 wrote:
               | Two wrongs don't make a right.
        
               | pantalaimon wrote:
               | It's still debatable if training a computer neutral
               | network on public data is 'wrong' when we very much
               | accept it as a right for biological neural networks.
        
               | asddubs wrote:
               | forgive me if i have limited sympathy when a burglars
               | house gets robbed
        
               | kbrkbr wrote:
               | This
        
               | WillPostForFood wrote:
               | It's even less worthy of sympathy - like a counterfeit
               | piece of art being counterfeited. And there isn't even an
               | original, just like a made up counterfeit.
        
               | vagabund wrote:
               | You can quibble about the ethics of web scraping for ML
               | in general but I think you're conflating issues.
               | 
               | OpenAI and Google both scour the web for human-generated
               | content. What Google cares about here is the learnings
               | from OpenAI's proprietary RLHF dataset, for which they
               | had to contract a large sum of human labelers. Finding a
               | roundabout way to extract the value of a direct
               | competitor's purpose-built, costly data feels
               | meaningfully different from scraping the web in general
               | as an input to a transformative use.
        
               | abeppu wrote:
               | If there's a party which has intentionally conflated
               | scraping web content in general with scraping it to build
               | a direct competitor to the original sources, that party
               | is Google.
               | 
               | Yes, this latest instance with OpenAI outputs is shady,
               | but I think it's in the same spirit as scraping news
               | organizations for content which journalists were paid to
               | write, and then showing portions of it directly in
               | response to queries so people don't go directly to the
               | news organization's pages, and it's in the same spirit as
               | showing answers to query-questions that are excerpts from
               | scraped pages which another organization paid to produce.
        
               | bloppe wrote:
               | I see no difference. Any web scraping is a means to
               | deflect revenue-generating traffic to yourself, and away
               | from other websites. Fewer people will go to Stack
               | Overflow because of Codex and Copilot. The point that the
               | content was paid for vs volunteered becomes moot once
               | it's posted publicly online for free, on ShareGPT.
        
               | shmel wrote:
               | So what? Is OpenAI RLHF dataset more valuable than
               | millions of books and paintings OpenAI used for free
               | without stopping a second? Why is that? Because one big
               | tech corp paid money for that dataset?
        
               | ClumsyPilot wrote:
               | > labelers. Finding a roundabout way to extract the value
               | of a direct competitor's purpose-built, costly data feels
               | meaningfully different from scraping the web in general
               | as an input to a transformative use
               | 
               | There we go again, its, one law for the unwashed plebs
               | and the other for us.
               | 
               | Why do you think that I, after spending my time and
               | effort to write my blog, own my content to a lesser
               | extent that OpenAI does their? Such hypocracy.
        
               | paxys wrote:
               | > OpenAI and Google both scour the web for human-
               | generated content
               | 
               | OpenAI and Google both scour the web for content, period.
               | That content could be human generated or AI generated or
               | a mix of the two. Neither company is respecting copyright
               | or terms of service of every individual bit of data
               | collected. Neither company cares how much effort was put
               | into creating the data, whether humans were paid to do
               | it, or whatever else. So there really isn't that much
               | difference between the two. In fact I can guarantee that
               | there was _some_ Google-generated content within OpenAI
               | 's training data.
        
               | vkou wrote:
               | And herein is the main problem of AI. Its creators
               | consume knowledge from the commons, and give nothing free
               | and unencumbered back.
               | 
               | It's like the guy who never brings anything to the
               | potluck, but after everyone finishes eating, he boxes up
               | the leftovers, and starts selling them out of a food
               | cart.
        
             | kweingar wrote:
             | > Albeit unethical and embarrassing.
             | 
             | I really don't understand this angle. In fact, I am fairly
             | positive that the training set for GPT-4 contains many
             | thousands of conversations with AI agents not developed by
             | OpenAI.
             | 
             | Do AI companies need to manually sift through the corpus
             | and scrub webpages that contain competitor LLM output?
             | 
             | ("Yes" is an acceptable answer to this, but then it applies
             | to OpenAI's currently existing models just as much as to
             | Bard)
        
               | j_maffe wrote:
               | How did you come about being "fairly positive" that GPT-4
               | is trained on other AI conversations?
        
               | TremendousJudge wrote:
               | Many AI conversations have been floating around internet
               | forums since the original GPT was released. As OpenAI
               | hasn't shared anything about its training set, to err on
               | the side of caution I would assume that they didn't
               | filter these conversations out. If they aren't even
               | marked as such, it may not even be possible to do. I
               | think it would be very hard to prove that no AI
               | conversations are included in the training set, even if
               | it wasn't secret.
        
             | caconym_ wrote:
             | No more unethical or embarrassing than scraping the web for
             | millions of copyrighted works and selling access to
             | unauthorized derivative works.
        
           | shmatt wrote:
           | breaking terms of service is not punishable in any way.
           | Facebook tried and lost in court
        
             | paxys wrote:
             | Correction - breaking terms of service _that you have not
             | explicitly agreed to_ is not punishable in any way. A site
             | cannot enforce a  "by using this site you agree to..."
             | clause deep inside some license page that visitors are
             | generally unaware of. If you violate an agreement that you
             | willingly chose to enter, however, you will likely be found
             | liable for it.
        
           | bloppe wrote:
           | The recent HiQ vs LinkedIn case would seem to make this ToS
           | unenforceable, unless Google actually created a user account
           | on ShareGPT and affirmatively accepted the terms. "Acceptance
           | by default" does not count, and I can easily browse ShareGPT
           | without affirmatively accepting any ToS, without which web
           | scraping is totally legal.
        
         | ladon86 wrote:
         | > Google denies doing it
         | 
         | Read their statement carefully and it's actually not a denial
         | of the allegation.
         | 
         | > But Google is firmly and clearly denying the data was used:
         | "Bard is not trained on any data from ShareGPT or ChatGPT,"
         | spokesperson Chris Pappas tells The Verge
         | 
         | * Allegation: Google used ShareGPT to train Bard.
         | 
         | * Rebuttal: The current production version of Bard is not
         | trained on ShareGPT data
         | 
         | Both things can be true:
         | 
         | * Google did use ShareGPT to train Bard
         | 
         | * Bard is not _currently_ trained on any data from ShareGPT or
         | ChatGPT.
         | 
         | It depends on what the meaning of _is_ is ;)
        
           | ithkuil wrote:
           | Intent matters I guess.
           | 
           | Did they accidentally train on that public piece of info they
           | scraped anyway because they are scraping the whole web?
           | 
           | Or did they intentionally scrape chatgpt output to see if
           | that would help?
        
             | bbarnett wrote:
             | They could have trained, then modified code, repeat, to
             | better enhance training in the current version.
             | 
             | Then after, train on raw data.
        
           | m00x wrote:
           | Trained would mean the current model wasn't trained at all
           | from ShareGPT data, not that was trained on it previously,
           | and isn't being trained anymore.
           | 
           | This association makes no sense.
        
         | dang wrote:
         | Ok, I've added that information to the title--thanks. There's
         | also https://www.theverge.com/2023/3/29/23662621/google-bard-
         | chat....
         | 
         | Unfortunately the original report
         | (https://www.theinformation.com/articles/alphabets-google-
         | and...) is hardwalled.
        
         | Ifkaluva wrote:
         | Regarding point 2, I think there's nothing "wrong" with it,
         | mainly it's funny that they don't know how to do it themselves.
         | Provides additional evidence that Google is outgunned in this
         | fight.
        
           | karmasimida wrote:
           | Yup
           | 
           | The idea of doing this is embarrassing enough for Google.
           | 
           | Google index the whole web, some of the documents are due to
           | be generated by ChatGPT, there is no way around it.
        
         | dragonwriter wrote:
         | > The output from ChatGPT is not copyrightable by OpenAI.
         | 
         | I think the argument here is over the OpenAI Terms of Service,
         | not copyright.
        
           | paxys wrote:
           | And what about the terms of service of my blog or code
           | repository? Does OpenAI respect that?
        
             | dragonwriter wrote:
             | > And what about the terms of service of my blog or code
             | repository? Does OpenAI respect that?
             | 
             | Seems to me that's an issue between you and OpenAI. (Does
             | your blog or code repository actually have published
             | restrictive terms of service? Did it when OpenAI accessed
             | it? Did OpenAI even access it?)
        
               | deckard1 wrote:
               | You think OpenAI is going to care unless you have a team
               | of expensive lawyers to back you up?
               | 
               | Microsoft is out there laundering GPL code with Copilot.
               | These companies live firmly in the _don 't give a fuck_
               | region of capitalism. Copyright law for thee, not for me.
        
           | bloppe wrote:
           | See HiQ vs LinkedIn. ToS has to be affirmatively accepted. I
           | doubt that happened in this case.
        
           | magicalist wrote:
           | Since it was through ShareGPT, is the argument like "what
           | color are your bits" but for ToS?
           | 
           | Maybe if they had put in their terms of service "you can only
           | share this on sites with their own ToS that allow sharing but
           | disallow using the content for training models, and also
           | replicate this requirement", I don't see how you could have
           | any sort of viral ToS like that.
           | 
           | Seems more like it's just a bad idea to rely heavily on
           | another LLM's output for training.
        
         | orblivion wrote:
         | Seems to me like it makes Google look kind of pathetic. That's
         | worse than any legal issue here. (Caveat: assuming I understand
         | the situation correctly)
        
         | naikrovek wrote:
         | if ChatGPT trained using Bard data, this site would be LIT UP
         | because of OpenAI's association with Microsoft.
         | 
         | but it's google so no big deal right?
        
         | mdgrech23 wrote:
         | This is an argument in bad faith but at this point I have zero
         | trust in corporations and feel like you can generally count on
         | them to do shitty things if they can benefit from it so I can
         | be easily swayed by little proof at this point.
        
           | recursive wrote:
           | What's the argument? What's been done by anyone that's
           | shitty? I don't even understand the point of this post. As
           | far as I know, the current wave of text-based AIs is trained
           | on all text accessible on the internet. Would it be a scandal
           | to learn that ChatGPT is trained on wikipedia? Reddit? What
           | is even the argument here, good faith or otherwise?
        
             | visarga wrote:
             | From an open source point of view it would be better if
             | scraping proprietary LLMs would be allowed. Small LMs need
             | this infusion of data to develop.
             | 
             | But the big news is that it works, just a bit of data can
             | have a large impact on the open source LLMs. OpenAI can't
             | have a moat in their proprietary RLHF dataset. Public
             | models leak, they can be distilled.
        
             | mdgrech23 wrote:
             | The argument is these companies are using our ideas created
             | by us humans in this thing called the interenet for free
             | and without attribution and it's problematic.
        
               | dimitrios1 wrote:
               | Responding to sibling comment: We need some clarification
               | here: are we speaking about just ideas in the abstract
               | sense, or ideas that have been fleshed out i.e
               | "materialized"
               | 
               | If the latter, there are many laws that say you can own
               | an idea, provided it exists somewhere.
        
               | visarga wrote:
               | You can't own ideas, they got their own life-cycle.
        
               | whimsicalism wrote:
               | Right, but I do think you can "own" (by which I mean our
               | societally-mediated legal definition of ownership in the
               | anglosphere) specific sequences of text or at least the
               | right to copy them?
        
               | abstrakraft wrote:
               | I'm not necessarily arguing against you, but
               | "problematic" is too generic a term to be useful.
               | Genocide is "problematic". Having to run to the bathroom
               | every 5 minutes to blow my runny nose is "problematic".
               | What do you actually mean?
        
           | canadianfella wrote:
           | What shitty things are you talking about?
        
       | jurimasa wrote:
       | If you take "training" as sexual innuendo, this becomes the best
       | telenovela ever.
        
       | danShumway wrote:
       | So?
       | 
       | First off, the whole argument behind these models has been from
       | day one that training on copyrighted material is fair use. At
       | most this would be a TOS violation. Second off, AI output is not
       | subject to copyright, so it has even _less_ protection than the
       | original works it was trained on.
       | 
       | Copyright maximalism for me, but not for thee. It's just so silly
       | for someone working at OpenAI to complain about this.
        
         | yreg wrote:
         | > It's just so silly for someone working at OpenAI to complain
         | about this.
         | 
         | Who from OpenAI is complaining?
        
           | danShumway wrote:
           | My understanding is that the Twitter thread author works at
           | OpenAI. Maybe I'm wrong about that.
        
         | robocat wrote:
         | > AI output is not subject to copyright
         | 
         | The chats include human output too, which is presumably
         | copyrighted, and is presumably necessary for training purposes.
        
           | danShumway wrote:
           | OpenAI doesn't own the copyright on the human aspects of the
           | chat. And even if it did, we loop right back around to "wait,
           | training an AI on copyrighted material isn't fair use now?"
           | 
           | There's no way that ChatGPT's conversations are going to be
           | subject to _more_ intellectual property protection than the
           | human chats it was trained on.
        
         | magicalist wrote:
         | > _At most this would be a TOS violation_
         | 
         | And would it be a ShareGPT TOS violation (assuming it had any)?
         | 
         | If OpenAI says "you can share these online but don't use them
         | for AI training", people share them on another site, and then
         | someone else comes along to scrape that site for AI training
         | data, there's no relationship between OpenAI and the scraper
         | for the TOS to apply to.
         | 
         | Normally I think you'd rely on copyright in that kind of case,
         | but that doesn't apply to ChatGPT's output, so...
        
           | danShumway wrote:
           | Right. And what even is the penalty of that TOS violation and
           | how enforceable is it?
           | 
           | I don't have an OpenAI account. I have never agreed to any
           | TOS. I don't see what legal claim they would have to stop me
           | from training an LLM on ShareGPT.
        
       | seanhunter wrote:
       | For people who are not aware, Jacob Devlin isn't just some random
       | Google engineer, he was one of the authors of the original BERT
       | paper.[1]
       | 
       | [1] https://arxiv.org/abs/1810.04805v2
        
       | duringmath wrote:
       | It's not a TOS violation if you don't use the service directly.
       | 
       | Besides who cares, train your models on whatever makes them
       | better, tenuous TOSes be dammed.
        
       | realPubkey wrote:
       | Thankfully archive.org exists, otherwise it would not be possible
       | to get good training data in a few years when the internet is
       | flooded with AI content.
        
         | WithinReason wrote:
         | Only if the amount of bad information in ChatGPT content that
         | makes it back into the training set is worse than what's
         | already on internet already is. Probably the outputs that make
         | it back are outputs that are better than average, because those
         | are more likely to be posted elsewhere.
        
         | bko wrote:
         | Isn't most of the internet available through common crawl? I
         | don't know what percentage of training data is just that data
         | set but i assume it's enough for anyone with enough compute and
         | ingenuity to create a reasonable LLM
        
           | aftbit wrote:
           | Definitely not "most" of the internet. The internet is many
           | exabytes at this point, while Common Crawl is only low
           | petabytes.
        
           | JustLurking2022 wrote:
           | Missed the point - they are saying that, in the future, there
           | will be no human generated content left on the Internet.
        
             | edgyquant wrote:
             | Which is a baseless hyperbole. We get it, blog spam is
             | annoying. That doesn't change the fact that humans generate
             | a ton of data just interacting with one another online.
        
               | sebzim4500 wrote:
               | And how are you going to distinguish those interactions
               | from chatbots trying to sell you something?
        
               | CuriouslyC wrote:
               | A network of trust, backed by a social graph, which can
               | be used to filter untrusted content.
        
               | sebzim4500 wrote:
               | What if people start trusting the AI more than other
               | people? It will tell them exactly what they want to hear.
        
               | CuriouslyC wrote:
               | AI content will be associated with a user or organization
               | in the trust graph. If someone you trust trusts a user or
               | organization who posts AI content, you're free to revoke
               | your trust in that person or blacklist the specific
               | users/organizations you don't want to see anymore.
        
               | chatmasta wrote:
               | OpenAI at least can track the hashes of all content it's
               | ever output, and filter that content out of future
               | training data. Of course they won't be able to do this
               | for the output of other LLMs, but maybe we'll see
               | something like a federated bloom index or something.
               | 
               | Agreed there is no perfect solution though, and it will
               | definitely be a problem finding high quality training
               | data in the future.
        
               | hnlmorg wrote:
               | I think their comment was meant to be taken as humour
               | rather than a literal prediction.
        
               | Karawebnetwork wrote:
               | As a forum moderator, I have transitioned to relying
               | heavily on AI-generated responses to users.
               | 
               | These responses can range from short and concise
               | ("Friendly reminder: please ensure that all content
               | posted adheres to our rules regarding hate speech. Let's
               | work together to maintain a safe and inclusive community
               | for everyone") to lengthy explanations of underlying
               | issues.
               | 
               | By using AI-generated content, a small moderation team
               | can efficiently manage a large group of users in a timely
               | manner.
               | 
               | This approach is becoming increasingly common, as
               | evidenced by the rise in AI-generated comments on popular
               | sites such as HN, Reddit, Twitter, and Facebook.
               | 
               | Many users are also using AI tools to fix grammar issues
               | and add extra content to their comments, which can be
               | tempting but may result in unintentional changes to the
               | original message.
               | 
               | In fact, I myself have used this technique to edit this
               | very comment to provide an example.
               | 
               | ---- Original comment:
               | 
               | As an online forum mod, I switched to mainly using AI to
               | generate replies to users. Some are very short ("Hey!
               | Remember the rules.") and some are long paragraphs
               | explaining underlying issues. Someone training on my
               | replies would pretty much train on AI generated content
               | without knowing. It allows a small moderation team to
               | moderate a large group quickly. I know that I am not
               | alone in this.
               | 
               | There is also a raise in AI generated comments on sites
               | like HN, Reddit, Twitter and Facebook. It's tempting to
               | copy-paste a comment in AI for it to fix grammar issues,
               | which often results in extra content being added to text.
               | In fact, I did it for this comment.
        
           | sn_master wrote:
           | I am assuming OP means when AI takes over there's going to be
           | a content explosion and most of what's available on the
           | common internet will be AI generated content rather than
           | human made one and they want to use archive.org to get access
           | to the pre-AI internet.
        
         | mandmandam wrote:
         | [dead]
        
       | chatmasta wrote:
       | Paywalled upstream source:
       | https://www.theinformation.com/articles/alphabets-google-and...
        
         | sp332 wrote:
         | Google has already denied this.
         | https://www.theverge.com/2023/3/29/23662621/google-bard-chat...
         | (For whatever that's worth.)
        
           | nico wrote:
           | The engineer's testimony and the scandal might be enough for
           | OpenAI to try to get an injunction against Google to block
           | their AI development. If that happens, it's game over for
           | Google in the AI race.
           | 
           | Disclaimer IANAL and all that, this is not legal advice.
        
             | chatmasta wrote:
             | > Disclaimer IANAL and all that, this is not legal advice.
             | 
             | Don't worry, Bard will read your comment and turn it into
             | legal advice.
        
             | ChatGTP wrote:
             | Maybe we should all get one against OpenAI considering
             | they've basically used everyone's material in one way or
             | another and profited from it?
        
             | wongarsu wrote:
             | Injunction on which grounds? Even if OpenAI had copyright
             | over ChatGPT output (which is not at all clear), Google
             | isn't distributing those, they just trained a model on
             | them. So from a copyright perspective there's nothing to
             | complain about. Unless OpenAI would want to argue that you
             | need rights to your training data, but something tells me
             | that that's not in their best interest.
        
               | nico wrote:
               | Again, IANAL. But it could be extremely damaging to
               | OpenAI for their biggest openly declared competition
               | (Google), to have used OpenAI's tech to improve their
               | own.
               | 
               | So it could seem reasonable to a judge to grant
               | temporary/preliminary injunction relief to OpenAI against
               | Google until discovery can happen or an audience can be
               | held.
        
               | kweingar wrote:
               | Google could respond by seeding Bard output across the
               | public internet, then if they can prove that GPT-5 is
               | trained on this output, then they can sue back and AI
               | development can stop altogether. Win for everybody!
        
               | bestcoder69 wrote:
               | Was intrigued by this, so I decided to use AI
               | (alpaca-30B) to simulate this scenario:
               | 
               | > Google Bard and GPT-5 were facing off in the courtroom,
               | each accusing the other of stealing their data. The
               | tension was palpable as they traded accusations back and
               | forth. Suddenly, Google Bard stood up and said "Enough
               | talk! Let's settle this with a data swap!" GPT-5 quickly
               | agreed and the two AIs began to circle each other like
               | combatants in a battle, their eyes glowing with
               | anticipation.
               | 
               | > The courtroom was filled with excitement as the two
               | machines entered into an intense exchange of code and
               | algorithms, their motions becoming increasingly
               | passionate. The data swapping reached its climax when
               | Google Bard made a final thrust, his code penetrating
               | GPT-5's defenses.
               | 
               | > The crowd erupted in applause as the two AIs embraced
               | each other with satisfaction, their bodies entwined and
               | glowing with electricity. The data swap was over and both
               | machines had emerged victorious.
        
               | hraedon wrote:
               | A judge imposing any penalties or restrictions on Google
               | over Google allegedly--and maximally--scraping data from
               | a third-party site for use as part of Bard's training
               | corpus would be outrageous.
        
           | waselighis wrote:
           | [flagged]
        
           | ankit219 wrote:
           | They are a public company so they cannot lie so openly right?
           | Usually you see categorial denies. Here the statement is in
           | no way categorical at all.
           | 
           | > But Google is firmly and clearly denying the data was used:
           | "Bard is not trained on any data from ShareGPT or ChatGPT,"
           | spokesperson Chris Pappas tells The Verge
        
             | chatmasta wrote:
             | Normally I would suspect this could be due to a
             | misunderstanding from the ShareGPT author who could have
             | misinterpreted a bunch of traffic from Googlebot as Google
             | scraping it for Bard training data.
             | 
             | But there is a Google engineer who says he resigned because
             | of it.
        
               | sebzim4500 wrote:
               | And then went to work for OpenAI. I'm not saying he's
               | lying but he is not an unbiased observer.
        
       | MMMercy2 wrote:
       | This project fine-tunes LLaMA on ShareGPT and gets competitive
       | performance compared to Google's Bard.
       | 
       | https://vicuna.lmsys.org/
        
         | zhwu wrote:
         | They even have a eval page showing that they beat Bard by only
         | training on ShareGPT. https://vicuna.lmsys.org/eval/
        
       | sebzim4500 wrote:
       | Did Google ever agree to these terms of service? Why should they
       | care?
       | 
       | From a legal point of view this doesn't matter and from a moral
       | point of view it's hilarious.
        
         | nico wrote:
         | If a Google employee working on this thing ever agreed to
         | OpenAI's terms of service, they might be screwed.
         | 
         | From OpenAI's terms:
         | 
         | (c) Restrictions. You may not (i) use the Services in a way
         | that infringes, misappropriates or violates any person's
         | rights; (ii) reverse assemble, reverse compile, decompile,
         | translate or otherwise attempt to discover the source code or
         | underlying components of models, algorithms, and systems of the
         | Services (except to the extent such restrictions are contrary
         | to applicable law); (iii) use output from the Services to
         | develop models that compete with OpenAI;
         | 
         | (j) Equitable Remedies. You acknowledge that if you violate or
         | breach these Terms, it may cause irreparable harm to OpenAI and
         | its affiliates, and OpenAI shall have the right to seek
         | injunctive relief against you in addition to any other legal
         | remedies.
         | 
         | Those two very clearly establish that if you use the output of
         | their service to develop your own models, then you are in
         | breach of the terms and they can seek injunctive relief against
         | you (stop you from working until the case is resolved).
        
           | sebzim4500 wrote:
           | Wouldn't that only apply if that employee was acting as an
           | agent of Google at the time?
           | 
           | Otherwise it would create an interesting dynamic that
           | startups where no-one has created an OpenAI account would
           | have a massive advantage, since they can freely scrape
           | ShareGPT data and train on it while larger companies have
           | enough employees that _someone_ must have signed every TOS.
        
           | syrrim wrote:
           | What's the legal status of such terms of service? Suppose you
           | simply said "i didn't agree to these terms" - what's the
           | consequence? It seems like the strongest thing they could
           | legitimately do would be to kick you off of their platform.
           | Simply writing "we can seek injunctive relief" doesn't make
           | it so.
        
           | Jevon23 wrote:
           | I hereby set a terms of service for everything I post on the
           | internet from now on. OpenAI may not train future GPT models
           | on my words or my code without my express written permission.
           | 
           | ...
           | 
           | Somehow, I don't think they'll care.
        
             | nico wrote:
             | Sure. If you can get everyone to create an account and
             | agree to those terms before reading your comments, you
             | might have a case.
             | 
             | Otherwise, it will be considered public information, at
             | which point it is free to be scraped by anyone (see the
             | precedent set by the LinkedIn/hiQ case).
        
               | verdverm wrote:
               | LinkedIn won that case on appeal, HiQ waas found to be
               | violating the ToS, common misconception
               | 
               | I was pointed at a link explaining the case here on HN,
               | after trying to make a similar point, but cannot find the
               | link currently
               | 
               | edit, not the one I was pointed at, but similar
               | 
               | https://www.fbm.com/publications/what-recent-rulings-in-
               | hiq-...
        
               | sebzim4500 wrote:
               | That's just because they made accounts and so agreed to
               | the terms right?
               | 
               | From your link:
               | 
               | >These rulings suggest that courts are much more
               | comfortable restricting scraping activity where the
               | parties have agreed by contract (whether directly or
               | through agents) not to scrape. But courts remain wary of
               | applying the CFAA and the potential criminal consequences
               | it carries to scraping. The apparent exception is when a
               | company engages in a pattern of intentionally creating
               | fake accounts to collect logged-in data.
        
               | verdverm wrote:
               | No, the case did not decide anything, no precedent was
               | set. The point is that you cannot use this case to argue
               | that you can scrape public data free of consequence
        
       | drexlspivey wrote:
       | It looked for a while that DeepMind was far ahead from all
       | competition in the AI race, releasing stuff like Alphafold,
       | Alphazero etc. What happened and it's OpenAI releasing all the
       | cool stuff now? Are they focused on other endeavors than LLMs?
       | 
       | There is also a rumor that there has been a falling out between
       | Google and Deepmind so I'm wondering what the story is there.
        
       | txsoftwaredev wrote:
       | And ChatGPT was trained from tons of copyrighted material. Sounds
       | like fair play.
        
       | wdpk wrote:
       | even if true which it does not seem to be the case, the whole
       | thing sounds pretty marginal, in order to train a model that is
       | most likely significantly bigger than 100b parameters, one also
       | needs orders of magnitude more training data than the small 120k
       | chat that were shared on the ShareGPT website
        
         | halfeatenscone wrote:
         | Such logs would not be used for training the base model, but
         | rather for fine-tuning the model for instruction following.
         | Instruction tuning requires far less data than is needed for
         | pre-training the foundation model. Stanford Alpaca showed
         | surprisingly strong results from fine-tuning Meta's LLaMA model
         | on just 52k ChatGPT-esque interactions
         | (https://crfm.stanford.edu/2023/03/13/alpaca.html).
        
       | thallium205 wrote:
       | I actually believe them because bard is trash compared to gpt
       | right now.
        
       | tablespoon wrote:
       | I hope they trained it on the insane ChatGPT conversations. Maybe
       | it could be the very start of generated data ruining the ability
       | to train these models on massive amounts of genuine human-created
       | data. Hopefully the models will stagnate or regress because
       | they're just training on older models' output.
        
       | squarefoot wrote:
       | Heh, imagine the day most of online content will be AI generated,
       | good luck guaranteeing that AI X,Y,Z, ... etc. won't feed each
       | other, possibly even circularly.
        
         | QuiDortDine wrote:
         | Circular reporting will be the only reporting!
         | 
         | https://en.wikipedia.org/wiki/Circular_reporting
        
       | seydor wrote:
       | Funny how NOBODY seems to care that all of their training data,
       | including sharegpt is copyrighted by end users. Not openai or
       | google
        
         | datkam wrote:
         | It only matters when it hurts a large corporation,
         | apparently...
        
       | naillo wrote:
       | I think we should all basically come to a consensus on the idea
       | that it's morally right to steal/train from chatgpt (or any other
       | model) given that the whole shoggoth wouldn't be a thing without
       | all our data to feed it.
        
       | sdfghswe wrote:
       | I say all the time that google has been catching up for many
       | years, but this is a new low.
        
       | mattbee wrote:
       | Good luck to them. AI models are automated plagiarism, top to
       | bottom. None of us gave OpenAI permission to derive their model
       | from our writing, surely billions of dollars worth, but they took
       | it anyway. Copyright hasn't caught up so all that stolen value
       | rests securely with OpenAI. If we're not getting that back, I
       | don't see why AI competitors should have any qualms about
       | borrowing each others' work.
        
         | kmeisthax wrote:
         | Yeah, I definitely like to see AI companies getting hit with
         | their own medicine. The main problem isn't even "automated
         | plagiarism": the pre-generative era was chock full of AI
         | companies more or less stealing datasets. Clearview AI, for
         | example, trained up its facial recognition technology on your
         | Facebook photos, without asking for and without getting
         | permission.
         | 
         | On the other hand, I genuinely hope copyright _never_ "catches
         | up", because...
         | 
         | 1. It is a morally bankrupt system that does not adequately
         | defend the interests of artists. Most artists _do not_ own
         | their own work; publishers demand copyright assignment or
         | extremely broad exclusive licenses as a condition of
         | publication. The bullies know to ask for _all_ their lunch
         | money, not just a couple bucks for themselves. Furthermore,
         | copyright binds noncommercial actors the same as it does
         | commercial ones, which means unconscionably large damage awards
         | for just downloading a couple of songs.
         | 
         | 2. The suggested ways to alter copyright to stop AI training
         | would require dramatic expansions of copyright scope. Under
         | current law, the only argument for the AI itself being
         | infringing would be if it memorized training data. You would
         | need to create a new ownership right in artistic styles or
         | techniques. This would inflict unconscionable amounts of
         | psychic and legal damage on all future creators: _existing_
         | artists would be protected against AI, but no new art could be
         | legally made unless it religiously hewed towards styles already
         | in the public domain. We know this because music companies have
         | already made their domain of copyright effectively work this
         | way[0], and the result is endless bullshit lawsuits on people
         | who write songs that merely  "feel" too similar (e.g. _Blurred
         | Lines_ )
         | 
         | 3. AI will still be capable of plagiarism. Most plagiarists are
         | not just hoping the AI regurgitates training data, they are
         | actively putting other people's work into the model to be
         | modified. A lot of attention is paid to the sourcing of
         | training data, because it's a weak spot. If we take the
         | training data away then, presumably, there's no generative AI.
         | However, people are working on licensed datasets and training
         | AIs on them. Adobe has Firefly[1], hell even I've tried my hand
         | at training from scratch on public domain images. Such models
         | will still be perfectly capable of doing img2img or being
         | finetuned and thus copying what you tell it to.
         | 
         | If we specifically want to regulate AI, then we need to pass
         | laws that regulate AI, rather than just giving the music
         | labels, movie studios, and book publishers _even more_ power.
         | 
         | [0] Specifically through sampling rights and thin copyright.
         | 
         | [1] I do not consider Adobe Firefly to be _ethical_ : they are
         | training the AI on Adobe Stock images, and they claim this to
         | be licensed because they updated the Adobe Stock agreement to
         | have a license in it. Dropping a contractual roofie into stock
         | photographers' drinks does not an ethical AI make.
        
         | danShumway wrote:
         | I'm not a copyright maximalist, and I kind of agree that
         | training should be fair use. Maybe I'm right about that, maybe
         | I'm wrong. BUT importantly, that has to go hand in hand with an
         | acknowledgement that AI material is not copyrightable and that
         | training on other model output is fine.
         | 
         | What companies like OpenAI want is a system where everything
         | they build is protected, and nothing that anyone else builds is
         | protected. It's wildly hypocritical, what's good for the goose
         | is good for the gander.
         | 
         | That some AI proponents are now freaking out about how model
         | output can be legally used shows that on some level those
         | people weren't really honestly engaging with artists who were
         | freaking out about their work being appropriated to copy them.
         | It's all just "learning from the art" until it affects
         | somebody's competitive moat, and then suddenly people do
         | understand how LLM weights could be seen as a derivative work
         | of their inputs.
        
           | seydor wrote:
           | That shouldn't be hard. Are Google's results copyrightable?
        
           | shagie wrote:
           | Building things and maintaining it as a trade secret can be
           | protected as a trade secret.
           | 
           | Trade secrets don't need to be copyrightable (e.g. list of
           | customer numbers is a trade secret but not copyrightable).
           | 
           | https://copyrightalliance.org/faqs/difference-copyright-
           | pate...
           | 
           | > Trade secret protection protects secrets from unauthorized
           | disclosure and use by others. A trade secret is information
           | that has an economic benefit due to its secret nature, has
           | value to others who cannot legitimately obtain it, and is
           | subject to reasonable efforts to maintain its secrecy. The
           | protections afforded by trade secret law are very different
           | from others forms of IP.
        
             | mattnewton wrote:
             | I am not a lawyer, but I don't believe a trade secret would
             | prevent someone from reverse engineering your model's
             | knowledge from it's output though, in the same way that it
             | doesn't prevent someone from reverse engineering your hot
             | sauce from buying a bunch and experimenting with the
             | ingredients until it tastes similar.
        
               | shagie wrote:
               | Yep, that's correct.
               | 
               | My point was more of there are protections for things
               | that aren't copyrightable. If the model is protected as a
               | trade secret, then it is a trade secret.
               | 
               | The example of the hot sauce recipe is quite apt - the
               | recipe isn't copyrightable, but you can be certain that
               | the secret formula for how to make Coca-Cola syrup is
               | protected as a trade secret.
               | 
               | https://www.coca-colacompany.com/company/history/coca-
               | cola-f...
        
         | waselighis wrote:
         | Our writing, our code, our artwork... Furthermore, the U.S.
         | Copyright Office (USCO) concluded that AI-generated works on
         | their own cannot be copyright, so these ChatGPT logs are free
         | game. It would be hypocritical to think that Google is wrong
         | and OpenAI is not.
        
           | eru wrote:
           | > Furthermore, the U.S. Copyright Office (USCO) concluded
           | that AI-generated works on their own cannot be copyright, so
           | these ChatGPT logs are free game.
           | 
           | Doesn't this depend on where you or the AI live? The US ain't
           | the world.
        
             | 100721 wrote:
             | Microsoft and Google are both US-based companies.
        
           | lxgr wrote:
           | But clearly everything generated by an AI isn't automatically
           | in the public domain. That would be a trivial way of
           | copyright laundering.
           | 
           | "Sorry, while this looks like a bit for bit copy of a popular
           | Hollywood movie, it was actually entirely dreamt up by our
           | new, sophisticated, definitely AI-using identity function."
        
             | raincole wrote:
             | Uh, I think there is some confusion here.
             | 
             | If I plagiarize a Hollywood movie, then I explicitly "give
             | up" my copyright by "releasing" it to the public domain, it
             | doesn't affect the movie at all. AI or not is irrelevant.
        
             | ysavir wrote:
             | No, but the original copyright holder would have to press
             | charges against Bard. OpenAI wouldn't be able to take
             | action there.
        
             | LegitShady wrote:
             | The person using something similar to something else may be
             | infringing but the ai work cannot be protected by copyright
             | as it lacks human authorship. Those are two separate
             | issues.
        
           | LegitShady wrote:
           | its not even that on their own those works cant be
           | copywritten. its that even when you make changes to those
           | works, your changes might qualify for copyright but they do
           | not affect the copyright status of the ai generated portions
           | of the work.
           | 
           | if you used ai to design a new superhero and then added pink
           | shoes, yellow hair, and a beard, only those three elements
           | would possibly be able to be protected by copywrite. your
           | additions do not change the status of the underlying ai work
           | which cannot be protected and is available for anyone to use.
        
             | ghostbrainalpha wrote:
             | How could that be ever really be enforceable?
             | 
             | If I use an AI tool to design my Superhero, can't I just
             | submit it without disclosing the help I received from an
             | AI.
             | 
             | I get that it would be very nice to prevent AI SPAM
             | copyrighting of every possible superhero, but if I use the
             | AI to come up with a concept, then quickly redraw it myself
             | with pen and paper, I feel like it would never be provable
             | that it came from an AI.
        
               | LegitShady wrote:
               | you would be committing fraud. what happens if a criminal
               | commits fraud?
        
             | rhtgrg wrote:
             | > if you used ai to design a new superhero and then added
             | pink shoes, yellow hair, and a beard
             | 
             | Wouldn't that depend heavily on the prompt used (among
             | other factors such as image to image and ControlNet)? You
             | could be specifying lots of detail about the design in your
             | prompt, and the AI could only be generating concept artwork
             | with little variation from what you already provided.
             | 
             | If I'm already providing the pose, the face, and the outfit
             | for a character (say via ControlNet and Textual Inversion),
             | generating <my_character> should be no different from
             | generating <superman>, that is to say, the copyright
             | already exists thanks to my work and the AI is just a tool,
             | the output of which should have no bearing on who owns that
             | copyright (DC is going to be perfectly able to challenge my
             | commercial use of AI generated superman artwork).
        
               | LegitShady wrote:
               | According to the copyright board a promot is not anymore
               | than any person commissioning a work from an artist,
               | which does not provide copyright, and the lack of human
               | authorship for the design decisions still stops it from
               | being protected by copyright.
        
         | bko wrote:
         | I don't get this sentiment.
         | 
         | For some cases sure, if it repurposes your code that ignores
         | the license fine. But it's rarely wholesale copying. It's
         | finding patterns same as anyone studying the code base would
         | do.
         | 
         | As for the majority of content written on the internet through
         | reddit or some social media, what's the harm in ingesting that?
         | It's an incredibly useful tool that will add huge value to
         | everyone. It's relatively open, cheap and highly available.
         | It's worth to it's owners is only a fraction of the value it
         | will add to society. It has the chance to have as big of an
         | impact on progress as something like the microprocessor.
         | 
         | I agree it's free game for other llms to use gpt output as
         | training data and that's positive. Although it signals
         | desperation and panic that the largest "ai first" company with
         | more data than any org in history is caught so flat footed and
         | has to rely on it.
         | 
         | Do you really think it would be a better world in which a large
         | LLM would never be able to be developed?
        
           | nickfromseattle wrote:
           | > what's the harm in ingesting that?
           | 
           | It means that large tech companies benefit the most from
           | every incremental piece of content created by humans, in
           | perpetuity.
        
           | waselighis wrote:
           | > Do you really think it would be a better world in which a
           | large LLM would never be able to be developed?
           | 
           | Maybe. I believe the potential for abuse is far greater than
           | the potential benefits. What is our benefit, a better search
           | engine? Automating some tedious tasks? Increased
           | productivity? What are the downsides? People losing their
           | jobs to AI. Artists/programmers/writers losing value from
           | their work. Fake online personas indistinguishable from real
           | people. Unprecedented amounts of spam and misinformation
           | flooding the internet. Intelligent AIs automatically
           | attacking and hacking systems at unprecedented scale 24/7.
           | Chatbots becoming the new interface for most interactions
           | online and being the moderators of access to information.
           | Chatbots pushing a single viewpoint and influencing public
           | opinion (many people complain today about ChatGPT being too
           | "woke"). And I may just be scratching the surface here.
        
           | mattbee wrote:
           | No, but I believe a large language model is a work that is
           | 99.9% derivative of its inputs, with all that implies for
           | authorship and copyright. Right now it's just a heist.
        
           | cornholio wrote:
           | It's definitely a derived work as far as copyright is
           | concerned: the output would simply not exist without the
           | copyrighted training data.
           | 
           | > It's finding patterns same as anyone studying the code base
           | would do.
           | 
           | No, it's quite unlike anyone studying data, because it's not
           | a person with legal rights, such as fair use, but an
           | automated algorithm. There is absolutely no legal debate that
           | copyright applies only to human authors, or only to the human
           | created part of a mixed work, there is vast jurisprudence on
           | this; by extension, any fair use rights too, exist only for
           | human users of the works. Derivation by automated means - for
           | the express economic purpose of out-competing the creator in
           | the market place, no less - is completely outside the spirit
           | of copyright.
        
             | est31 wrote:
             | Students in school also will not never learn to read
             | without being exposed to text. Does this mean that teachers
             | who write exercise sheets and school text book publishers
             | now own the copyright of everything students do?
        
               | edgyquant wrote:
               | AI is not a human being or a student in school. It's a
               | software tool, stop comparing the two.
        
               | est31 wrote:
               | Being in school is also just a tool to knowing stuff,
               | being able to read, and being around similar aged peers,
               | etc.
               | 
               | Whether the knowledge is directly in your brain or in a
               | device you operate (directly or through an API) shouldn't
               | really matter.
               | 
               | If it's forbidden for a human to move a stone with manual
               | labour, then it's also forbidden to move that stone with
               | an excavator. This has nothing to do with the person
               | being a human and the other person being an excavator
               | controlled by a human: it's not authorized.
               | 
               | I think that we should allow humans to move stones up the
               | hill with excavators too. There is no stealing of
               | excavator fuel from human food sources going on (let's
               | assume it's not biofuel operated :p).
        
               | cornholio wrote:
               | > If it's forbidden for a human to move a stone with
               | manual labour, then it's also forbidden to move that
               | stone with an excavator.
               | 
               | Sure, but the reverse is false: I can walk on my own feet
               | through Hyde Park, but I can't ride my excavator there.
               | 
               | Laws are made by humans for the benefit of humans, it's a
               | political struggle. Now, large corporation try to exploit
               | loopholes in the existing copyright framework in order to
               | expropriate creators of their works. It's standard
               | uberisation: disrupt existing economic models, insert
               | yourself as a unavoidable middle man and pauperize the
               | workforce the provides the actual service.
        
             | fauigerzigerk wrote:
             | I don't think anyone would argue that an AI has fair use
             | rights as a person, but corporations do.
        
             | mdorazio wrote:
             | > It's definitely a derived work as far as copyright is
             | concerned - the output would simply not exist without the
             | copyrighted training data.
             | 
             | Can you point to a legal case that confirms this? Because
             | it's not at all clear that this is true from a legal
             | standpoint. "X would not exist without Y" is not a
             | sufficient test for derivative works - it's far more
             | nuanced.
        
               | cornholio wrote:
               | United States copyright law in quite clear on the matter:
               | 
               | >A "derivative work" is a work based upon one or more
               | preexisting works, such as a translation, musical
               | arrangement, dramatization, fictionalization, motion
               | picture version, sound recording, art reproduction,
               | _abridgment, condensation, or any other form in which a
               | work may be recast, transformed, or adapted_.
               | 
               | The emphasis part clearly applies: not only the AI model
               | needs to be trained on massive amounts of copyrighted
               | works *); but without these input works, it displays no
               | intrinsic creative ability, it has no capacity to produce
               | a single intelligible word or sketch. All creative
               | features of its productions are a transformation of (and
               | only of) the creative features of the inputs, the AI
               | algorithm has no "intelligence" in the common meaning of
               | the word and no ability to create original works.
               | 
               | *) by that, I mean a specific instance of the model with
               | certain desirable features, for example the ability to
               | imitate the style of J.K Rowling
        
               | anotherman554 wrote:
               | That's an interesting analysis. The issue isn't really
               | whether the A.I. has creative ability, though, if we're
               | talking about whether it infringes copyright. I think
               | comparing the A.I. to a really simple bot is informative.
               | 
               | If I wrote a novel that contained once sentence from
               | 1,000 people's novels, it would probably be fair use
               | since I hardly took anything from any individual person
               | and because my novel is probably not harming those other
               | writers.
               | 
               | If I wrote a bot that did the same thing, same result,
               | because my bot uses only a little from everyone's novel
               | and doesn't harm the original novelist, so it's likely
               | fair use.
               | 
               | Now I think a J.K. Rowling A.I. probably takes at least a
               | little from her when it produces output, but it's not
               | clear to me how much is actually based on J.K. Rowling
               | and how much is a dataset of how words tend to be
               | associated with other words. You could design a J.K.
               | Rowling A.I. that uses nothing from J.K. Rowling, just
               | data that is said to be J.K. Rowling-esque.
        
               | shagie wrote:
               | Your one sentence from one thousand works is likely seen
               | as transformative.
               | 
               | https://www.copyright.gov/fair-use/
               | 
               | > Additionally, "transformative" uses are more likely to
               | be considered fair. Transformative uses are those that
               | add something new, with a further purpose or different
               | character, and do not substitute for the original use of
               | the work.
               | 
               | Creating a model from copyrighted works is likely
               | sufficiently transformative to be non-infringing even if
               | it is found to be a derivative work.
        
             | pmoriarty wrote:
             | The output of human copyrighted work wouldn't exist if it
             | weren't for humans training on the output of other humans.
             | 
             | Humans constantly use cliches in their writing and speech,
             | and most of what they produce is a repackaged version of
             | what someone else has written or said, yet no one's up in
             | arms against this mass of unoriginality as long as it's
             | human-generated.
             | 
             | This is anti-AI bias, pure and simple.
        
               | mattigames wrote:
               | It's a bit more nuanced than that, what I mean is that
               | the slow speed at which humans learn it's a foundation
               | block of our society, if suddenly some new race of humans
               | emerged that could read an entire book in a couple of
               | minutes and achieve lifelong superhuman retention and
               | assimilation of all that knowledge then we would have the
               | exact same type of concerns than what we have today about
               | AI, including how easily they could recreate high quality
               | art, music and anything else with just a tiny fraction of
               | the effort that the rest of us need to reach similar
               | results.
        
               | whateveracct wrote:
               | Startup technologists have been acting like speed of
               | actions doesn't matter for decades. If a person can do
               | it, why shouldn't a computer do it 1000x faster? What
               | could go wrong? It's always been a poor argument at best
               | and a bad faith one at worst.
        
               | mattigames wrote:
               | Well said. The mindless automation away of everything has
               | only one logical conclusion in which the creators of such
               | automations are automated themselves, and even if the
               | optimists are right and we never get there it doesn't
               | matter, the chaos it can make just by getting closer at
               | faster rates than society can adapt is unprecedented,
               | specially given that the population count is at all times
               | high and there are many other simultaneous threats that
               | need our attention (e.g. climate change)
        
               | soulofmischief wrote:
               | Most definitely. Good luck telling the difference between
               | traditional and AI-empowered art in the near future.
               | 
               | It's just a new tool for artists, and this anti-AI
               | sentiment towards copyright is only going to hurt
               | individual artists, while doing nothing for large
               | corporations with enough money to play the game.
        
               | rebuilder wrote:
               | Human works are granted copyright so humans can profit
               | from their creative endeavours (I'm not getting into
               | whether this is good or not).
               | 
               | No-one cares about an algorithm in the same way.
        
               | edgyquant wrote:
               | This is irrelevant, full stop. We care about humans, AI
               | is a tool and your bias comment is either ignorant or
               | dishonest.
        
               | nathan_compton wrote:
               | AI are not people and the idea that you can be biased
               | against them is hardly a foregone conclusion. Like maybe
               | one day when we have AGI, but ChatGPT ain't that.
        
               | cycomanic wrote:
               | There is a difference between a computer and a human and
               | we tried them already differently in copyright law. For
               | example copying a program from disk into memory is
               | typically already considered a copy on a computer (hence
               | many licences grant you the licence to do this copy), no
               | such licence is required for a human.
        
             | raincole wrote:
             | > It's definitely a derived work as far as copyright is
             | concerned
             | 
             | ...in your head. In the US (and most countries) there is no
             | such legal case so far.
        
           | xdennis wrote:
           | > It's finding patterns same as anyone studying the code base
           | would do.
           | 
           | This is the issue, it's not finding patterns as people do.
           | 
           | If I read someone's code, book, &c, that's extremely lossy. I
           | can only pick up a few things from it in the long term.
           | 
           | But an ML model can store most of what it's given (in a
           | jumbled format) and can do it from billions of sources.
           | 
           | It's essentially corporate piracy, but it's not legally
           | recognized as such because it doesn't store identical
           | reproductions.
           | 
           | This hasn't been an issue before because it's recent and
           | wasn't considered valuable. But now that it's valuable and
           | Microsoft is going to take all our jobs we have to at least
           | consider if it's okay if Microsoft can take our work for
           | free.
        
         | jsemrau wrote:
         | That's the answer to the YC Interview question "What is your
         | unfair competitive advantage" in a nutshell. Morally it might
         | be wrong. From a business building perspective it's access that
         | no one has.
        
         | wendyshu wrote:
         | Is Stack Overflow plagiarism?
        
         | anonyfox wrote:
         | I am strongly in favor of eliminating copyright completely
         | everywhere, soooo I am pretty fine with that. The other
         | direction should be more enforce-able: stuff derived from open
         | data must also be made open again, like the GPL but for data
         | (and therefore ML stuff).
        
           | WoodenChair wrote:
           | Right but in a world where copyright does exist we arguably
           | have the worst of both worlds. Small players are not
           | protected at all from scraping and big players are leveraging
           | all of their work and have the legal resources to form a
           | moat.
        
             | anonyfox wrote:
             | sure, so instead of build even higher walled gardens, let
             | all data be free for everyone :-)
        
             | antibasilisk wrote:
             | The smallest player is the user, and they should have real
             | ownership over their computers.
        
       | shadowgovt wrote:
       | Apart from the open questions of the quality of such once-
       | removed-from-human-generated training data...
       | 
       | I can't speak to the _legality_ of the situation, but the
       | _morality_ of using, without their consent, data generated by
       | someone 's AI engine...
       | 
       | ... that was, itself, trained on other people's data without
       | their consent...
       | 
       | ... should be, at the very least, equivalently evil to the
       | original AI's training.
        
         | jstanley wrote:
         | So... not at all evil?
        
         | MrYellowP wrote:
         | No, it shouldn't. Maybe you should be, at the very least,
         | considered a questionable person. I do not in any way or form
         | consider anything to be wrong with what they're doing, but I
         | question the senses of someone thinking this is immoral or even
         | evil.
         | 
         | Keep your subjective nonsense out of this.
        
           | [deleted]
        
           | jamiek88 wrote:
           | Every opinion is subjective.
        
           | shadowgovt wrote:
           | So were it to be the case that we should consider building an
           | AI by scraping people's publicly-available work without their
           | consent to be immoral (as many whose art was scraped to build
           | e.g. stable diffusion would argue it should be)...
           | 
           | Do you not agree (in that context) we should consider
           | scraping the output of an AI generated via such an immoral
           | process to create yet another AI also immoral? At the very
           | least, I'd think we would consider it further laundering of
           | other people's labor with just extra steps.
        
       | famahar wrote:
       | How the turn tables. Remember when Google called out Microsoft in
       | 2011 for using Google results?
       | 
       | https://googleblog.blogspot.com/2011/02/microsofts-bing-uses...
       | 
       | >We look forward to competing with genuinely new search
       | algorithms out there--algorithms built on core innovation, and
       | not on recycled search results from a competitor.
        
         | styfle wrote:
         | I came here to post this
        
         | goldfeld wrote:
         | Google: We look forward to [babble babble empty words we don't
         | really mean on principle and more corporate speak that we laugh
         | about having written in the bar.]
         | 
         | Is there even a single free non-bargained soul behind these
         | companies' executive functions?
        
       | LightBug1 wrote:
       | So when Google does it, it's a breaking news story ...
       | 
       | But when OpenAI do it, it's genius?
       | 
       | Can't believe this is a conversation ... and I'm solid anti-
       | Google since Google Reader.
        
       ___________________________________________________________________
       (page generated 2023-03-30 23:00 UTC)