A Handful of Thoughts on Where AI Learned to Speak: From Business, Racists, and Thieves [1]

Date: 2023-04-19

The Washington Post has a very interesting deep dive into how Google's imitative AI was trained. It is a fascinating look at the sources of the training data. A few thoughts below:

- This focused on Google's training data, but I suspect that, given the requirement for huge training sets, other imitative AI systems' training data looks very similar. Especially considering how OpenAI, for example, is suddenly intent on keeping its training data and models closed.

- The largest category was Business and Industrial, with News third (16 and 13 percent respectively). I suspect that because a lot of academic material is behind paywalls, it was harder to get access to, which means the system was not trained on accurate data, per se, but on reporting, opinion, and "the first draft" of history. It may also have a heavy slant toward capital: its top business site was fool.com, not a friend to the working person.

- It definitely learned to hate: places like Russia Today and VDARE, a vicious hate site, are in the data set. Stormfront and 4chan were also included. So when you ask a question that touches on race, it could very well have learned the patterns of racists to reply.

- It relied on thieves: more than twenty-five sites for pirated e-books were in the data set.

- More potential theft: according to the Post, more than 200 million images with copyright symbols attached were in the data set. It's possible they had permission to use all 200 million for their training data, but I wouldn't bet on it.

- It tended to filter out innocuous LGBTQ-related words, treating them as profane (see the sketch at the end of this post for how that kind of filtering goes wrong).

There is a lot more in the article, including how skewed the set was toward Western religions, including far-right offshoots of those religions. It also highlighted how personal information was swallowed whole by these tools.

But the larger takeaway here is that this appears to be deeply, deeply flawed training data with which to converse with the world. The set appears to have done a poor job of weeding out racism, is tilted toward the perspective of capital, doesn't mind getting help from people who stole e-books, and has minimized LGBTQ voices. And since many of these tools provide no sources for their claims, and the ones that do have been known to invent their sources, you cannot know when this skewed perspective is infecting the answers.

At a minimum, this shows the necessity of opening up the algorithms and training data to outside inspection. These biases need to be known so that people can fully understand the danger of relying too deeply on these tools.

This training data also demonstrates that the imitative AI tools are just another means of perpetuating a power structure with the people financing tech at the top. They have no qualms about working with racist sites, taking the work of creative people for their own use with no payment, minimizing LGBTQ voices, and pushing the opinions of capital. They present these things as authoritative, but they have the same biases as the people who make and fund them. Meet the new robot overlord, same as the old robot overlord.
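A note on the word-filtering point above: the usual cleanup approach drops any page containing a word from a profanity blocklist, which is exactly how innocuous LGBTQ-related pages get thrown out along with genuinely obscene ones. Here is a minimal sketch of that failure mode in Python. The BLOCKLIST entries and sample documents are hypothetical illustrations, not the actual list or data Google used.

    # Sketch of blocklist-based corpus filtering (hypothetical blocklist;
    # not the actual list used for Google's data set). Filtering drops any
    # document containing a flagged word, so pages that merely mention an
    # identity term vanish from the training corpus.

    BLOCKLIST = {"sex", "lesbian"}  # illustrative entries only

    def keep_document(text: str) -> bool:
        """Return True if no blocklisted word appears in the document."""
        words = {w.strip(".,!?").lower() for w in text.split()}
        return BLOCKLIST.isdisjoint(words)

    docs = [
        "City council honors a lesbian veterans group.",  # innocuous, but dropped
        "Quarterly earnings beat analyst expectations.",  # kept
    ]
    kept = [d for d in docs if keep_document(d)]
    print(kept)  # only the earnings story survives

The point of the sketch: the filter has no notion of context, so a word-level blocklist silently skews which communities are represented in whatever survives into the training set.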
---
[1] https://www.dailykos.com/stories/2023/4/19/2164735/-A-Handful-of-Thoughts-on-Where-AI-Learned-to-Speak-From-Business-Racists-and-Thieves