I know we officially hate AI, but let’s not be anti-science [1]

Date: 2024-06-18

I’m prepared to get a lot of hate for this, but talking to my fellow Democrats about AI is like talking to Republicans about COVID-19 and climate change. We seem to be picking up whatever misinformation is the most validating and spreading it around as much as possible, and that includes sources I would otherwise have considered responsible and trustworthy but now have to reevaluate.

I’ve been an avid programmer since I was 6, I have a software engineering degree, and I’ve been a software developer in the workforce in some capacity or another for almost 30 years now. AI has been of interest to me for years, since before the big AI gold rush hit. When I see people spreading misinformation (that I have no doubt they completely believe), it makes me cringe, because I had hoped we were better than disbelieving science just because we have strong negative opinions about it.

Anyway, I’m going to specifically call out this section of Hunter’s highly recommended article, because unfortunately misinformation tends to shape opinions:

[...] The current corporate orgasms are for the best-yet commercialized "Large Language Model," which is the generalized name for an algorithm that does the following:

1. Collect and store as much textual (and visual, and audio, and other) data as possible, and index and cross-index it all into an enormous database.

2. Allow users to type in a plain-text question or prompt.

3. Given the words and phrases the user typed in, search for what past humans have written in response to similarly phrased questions, copy them down and spit them out again as single-phrase-or-sentence snippets in what amounts to the world's first purpose-built plagiarism engine. What's the first declarative statement past humans have often used in their response? Copy that down first. What phrases tend to follow? Slap 'em in there. And so on.

Starting from the top: a large language model isn’t an “algorithm” any more than a molecular physics simulation is an “algorithm” for simulating a car. A large language model (like plenty of other “deep learning” AIs you’ve been unknowingly interacting with on the internet for well over a decade, such as the one that finds pictures of cute kittens when you type the word “kitten” into Google) is a neural network: specifically, a number of layers of simulated neurons. You give the top layer of neurons input (in this case text, but it could be an image, audio, or something else) as signals of different intensity. Each of those neurons, depending on the intensity of the signal it receives (represented as a number), sends a signal to every neuron in the next layer, which, depending on its own weights, sends a signal to every neuron in the following layer, and so on, until you reach the output layer, which outputs whatever the neural network has been trained to output; in the case of an LLM, that’s a bit of text called a token.
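To make that concrete, here’s a toy sketch in Python of a forward pass through a tiny network. Everything in it is made up for illustration: the layer sizes, the random weights (standing in for what training would normally produce), and the four-word vocabulary. A real LLM has billions of trained weights and extra machinery like attention, but the layer-to-layer flow of signals is the same basic idea.

import numpy as np

rng = np.random.default_rng(0)

# Random weights stand in for what training would normally produce.
W1 = rng.normal(size=(8, 16))   # input layer -> hidden layer
W2 = rng.normal(size=(16, 4))   # hidden layer -> output layer

def forward(x):
    """Push an input signal through the layers; return next-token probabilities."""
    h = np.tanh(x @ W1)               # each hidden neuron fires based on its weighted inputs
    logits = h @ W2                   # output layer: one raw score per vocabulary token
    p = np.exp(logits - logits.max())
    return p / p.sum()                # softmax: turn scores into probabilities

x = rng.normal(size=8)                # stand-in for an encoded prompt
vocab = ["the", "cat", "sat", "."]    # hypothetical four-token vocabulary
probs = forward(x)
print(dict(zip(vocab, probs.round(3))))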
So, to address point number 1: the collecting of training data isn’t part of the AI itself. A lot of people who are misinformed about AI envision it as some kind of vacuum cleaner that crawls the web sucking up data all the time. The actual data mining is done by the people collecting data to train it (they almost always use software to do it, but generally that software is just a regular web scraper, like the one Google uses for its search engine, not an AI). An AI can be trained on data found on the web, or it could just as well be trained exclusively on public domain data, and so on.

Point number 2 is correct.

Point number 3 is a doozy. An AI, much like a brain, generally learns by generalizing from the data it’s trained on, which means it mostly picks up general ideas. Only if it’s been trained on the same data over and over again will it memorize that data and regurgitate it. This was a mistake made with some of the earlier generative AIs; nowadays the issues with duplicate data are well enough known that if a modern (circa 2023–2024) AI regurgitates something copyrighted, there’s a very good chance it was deliberately overtrained on that thing so that it would memorize it. This is not, however, a problem inherent to all neural networks, or to LLMs specifically.

At any rate, what an AI does not do is store some kind of database of everything it was ever trained on, any more than your brain has a database of everything you’ve ever seen. The knowledge is stored in the neural network as the weights of the connections between simulated neurons. It is no more a “plagiarism engine” than your brain is when you talk about facts you learned from reading a book.

Sarah Silverman, in her partially dismissed case against OpenAI, alleges that her copyright was violated just because ChatGPT can summarize her book, when summarizing a book is, the last time I checked, completely legal, and is done all the time by book reviewers, students, Wikipedia article writers, and so on. In fact, this summarization is a good example of what AI actually does: it gains general knowledge about the things it’s trained on (which in this case may have been Silverman’s book, or just book reviews that summarized it). Now, the jury is still out on whether or not ChatGPT violated her copyright by summarizing her book, but it’s notable that ChatGPT did with her book things that humans do with it all the time. If that counts as violating her copyright, then ChatGPT will be considered in violation for something that until now has been completely legal, and that may very well have ramifications for real human authors as well.

At any rate, I want to stress that an LLM doesn’t “search for what past humans have written in response to similarly phrased questions, copy them down and spit them out again as single-phrase-or-sentence snippets.” It simply doesn’t work like that, and that piece of misinformation underpins a lot of other misconceptions about AI and plagiarism.
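To underline that point, here’s a rough sketch of what the generation loop actually looks like. The forward function below is a hypothetical stand-in that returns made-up probabilities; in a real model those probabilities come from the trained weights. Notice that nothing in the loop searches a stored corpus of past human answers: the model just predicts one token at a time and feeds its own output back in.

import numpy as np

rng = np.random.default_rng(1)
vocab = ["the", "cat", "sat", "on", "mat", "."]

def forward(context):
    # Hypothetical stand-in: a real model computes these probabilities
    # from billions of learned weights, given the tokens so far.
    logits = rng.normal(size=len(vocab))
    p = np.exp(logits - logits.max())
    return p / p.sum()

tokens = ["the"]                       # the prompt
for _ in range(5):
    probs = forward(tokens)            # a probability for every possible next token
    next_tok = rng.choice(vocab, p=probs)
    tokens.append(str(next_tok))       # feed the chosen token back in and repeat

print(" ".join(tokens))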
Finally, I want to respond to a couple of other things I see people saying a lot:

“You’re just buying into the hype.”

I’m a computer scientist. I’m not interested in what CNN says about it, or what kind of crap sociopathic wannabe billionaires are feeding to venture capitalists to scam them out of their money. There’s absolutely a lot of hype about AI, and it’s certainly being used as a buzzword, but it is not just a buzzword.

“AI just hallucinates and spouts whatever misinformation it thinks of.”

How much an AI “hallucinates” depends a lot on how sophisticated it is. Google’s new Gemini AI, for instance, is an absolute joke and hallucinates all the time. The latest ChatGPT-4o, on the other hand, is pretty reliable. In any case, most problems AI exhibits have human analogues: an AI hallucinates, a human misremembers; humans can learn and spread misinformation, and an AI can be trained on misinformation. What I haven’t seen is any comparison of how often a state-of-the-art AI hallucinates when answering questions versus how often a typical human (or even a subject matter expert) misremembers or gets things wrong. If I had to guess, I’d say it’s a far more reliable source than a random person on the street, and somewhat less reliable than a subject matter expert.

“This article says you’re wrong about how AI works.”

The article probably has its facts wrong. Show me a research paper.

“AI is stealing.”

The jury is literally still out on this. The realistic take is that copyright law currently has very little to say about it. Training an AI is generally just analyzing data and learning facts from it, and facts can’t actually be copyrighted. Court cases and regulations could change this, but no one knows right now how it will all fall into place.

“AI is useless and is going to go away soon.”

We’re certainly in the middle of a gold rush, and certainly a lot of wannabe billionaires and investors are going to lose a lot of cash when they discover they jumped into something they have no idea about, but AI and LLMs are already useful and are here to stay. ChatGPT-4 is already useful for all sorts of things. It’s true that it sometimes gets minor points wrong, but it’s often much quicker to ask ChatGPT a question and then verify the answer on Google than to google the question and dig through pages and pages of irrelevant garbage. I’ve successfully asked ChatGPT for recipes, had it write programs (which sometimes have small things wrong with them but are much quicker to fix than to write from scratch), played Dungeons & Dragons with it, and I frequently ask it what the technical term is for various things (which is next to impossible with Google), ask it for general knowledge, trivia, and so on.
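As an aside, that ask-then-verify workflow is easy to script, too. Here’s a minimal sketch using OpenAI’s Python client; it assumes you have the openai package (v1.x) installed and an API key in the OPENAI_API_KEY environment variable, and the model name and question are just examples.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # example model name
    messages=[{
        "role": "user",
        "content": "What is the technical term for a word that is "
                   "spelled the same forwards and backwards?",
    }],
)

# Treat the answer as a starting point to verify, not as gospel.
print(response.choices[0].message.content)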
You need to bear in mind that when an AI suggests glue in a pizza recipe, that fact gets trumpeted all over the news because a) it’s hilarious, and b) it’s validating to read for the sort of people who want AI to go away. What you end up with is perhaps an unrealistic picture of how often AI fails, and you might be led to believe that all AI suggests stupid things like glue in pizza dough, as opposed to understanding that it’s really just Google’s embarrassingly bad one that’s the source of a lot of these more recent failures.

I think a lot of people are also assuming that the technology is static. Honestly, LLMs almost certainly aren’t the endpoint. We’re most likely going to reach the limit of what our current layered neural network topology is capable of, and researchers are almost certainly working on ideas to move beyond it toward something more brain-like and suited for AGI (artificial general intelligence), which is what ChatGPT kind of pretends to be, but isn’t.

Anyway, I’m not great at conclusions. I’d be happy to answer questions about AI, but I’m not interested in talking about what the news or random tech startups have to say about it, because a lot of that stuff is wrong and overhyped.

---

[1] https://www.dailykos.com/stories/2024/6/18/2247347/-I-know-we-officially-hate-AI-but-let-s-not-be-anti-science?pm_campaign=front_page&pm_source=latest_community&pm_medium=web