[HN Gopher] Playing games with AIs: The limits of GPT-3 and simi...
___________________________________________________________________
Playing games with AIs: The limits of GPT-3 and similar large
language models

Author : nigamanth
Score  : 73 points
Date   : 2023-01-07 06:19 UTC (16 hours ago)

(HTM) web link (link.springer.com)
(TXT) w3m dump (link.springer.com)

| visarga wrote:
| While I agree with the authors that large language models trained
| only on text lack the ability to distinguish "possible worlds"
| from reality, I think there is a path ahead.
|
| Large language models might be excellent candidates for
| evolutionary methods and RL. They need to learn from solving
| language problems on a massive scale. But problem solving could
| be the medicine that cures GPT-3's fuzziness: a bit of symbolic
| exactness injected into the connectionist system.
|
| For example: "Evolution through Large Models"
| https://arxiv.org/abs/2206.08896
|
| They need learning from validation to complement learning from
| imitation.

| zwaps wrote:
| This is what ChatGPT does, for instance.

| tbalsam wrote:
| In a sense. I thought it was more human-in-the-loop than an
| explicit RL objective (i.e. a potentially somewhat limited
| reward surface, even if there's a reward model trained from it).

| make3 wrote:
| The step from GPT-2 to GPT-3 showed us that any attempt at
| predicting the behavior of sufficiently scaled-up models is
| really futile.

| optimalsolver wrote:
| Would there be any benefit in modeling raw binary sequences
| rather than tokens?
|
| I think text prediction only gets you so far. But I guess you
| could use the same principles to predict the next symbol in a
| binary string. If this binary data represents something like
| videos of physical phenomena, you might get the AI to profound,
| novel insights about the Universe just with next-bit prediction.
|
| Hmmm, maybe even I could code something like that.

| dwaltrip wrote:
| I'm waiting for someone to make a GPT-style model trained for
| video and audio prediction (e.g. frame by frame, perhaps) in
| addition to the existing text prediction. Imagine using a
| significant percentage of YouTube content, for example.
|
| It would probably be insanely expensive. But I feel like it
| would be almost guaranteed to acquire a world model far richer
| and more robust than ChatGPT's.
|
| Human babies learn by watching the world around them. Video
| frame prediction feels much closer to that than text prediction,
| and given the wildly impressive results we are seeing with large
| text prediction models alone, it seems like an obvious next
| step.

| tintor wrote:
| There is the VideoGPT paper. It uses small frame sizes, but that
| will improve with time.

| boredemployee wrote:
| >> Human babies learn by watching the world around them.
|
| While I understood what you meant, I'd just add that babies
| learn by a combination of multisensory triggers (so not only
| _watching_).

| nodemaker wrote:
| The reason this works for tokens is that tokens are embedded in
| a vector space where similar words end up in similar places. The
| same effect could not be achieved with characters or bits. If
| you think about it, our brains also remember words, not
| characters.

| eternalban wrote:
| > our brains also remember words
|
| Word sounds. I can _not_ read without hearing the word. (Now I
| wonder about those born deaf.) Based on that subjective
| experience, which I presume is rather universal among the
| hearing, tokenized phones and phonemes seem promising.
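(An aside illustrating nodemaker's point above: token IDs index rows
of a learned embedding matrix, and related words end up with nearby
vectors. The vectors below are invented purely for illustration; a
real model learns them during training.)

    # Toy sketch: tokens as rows of an embedding matrix. The vectors
    # here are made up for demonstration, not taken from any model.
    import numpy as np

    vocab = {"cat": 0, "dog": 1, "carburetor": 2}
    embeddings = np.array([
        [0.90, 0.80, 0.10],  # "cat"
        [0.85, 0.75, 0.20],  # "dog" (close to "cat")
        [0.10, 0.20, 0.95],  # "carburetor" (far from both)
    ])

    def cosine(a, b):
        # Cosine similarity between two embedding vectors.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(embeddings[vocab["cat"]], embeddings[vocab["dog"]]))         # high
    print(cosine(embeddings[vocab["cat"]], embeddings[vocab["carburetor"]]))  # low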
| worik wrote:
| I am not deaf.
|
| I do not hear words as I read. (As I write, I do.)

| visarga wrote:
| So you are proposing a massive video model, along the lines of
| GPT-3? The architecture is simple, but making it train correctly
| and efficiently is really hard, especially for video.

| [deleted]

| marmadukester39 wrote:
| Is it? Videos are just sequences of frames.

| rdedev wrote:
| Each frame of the video would have to be divided into many
| sequences. At least that's how transformer-based image models
| work. Then you have to account for the audio data in the same
| way too. It just blows up the compute required.

| optimalsolver wrote:
| Not quite. I meant something that models pure binary sequences,
| not higher-level tokens. That way, it could learn from any
| source that can be represented as binary data. It could be
| video, text, audio, or all three at once.
|
| It wouldn't be a "video model", it would be an "anything that
| can be expressed in binary" model.

| visarga wrote:
| Maybe you are interested in this paper:
|
| > Perceiver: General Perception with Iterative Attention
|
| Biological systems perceive the world by simultaneously
| processing high-dimensional inputs from modalities as diverse as
| vision, audition, touch, proprioception, etc. Perceiver is a
| deep learning model that can process multiple modalities, such
| as images, point clouds, audio, and video, simultaneously. It is
| based on the transformer architecture and uses an asymmetric
| attention mechanism to distill a large number of inputs into a
| smaller latent bottleneck. This allows it to scale to handle
| very large inputs and outperform specialized models on
| classification tasks across various modalities.
|
| https://arxiv.org/abs/2103.03206

| optimalsolver wrote:
| Thanks! This looks really interesting.

| decremental wrote:
| That might be too low(?) a resolution. It would be learning
| encodings instead of features of the thing that is being
| encoded. Like training it on terabytes of zip files and
| expecting it to reproduce the files contained in the archives.

| optimalsolver wrote:
| Imagine a single 120-minute movie in a tar file.
|
| How much of this file's raw data would be encoding and metadata,
| vs. the content of the movie?

| decremental wrote:
| The thing is, even the video data outside of the tar file is
| also encoded. Most likely the compressed video data will be
| basically random. You can't train on random data; it's just
| noise. Rather, it would make more sense to train on sequences of
| RGBA pixels.
|
| It's a seductive thought to be able to just throw raw bits at a
| model, regardless of what those bits represent, and have it
| magically attain LLM-like qualities in reproducing the data you
| would want it to.
|
| Something to think about: GPT-3/ChatGPT tokenize at the byte
| level. If they tokenized at the bit level, the model would learn
| the UTF-8 encoding over time. Unicode characters that require
| more than one byte to represent, such as emojis, are not learned
| directly, but the model can still reproduce them.

| stared wrote:
| Tokenization for models like GPT or BERT can be seen as
| compression. That is, frequent words are separate tokens.
| Frequent sequences are separate tokens. On the other hand, if a
| sequence is very uncommon, then it will be split into many
| tokens.
|
| Sure, you can encode bit-by-bit. But that is a fixed-length
| code, which is even worse than character-by-character.
|
| Maybe you only get worse training and inference time.
| But I wouldn't be surprised if the encoding also serves as a
| Bayesian prior, and that with a different encoding you get worse
| results (for a given dataset).

| tbalsam wrote:
| There are a couple of massive intuition leaps here (around
| tokens, and the ease with which predicting one modality extends
| to another), but if you're interested in diving into the field
| at the place where they're asking questions like this, you could
| start by looking at the transition from BPE to the tokenizer we
| have today on the tokenization front, and at Perceiver IO on the
| multimodal generalization front.

| ttul wrote:
| Google Research has a character-based transformer that learns to
| tokenize text rather than relying on hand-coded tokenizers. It
| demonstrates superior performance on a variety of LLM tasks.
|
| If you have the money, you can apply the transformer
| architecture to many different tasks, and people are
| experimenting all the time. I think one of the big challenges is
| always to come up with methods for training such enormous models
| pragmatically, without costs exploding.
|
| [1] https://huggingface.co/docs/transformers/model_doc/canine

| Der_Einzige wrote:
| Related: I wrote "language games" for playing word games with
| word vectors. I've thought about remaking this beyond my
| original weekend project and including the latest language
| models in it.
| https://github.com/Hellisotherpeople/Language-games
___________________________________________________________________
(page generated 2023-01-07 23:00 UTC)
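(An aside illustrating stared's tokenization-as-compression point
above, assuming the Hugging Face "transformers" package is installed
and the GPT-2 tokenizer files can be downloaded: frequent words
typically map to a single token, while rare strings get split into
many sub-word tokens.)

    # Illustrative sketch: BPE-style tokenization acts like compression.
    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")

    for text in ["the", "information", "antidisestablishmentarianism", "zqxjkv"]:
        ids = tok.encode(text)
        print(f"{text!r}: {len(ids)} token(s) {tok.convert_ids_to_tokens(ids)}")

    # Typically the common words come back as one token each, while the
    # rare or nonsense strings are split into several sub-word pieces.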