[HN Gopher] Playing games with AIs: The limits of GPT-3 and simi...
       ___________________________________________________________________
        
       Playing games with AIs: The limits of GPT-3 and similar large
       language models
        
       Author : nigamanth
       Score  : 73 points
       Date   : 2023-01-07 06:19 UTC (16 hours ago)
        
 (HTM) web link (link.springer.com)
 (TXT) w3m dump (link.springer.com)
        
       | visarga wrote:
       | While I agree with the authors that large language models only
       | trained on text lack the ability to distinguish "possible worlds"
       | from reality, I think there is a path ahead.
       | 
       | Large language models might be excellent candidates for
        | evolutionary methods and RL. They need to learn from solving
        | language problems on a massive scale, and problem-solving could
        | be the medicine that cures GPT-3's fuzziness: a bit of symbolic
        | exactness injected into the connectionist system.
       | 
       | For example: "Evolution through Large Models"
       | https://arxiv.org/abs/2206.08896
       | 
       | They need learning from validation to complement learning from
       | imitation.
        
         | zwaps wrote:
         | This is what ChatGPT does, for instance.
        
           | tbalsam wrote:
            | In a sense, I thought it was more human-in-the-loop than an
            | explicit RL objective (i.e. a potentially somewhat limited
            | reward surface, even if there's a reward model trained from
            | it).
        
       | make3 wrote:
        | The step from GPT-2 to GPT-3 showed us that any attempt at
        | predicting the behavior of sufficiently scaled-up models is
        | really futile.
        
       | optimalsolver wrote:
       | Would there be any benefit in modeling raw binary sequences
       | rather than tokens?
       | 
       | I think text prediction only gets you so far. But I guess you
       | could use the same principles to predict the next symbol in a
       | binary string. If this binary data represents something like
        | videos of physical phenomena, you might lead the AI to profound,
        | novel insights about the Universe just with next-bit prediction.
       | 
       | Hmmm, maybe even I could code something like that.
        
         | dwaltrip wrote:
         | I'm waiting for someone to make a GPT-style model trained for
         | video and audio prediction (e.g. frame by frame, perhaps) in
         | addition to the existing text prediction. Imagine using a
         | significant percentage of YouTube content, for example.
         | 
         | It would probably be insanely expensive. But I feel like it
         | would be almost guaranteed to acquire a world model far richer
         | and more robust than ChatGPT's.
         | 
         | Human babies learn by watching the world around them. Video
         | frame prediction feels much closer to that than text
         | prediction, and given the wildly impressive results we are
         | seeing with large text prediction models alone, it seems like
         | an obvious next step.
        
           | tintor wrote:
            | There is the VideoGPT paper. It uses small frame sizes, but
            | that will improve with time.
        
           | boredemployee wrote:
           | >> Human babies learn by watching the world around them.
           | 
           | While I understood what you meant, I'd just add that babies
           | learn by a combination of multisensory triggers (so not only
            | _watching_).
        
         | nodemaker wrote:
          | The reason this works for tokens is that tokens are embedded in
          | a vector space where similar words end up in similar places.
          | The same effect could not be achieved with characters or bits.
          | If you think about it, our brains also remember words, not
          | characters.
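          | 
          | Roughly this picture, in code (toy vocabulary, and random
          | placeholder vectors standing in for trained embeddings):
          | 
          |     import numpy as np
          | 
          |     # each token id maps to a point in a vector space; training
          |     # pushes tokens used in similar contexts toward each other
          |     vocab = {"cat": 0, "dog": 1, "economics": 2}
          |     emb = np.random.default_rng(0).normal(size=(len(vocab), 8))
          | 
          |     def cosine(a, b):
          |         return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
          | 
          |     # after training, this would be much higher than, say,
          |     # cosine(emb[vocab["cat"]], emb[vocab["economics"]])
          |     print(cosine(emb[vocab["cat"]], emb[vocab["dog"]]))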
        
           | eternalban wrote:
           | > our brains also remember words
           | 
           | Word sounds. I can _not_ read without hearing the word. (Now
            | I wonder about those born deaf.) Based on that subjective
            | experience, which I presume is rather universal among the
            | hearing, tokenized phones and phonemes seem promising.
        
             | worik wrote:
             | I am not deaf.
             | 
              | I do not hear words as I read. (As I write, I do.)
        
         | visarga wrote:
          | So you are proposing a massive video model, along the lines of
          | GPT-3? The architecture is simple, but making it train
          | correctly and efficiently is really hard, especially for video.
        
           | [deleted]
        
           | marmadukester39 wrote:
            | Is it? Videos are just sequences of frames.
        
             | rdedev wrote:
              | Each video frame would have to be divided into many patches,
              | which then form a long sequence; at least that's how
              | transformer-based image models work. Then you have to
              | account for audio data in the same way. It just blows up
              | the compute required.
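              | 
              | Back-of-the-envelope, assuming ViT-style 16x16 patches (all
              | numbers here are illustrative, not from any real model):
              | 
              |     # one 16x16 patch of a frame = one "token"
              |     h, w, patch = 224, 224, 16
              |     per_frame = (h // patch) * (w // patch)
              |     print(per_frame)        # 196 tokens per frame
              | 
              |     fps, seconds = 24, 10
              |     print(per_frame * fps * seconds)
              |     # 47040 tokens for a 10-second clip, audio not included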
        
           | optimalsolver wrote:
           | Not quite. I meant something that models pure binary
            | sequences, not higher-level tokens. That way, it could learn
           | from any source that can be represented as binary data. Could
           | be video, text, audio, or all three at once.
           | 
            | It wouldn't be a "video model", it would be an "anything that
            | can be expressed in binary" model.
        
             | visarga wrote:
             | Maybe you are interested in this paper:
             | 
             | > Perceiver: General Perception with Iterative Attention
             | 
             | Biological systems perceive the world by simultaneously
             | processing high dimensional inputs from modalities as
             | diverse as vision, audition, touch, proprioception, etc.
             | Perceiver is a deep learning model that can process
             | multiple modalities, such as images, point clouds, audio,
             | and video, simultaneously. It is based on the transformer
             | architecture and uses an asymmetric attention mechanism to
             | distill a large number of inputs into a smaller latent
             | bottleneck. This allows it to scale to handle very large
             | inputs and outperform specialized models on classification
             | tasks across various modalities.
             | 
             | https://arxiv.org/abs/2103.03206
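              | 
              | Not the actual Perceiver code, just a minimal NumPy sketch
              | of the asymmetric attention idea (a small latent array
              | attends over a much larger input array; sizes are made up):
              | 
              |     import numpy as np
              | 
              |     def cross_attend(latents, inputs):
              |         # latents: (N_lat, d), inputs: (N_in, d)
              |         # with N_lat much smaller than N_in
              |         d = latents.shape[-1]
              |         scores = latents @ inputs.T / np.sqrt(d)
              |         w = np.exp(scores - scores.max(-1, keepdims=True))
              |         w /= w.sum(-1, keepdims=True)
              |         # the huge input is squeezed into the latent array
              |         return w @ inputs
              | 
              |     rng = np.random.default_rng(0)
              |     inputs = rng.normal(size=(50_000, 64))  # any modality
              |     latents = rng.normal(size=(256, 64))    # bottleneck
              |     print(cross_attend(latents, inputs).shape)  # (256, 64)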
        
               | optimalsolver wrote:
               | Thanks! This looks really interesting.
        
         | decremental wrote:
          | That might be too low(?) a resolution. It would be learning the
          | encodings instead of the features of the thing being encoded,
          | like training it on terabytes of zip files and expecting it to
          | reproduce the files contained in the archives.
        
           | optimalsolver wrote:
            | Imagine a single 120-minute movie in a tar file.
            | 
            | How much of this file's raw data would be encoding and
            | metadata, vs. the content of the movie?
        
             | decremental wrote:
              | The thing is, even the video data outside of the tar file is
              | encoded. Most likely the compressed video data will look
              | basically random, and you can't train on random data; it's
              | just noise. It would make more sense to train on sequences
              | of RGBA pixels.
              | 
              | It's a seductive thought to be able to just throw raw bits
              | at a model, regardless of what those bits represent, and
              | have it magically attain LLM-like qualities in reproducing
              | the data you want it to.
              | 
              | Something to think about: GPT-3/ChatGPT tokenize at the byte
              | level. If they tokenized at the bit level, the model would
              | learn the UTF-8 encoding over time. Unicode characters that
              | require more than one byte to represent, such as emojis,
              | are not learned directly, but the model can still reproduce
              | them.
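              | 
              | A quick illustration of that last point (plain Python, not
              | tied to any particular model):
              | 
              |     # one emoji is a single character but four UTF-8 bytes,
              |     # so a byte-level tokenizer never sees it as one symbol
              |     s = "\N{SLIGHTLY SMILING FACE}"
              |     print(list(s.encode("utf-8")))  # [240, 159, 153, 130]
              |     # the model must learn to emit all four bytes in order
              |     print(bytes([240, 159, 153, 130]).decode("utf-8"))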
        
         | stared wrote:
          | Tokenization for models like GPT or BERT can be seen as
          | compression: frequent words and frequent character sequences
          | get their own tokens, while a very uncommon sequence gets split
          | into many tokens.
          | 
          | Sure, you could encode bit-by-bit. But that is a fixed-length
          | code, which is even worse than encoding character-by-character.
          | 
          | Maybe you would only get worse training and inference time. But
          | I wouldn't be surprised if the encoding also serves as a
          | Bayesian prior, and with a different encoding you get worse
          | results (for the same data).
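          | 
          | For instance (a toy lookup, not a real BPE tokenizer; the
          | vocabulary here is made up):
          | 
          |     # frequent strings get their own token; rare strings
          |     # fall back to many small pieces
          |     vocab = {"the", "cat", "ing", "tion"}
          | 
          |     def toy_tokenize(word):
          |         if word in vocab:
          |             return [word]     # common word -> 1 token
          |         return list(word)     # fallback: 1 token per character
          | 
          |     print(toy_tokenize("cat"))      # ['cat'] -> 1 token
          |     print(toy_tokenize("zyzzyva"))  # 7 single-character tokens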
        
         | tbalsam wrote:
          | There are a couple of massive intuition leaps here (around
          | tokens, and the ease with which predicting one modality extends
          | to another), but if you're interested in diving into the field
          | where people are asking questions like this, you could start by
          | looking at the transition from BPE to the tokenizers we have
          | today on the tokenization front, and at PerceiverIO on the
          | multimodal generalization front.
        
         | ttul wrote:
          | Google Research has a character-based transformer, CANINE, that
          | learns to tokenize text rather than relying on hand-coded
          | tokenizers [1]. It demonstrates superior performance on a
          | variety of LLM tasks.
          | 
          | If you have the money, you can apply the transformer
          | architecture to many different tasks, and people are
          | experimenting all the time. I think one of the big challenges
          | is always coming up with methods for training such enormous
          | models pragmatically, without costs exploding.
         | 
         | [1] https://huggingface.co/docs/transformers/model_doc/canine
        
       | Der_Einzige wrote:
        | Related: I wrote "language games" for playing word games with
        | word vectors. I've thought about remaking this beyond my original
        | weekend project and including the latest language models.
        | https://github.com/Hellisotherpeople/Language-games
        
       ___________________________________________________________________
       (page generated 2023-01-07 23:00 UTC)