[HN Gopher] MinGPT: Minimal PyTorch re-implementation of GPT
       ___________________________________________________________________
        
       MinGPT: Minimal PyTorch re-implementation of GPT
        
       Author : memorable
       Score  : 195 points
       Date   : 2022-09-06 12:14 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | karpathy wrote:
       | Hah funny to see this on HN, it is a relatively old project but
       | one that I continue to love and still work on. I was trying to
       | train a GPT one day and discovered that available implementations
       | were quite complex, spread across many files, and took way too
       | many kwargs switches for esoteric/rare options that just bloated
       | and complexified the code. But in my head a GPT was a super
        | simple, neat, isotropic model, so I got all worked up and wrote
       | minGPT.
       | 
       | The project went on to have more impact than I originally
       | imagined and made its way into a number of projects and papers.
       | One of those I found only a few days ago here:
       | https://twitter.com/karpathy/status/1566100736076697600 . What I
       | love about these projects is that the authors often "hack up"
       | minGPT in code directly. They don't configure a comprehensive
       | kwarg monster. I think there's a beauty in that. Very often I
       | wish we had more gists and fewer frameworks - to look at code
       | chunks, understand them completely, tune them to our needs, and
       | re-use them in projects, similar to how bacteria trade little DNA
       | plasmids. minGPT is written for those who want that for their GPT
        | projects. There are plenty of cons to this approach too;
        | ultimately I think there's value in both approaches.
       | 
        | Coming up, the theme of future minGPT development: more examples,
       | and more teeth - it should be possible to demonstrate the
       | training of relatively serious (~few B) models with minGPT on one
       | n-gpu node and reproduce some benchmarks around that scale, but
       | never sacrifice its readability.
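        | 
        | To give a flavor of what "simple and isotropic" means, here's a
        | rough sketch (illustrative only, not the actual minGPT source)
        | of the whole idea in PyTorch: one Block, repeated n_layer times,
        | between an embedding and a linear head:
        | 
        |   import torch
        |   import torch.nn as nn
        |   
        |   class Block(nn.Module):
        |       # one transformer block: causal self-attention then an MLP,
        |       # each behind a pre-LayerNorm with a residual connection
        |       def __init__(self, n_embd, n_head):
        |           super().__init__()
        |           self.ln1 = nn.LayerNorm(n_embd)
        |           self.attn = nn.MultiheadAttention(n_embd, n_head,
        |                                             batch_first=True)
        |           self.ln2 = nn.LayerNorm(n_embd)
        |           self.mlp = nn.Sequential(
        |               nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
        |               nn.Linear(4 * n_embd, n_embd))
        |   
        |       def forward(self, x):
        |           T = x.size(1)
        |           # boolean mask, True above the diagonal: no peeking ahead
        |           m = torch.triu(torch.ones(T, T, dtype=torch.bool,
        |                                     device=x.device), 1)
        |           a = self.ln1(x)
        |           x = x + self.attn(a, a, a, attn_mask=m,
        |                             need_weights=False)[0]
        |           return x + self.mlp(self.ln2(x))
        |   
        |   class GPT(nn.Module):
        |       # "isotropic": the body is just the same Block stacked
        |       # n_layer times at a constant width, nothing else
        |       def __init__(self, vocab_size, block_size,
        |                    n_layer, n_embd, n_head):
        |           super().__init__()
        |           self.tok_emb = nn.Embedding(vocab_size, n_embd)
        |           self.pos_emb = nn.Embedding(block_size, n_embd)
        |           self.blocks = nn.Sequential(
        |               *(Block(n_embd, n_head) for _ in range(n_layer)))
        |           self.ln_f = nn.LayerNorm(n_embd)
        |           self.head = nn.Linear(n_embd, vocab_size, bias=False)
        |   
        |       def forward(self, idx):  # idx: (batch, time) token ids
        |           pos = torch.arange(idx.size(1), device=idx.device)
        |           x = self.tok_emb(idx) + self.pos_emb(pos)
        |           return self.head(self.ln_f(self.blocks(x)))  # logits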
        
         | ghub-mmulet wrote:
          | Thanks for making it! There is immense value in something you
          | can just dive into and hack on. I've been hacking on Stable
          | Diffusion/latent diffusion these past couple of weeks, and you
          | don't know how much time something similar would have saved me!
        
         | darawk wrote:
         | For anyone else who was new to the phrase "isotropic model":
         | 
         | https://github.com/christianversloot/machine-learning-articl...
        
         | albertzeyer wrote:
         | This works for an architecture which has been well tuned and
         | studied before, like LSTM or Transformer.
         | 
          | Once you do research on the model, testing things out, it often
          | tends to become such a kwarg monster in many frameworks.
         | 
          | Having everything (relevant) in one file (even in the config
          | file itself, with hyperparams) allows you to copy the file for
          | every experiment and modify it in place. This avoids the kwargs
          | mess. But then the config files are very complex and can become
          | messy in other ways (esp. for research projects). Example:
          | https://github.com/rwth-i6/returnn-experiments/blob/master/2...
         | 
          | Such an approach is much more flexible and does not mess with
          | the baseline code. As you say, it's more like an
         | evolutionary DNA-like approach, where you then tend to do
         | crossovers with other evolved good-performing configs, etc.
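          | 
          | A rough sketch of the copy-per-experiment pattern (file name
          | and hyperparams made up for illustration): the experiment file
          | is plain Python, so it holds both the numbers and the
          | model-building code:
          | 
          |   # exp_lstm_baseline.py - one self-contained experiment file.
          |   # To try a variant, copy this file and edit the copy in
          |   # place; the baseline file itself is never touched.
          |   import torch.nn as nn
          |   
          |   # hyperparams live right here, not behind kwarg switches
          |   n_layers = 4
          |   hidden_dim = 512
          |   dropout = 0.1
          |   learning_rate = 1e-3
          |   
          |   def build_model():
          |       # the model definition is part of the config, so an
          |       # experiment can change the architecture, not just numbers
          |       return nn.LSTM(input_size=hidden_dim,
          |                      hidden_size=hidden_dim,
          |                      num_layers=n_layers, dropout=dropout)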
        
         | jphoward wrote:
          | I completely agree! I personally find that these powerful new
          | network releases border on the depressing, in that they aren't
          | really network releases but huge training systems of dispersed
          | YAMLs. YOLOv4 was a case in point: I was too overwhelmed to try
          | to integrate it into a project I was working on.
         | 
          | PS: you are a hero of mine - I'm an academic medical doctor for
          | whom CS231n was the first foray into AI. Since then I've gone on
          | to win gold medals in a couple of Kaggle competitions and secure
          | 5 years of higher research funding to pursue clinical AI. I am
         | immensely grateful to you and Fei-Fei Li.
        
         | Siira wrote:
         | Are there any similarly structured projects around?
        
       | HeckFeck wrote:
       | Here was I thinking someone had recreated the GUID Partition
        | Table in some form of MicroPython. Perhaps someday.
        
       | rexreed wrote:
       | With enough training data and enough GPUs to do the model
       | training, you'll be there! Goes to show that for AI, the code
       | really isn't the important part. AI is and always has been about
       | data and compute.
        
       | s_Hogg wrote:
       | Karpathy really seems to have discovered there are a lot of hours
        | in the day now that he doesn't work for Tesla
        
         | liuliu wrote:
          | Not only him. The tech boom of the past decade made a lot of
          | great programmers rich, and that is a good thing. Look also at
          | how Aras Pranckevičius (of Unity fame) is now contributing to
          | Blender. (Also, to some extent, Rui (of mold fame) and Raph
          | Levien (of xi editor fame), though I'm not certain about their
          | financial standing.)
        
         | ShamelessC wrote:
         | This implementation is quite old now actually - although I
         | agree, it certainly seems that way otherwise :)
        
         | jstx1 wrote:
         | He was doing this kind of stuff while he was at Tesla too -
         | https://github.com/karpathy/cryptos
        
           | horseRad wrote:
           | Pretty sure he wrote this while working at Tesla also
        
       | mark_l_watson wrote:
        | Nice! I remember studying Karpathy's character RNN code way
        | back; it was a great study resource. Looking forward to
        | understanding this example too!
        
         | karpathy wrote:
         | I am working on a video lecture series that will step through
          | it and "spell it out". Without it, even this code can be a bit
         | opaque for someone who is new to the field and e.g.
         | uncomfortable with n-dimensional array manipulations or the
         | surrounding language modeling concepts.
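          | 
          | As a taste of the kind of n-dimensional manipulation involved,
          | a tiny sketch (sizes made up) of the standard multi-head
          | reshape that trips people up:
          | 
          |   import torch
          |   
          |   B, T, C, n_head = 2, 8, 64, 4  # batch, time, channels, heads
          |   
          |   x = torch.randn(B, T, C)
          |   # split channels into heads, then swap dims so attention is
          |   # batched over (batch, head) pairs at once
          |   q = x.view(B, T, n_head, C // n_head).transpose(1, 2)
          |   att = (q @ q.transpose(-2, -1)) / (C // n_head) ** 0.5
          |   print(att.shape)  # torch.Size([2, 4, 8, 8]): T x T per head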
        
       | derac wrote:
       | I love your approach and philosophy around programming. If anyone
        | is unaware, Karpathy has a relatively small YouTube channel he
       | started a few weeks ago. https://youtu.be/VMj-3S1tku0
        
       | polygamous_bat wrote:
        | This is actually a pretty neat, self-contained implementation
        | that can be super easily extended beyond stereotypical natural
       | language models, for example to create world models for video
       | games [1] or to create robot models that can learn to imitate
       | from large, chaotic human demonstration data [2] (disclaimer, I'm
       | an author on the second one.) Basically, GPT (or minGPT) models
        | are EXCELLENT sequence modelers, almost to the point where you
        | can throw any sensible sequence data at them and hope to get
        | interesting results, as long as you don't overfit (see the
        | sketch below the links).
       | 
       | Even though I have only been working on machine learning for
       | around six years, it's crazy to see how the landscape has changed
       | so fast so recently, including diffusion models and transformers.
        | It's not too much to say that we might expect more major
        | breakthroughs by the end of this decade, and end up in a place
        | we can't even imagine right now!
       | 
       | [1] https://github.com/eloialonso/iris [2]
       | https://github.com/notmahi/bet
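        | 
        | A rough sketch of what "any sensible sequence data" can mean in
        | practice (bin count and data made up for illustration):
        | discretize continuous values into tokens, and the usual
        | next-token setup applies unchanged:
        | 
        |   import torch
        |   
        |   # made-up example: continuous robot actions in [-1, 1] become
        |   # discrete tokens so a GPT can model them like words
        |   n_bins = 256
        |   actions = torch.rand(100) * 2 - 1                   # fake trajectory
        |   tokens = ((actions + 1) / 2 * (n_bins - 1)).long()  # integer ids
        |   
        |   # next-token prediction, exactly as with text: the input is
        |   # the sequence, the target is the same sequence shifted by one
        |   x, y = tokens[:-1], tokens[1:]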
        
         | a-dub wrote:
         | > Even though I have only been working on machine learning for
         | around six years, it's crazy to see how the landscape has
         | changed so fast so recently, including diffusion models and
         | transformers.
         | 
          | it's pretty wild considering how hidden Markov models were
         | considered state of the art not all that long ago.
        
           | visarga wrote:
            | Some people demean GPT-3 by saying it's just a Markov model.
        
       | dang wrote:
       | Related:
       | 
       |  _Karpathy 's MinGPT_ -
       | https://news.ycombinator.com/item?id=24189497 - Aug 2020 (102
       | comments)
        
       ___________________________________________________________________
       (page generated 2022-09-06 23:00 UTC)