[HN Gopher] MinGPT: Minimal PyTorch re-implementation of GPT
___________________________________________________________________

MinGPT: Minimal PyTorch re-implementation of GPT

Author : memorable
Score  : 195 points
Date   : 2022-09-06 12:14 UTC (10 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| karpathy wrote:
| Hah, funny to see this on HN. It is a relatively old project, but
| one that I continue to love and still work on. I was trying to
| train a GPT one day and discovered that the available
| implementations were quite complex, spread across many files, and
| took way too many kwargs switches for esoteric/rare options that
| just bloated and complexified the code. But in my head a GPT was a
| super simple, neat, isotropic model, so I got all worked up and
| wrote minGPT.
|
| The project went on to have more impact than I originally imagined
| and made its way into a number of projects and papers. One of
| those I found only a few days ago here:
| https://twitter.com/karpathy/status/1566100736076697600 . What I
| love about these projects is that the authors often "hack up"
| minGPT in code directly. They don't configure a comprehensive
| kwarg monster. I think there's a beauty in that. Very often I wish
| we had more gists and fewer frameworks - to look at code chunks,
| understand them completely, tune them to our needs, and re-use
| them in projects, similar to how bacteria trade little DNA
| plasmids. minGPT is written for those who want that for their GPT
| projects. There are plenty of cons to this approach too;
| ultimately I think there's value in both approaches.
|
| Coming up, the theme of future minGPT development: more examples,
| and more teeth - it should be possible to demonstrate the training
| of relatively serious (~few B) models with minGPT on one n-gpu
| node and reproduce some benchmarks around that scale, but never
| sacrifice its readability.

| ghub-mmulet wrote:
| Thanks for making it! There is immense value in something you can
| just dive into and hack on. I've been hacking on Stable
| Diffusion/latent diffusion these past couple of weeks, and you
| don't know how much time it would have saved me if it had just had
| something similar!

| darawk wrote:
| For anyone else who was new to the phrase "isotropic model":
|
| https://github.com/christianversloot/machine-learning-articl...
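For readers new to the term, "isotropic" here just means that the model is
the same Transformer block, with identical shapes, repeated end to end.
Below is a minimal PyTorch sketch of that idea - illustrative
hyperparameters and class names, not code from minGPT itself:

    import torch
    import torch.nn as nn

    class Block(nn.Module):
        """One pre-norm Transformer block: causal self-attention + MLP."""
        def __init__(self, n_embd, n_head):
            super().__init__()
            self.ln1 = nn.LayerNorm(n_embd)
            self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
            self.ln2 = nn.LayerNorm(n_embd)
            self.mlp = nn.Sequential(
                nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
                nn.Linear(4 * n_embd, n_embd))

        def forward(self, x):
            T = x.size(1)
            # boolean mask, True above the diagonal: no attending to the future
            mask = torch.triu(
                torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
            h = self.ln1(x)
            a, _ = self.attn(h, h, h, attn_mask=mask)
            x = x + a
            x = x + self.mlp(self.ln2(x))
            return x

    class TinyGPT(nn.Module):
        """Embeddings, a stack of identical Blocks, and a language-model head."""
        def __init__(self, vocab_size, block_size,
                     n_layer=4, n_embd=128, n_head=4):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, n_embd)
            self.pos_emb = nn.Embedding(block_size, n_embd)
            self.blocks = nn.Sequential(
                *[Block(n_embd, n_head) for _ in range(n_layer)])
            self.ln_f = nn.LayerNorm(n_embd)
            self.head = nn.Linear(n_embd, vocab_size, bias=False)

        def forward(self, idx):  # idx: (batch, time) integer token ids
            pos = torch.arange(idx.size(1), device=idx.device)
            x = self.tok_emb(idx) + self.pos_emb(pos)
            x = self.blocks(x)
            return self.head(self.ln_f(x))  # logits over the vocabulary

The network stays this uniform no matter how large it gets; scaling up is
essentially just raising n_layer, n_embd, and n_head.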
| albertzeyer wrote:
| This works for an architecture which has been well tuned and
| studied before, like the LSTM or the Transformer.
|
| Once you do research on the model, testing out things, it often
| tends to become such a kwarg monster in many frameworks.
|
| Having everything (relevant) in one file (even in the config file
| itself, with hyperparams) allows you to copy the file for every
| experiment and modify it in place. This avoids the kwargs mess.
| But then the config files are very complex, and can become messy
| in other ways (esp. for research projects). Example:
| https://github.com/rwth-i6/returnn-experiments/blob/master/2...
|
| Such an approach makes it much more flexible and does not mess
| with the baseline code. As you say, it's more like an evolutionary
| DNA-like approach, where you then tend to do crossovers with other
| evolved, good-performing configs, etc.

| jphoward wrote:
| I completely agree! I personally find these powerful new network
| releases border on the depressing, in that they aren't really
| network releases but huge training systems of dispersed YAMLs.
| YOLOv4 was a case in point where I was too overwhelmed to try to
| integrate it into a project I was working on.
|
| PS: you are a hero of mine - I'm an academic medical doctor for
| whom CS231n was my first foray into AI, and since then I've gone
| on to gold medal in a couple of Kaggle competitions and secure 5
| years of higher research funding to pursue clinical AI. I am
| immensely grateful to you and Fei-Fei Li.

| Siira wrote:
| Are there any similarly structured projects around?

| HeckFeck wrote:
| Here was I thinking someone had recreated the GUID Partition Table
| in some form of MicroPython. Perhaps someday.

| rexreed wrote:
| With enough training data and enough GPUs to do the model
| training, you'll be there! Goes to show that for AI, the code
| really isn't the important part. AI is and always has been about
| data and compute.

| s_Hogg wrote:
| Karpathy really seems to have discovered there are a lot of hours
| in the day now that he doesn't work for Tesla.

| liuliu wrote:
| Not only him. The tech boom in the past decade made a lot of great
| programmers rich, and it is a good thing. Look also at how Aras
| Pranckevičius (of Unity fame) is now contributing to Blender.
| (Also, to some extent, Rui (of mold fame) and Raph Levien (of xi
| editor fame), although I'm not certain about their financial
| standing.)

| ShamelessC wrote:
| This implementation is quite old now, actually - although I agree,
| it certainly seems that way otherwise :)

| jstx1 wrote:
| He was doing this kind of stuff while he was at Tesla too -
| https://github.com/karpathy/cryptos

| horseRad wrote:
| Pretty sure he wrote this while working at Tesla also.

| mark_l_watson wrote:
| Nice! I remember studying Karpathy's character RNN code way back -
| a great study resource. Looking forward to understanding this
| example also!

| karpathy wrote:
| I am working on a video lecture series that will step through it
| and "spell it out". Without it, even this code can be a bit opaque
| for someone who is new to the field and e.g. uncomfortable with
| n-dimensional array manipulations or the surrounding language
| modeling concepts.

| derac wrote:
| I love your approach and philosophy around programming. If anyone
| is unaware, Karpathy has a relatively small YouTube channel he
| started a few weeks ago: https://youtu.be/VMj-3S1tku0

| polygamous_bat wrote:
| This is actually a pretty neat, self-contained implementation that
| can super easily be extended beyond stereotypical natural language
| models, for example to create world models for video games [1] or
| to create robot models that can learn to imitate from large,
| chaotic human demonstration data [2] (disclaimer: I'm an author on
| the second one). Basically, GPT (or minGPT) models are EXCELLENT
| sequence modelers, almost to the point where you can throw any
| sensible sequence data at them and hope to get interesting
| results, as long as you don't overfit.
|
| Even though I have only been working on machine learning for
| around six years, it's crazy to see how the landscape has changed
| so fast so recently, including diffusion models and transformers.
| It's not too much to say that we might expect more major
| breakthroughs by the end of this decade, and end up in a place we
| can't even imagine right now!
|
| [1] https://github.com/eloialonso/iris
| [2] https://github.com/notmahi/bet
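A brief sketch of that point: next-token prediction does not care whether
the integers are text tokens, discretized game frames, or robot actions.
Reusing the illustrative TinyGPT class from the sketch above (again an
assumption for illustration, not minGPT's actual API), with random
stand-in data:

    import torch
    import torch.nn.functional as F

    vocab_size, block_size = 256, 32
    model = TinyGPT(vocab_size, block_size)  # toy model from the sketch above
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    # any (batch, time+1) integer sequences will do; here they are random
    seqs = torch.randint(0, vocab_size, (64, block_size + 1))

    for step in range(10):
        inputs, targets = seqs[:, :-1], seqs[:, 1:]  # shift by one position
        logits = model(inputs)                       # (batch, time, vocab)
        loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                               targets.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

This is only the bare recipe; the real work in projects like the two
linked above lies in how the data is tokenized and batched, and in
avoiding the overfitting mentioned in the comment.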
| a-dub wrote:
| > Even though I have only been working on machine learning for
| around six years, it's crazy to see how the landscape has changed
| so fast so recently, including diffusion models and transformers.
|
| It's pretty wild considering how hidden Markov models were
| considered state of the art not all that long ago.

| visarga wrote:
| Some people demean GPT-3, saying it's just a Markov model.

| dang wrote:
| Related:
|
| _Karpathy's MinGPT_ -
| https://news.ycombinator.com/item?id=24189497 - Aug 2020 (102
| comments)
___________________________________________________________________
(page generated 2022-09-06 23:00 UTC)