[HN Gopher] Replicating GPT-2 at Home
___________________________________________________________________

Replicating GPT-2 at Home

Author : bkkaggle
Score : 174 points
Date : 2021-01-23 16:52 UTC (6 hours ago)

(HTM) web link (bkkaggle.github.io)
(TXT) w3m dump (bkkaggle.github.io)

| kyberias wrote:
| How many off-the-shelf GPUs are needed to replicate GPT-2 in a
| year?
| minimaxir wrote:
| With current improvements to training performance and parallelism
| (e.g. DeepSpeed: https://www.deepspeed.ai), it wouldn't surprise
| me if creating GPT-2 small from scratch becomes possible with a
| couple of 3080s in _days_, with GPT-2 XL not taking 10x longer.
| moyix wrote:
| I agree. I've been training on 2x 3090s connected via NVLink and
| they're _really_ fast for training language models. I'm actually
| tempted to try to replicate the OP's GPT2 replication using
| Huggingface, DeepSpeed, and OpenWebText, but the GPUs are
| occupied right now training a GPT2-774M C language model...
| Jack000 wrote:
| Does NVLink actually help? It's mostly useful for transferring
| data between GPUs, so I assume you're using pipeline parallelism
| or similar?
| natch wrote:
| What software stack are you using to get your 3090s working? Any
| hitches along the way?
| moyix wrote:
| Linux (Ubuntu 20.04) + CUDA 11.2. For the backend I use PyTorch;
| Tensorflow has some nice optimizations (like XLA, which uses LLVM
| to JIT optimized code for the GPU), but I found it very painful
| to get working reliably, and most of the language modeling stuff
| I've seen uses PyTorch.
|
| For the language model training itself I've been experimenting
| with a few different things. I started off with Huggingface
| because it's very easy to get up and running, and I still use its
| tokenizers library to do BPE training on the C source dataset
| (though there are still some hitches there - other libraries
| expect slightly different formats for the tokenizer model, like
| using different ways to represent the <|endoftext|> marker).
|
| After prototyping the C language model training at home, I tried
| moving the training up to NYU's HPC cluster, which has a bunch of
| 4xV100 and 4xRTX8000 nodes (mainly because the sound of two
| powerful GPU fans running at 100% gets a bit old after a while).
| Unfortunately I discovered that with larger models the GPU-GPU
| communication overhead can be prohibitive (most of the cluster
| nodes only support P2P GPU communication over PCIe, which is a
| _lot_ slower than NVLink), and Huggingface's implementation
| actually performed _worse_ on multiple GPUs than on two 3090s
| with NVLink (I opened an issue to track it here:
| https://github.com/huggingface/transformers/issues/9371).
|
| Currently I'm working on getting DeepSpeed running so that I can
| hopefully get better scaling even in the absence of a fast
| GPU-GPU interconnect. This is again a little bit annoying,
| because it seems like every framework wants a slightly different
| way of representing the tokenizer and training data - I've had to
| preprocess the dataset in about 4 different ways (plain text,
| loose JSON, npy (for DeepSpeed), and a custom indexed binary
| format for Megatron-LM). I'm also hoping to try out Huggingface's
| recently-released DeepSpeed integration, which (if it works)
| would be a really nice combination of usability and performance:
| https://huggingface.co/blog/zero-deepspeed-fairscale
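|
| To give a concrete idea of what that integration looks like,
| here's roughly the shape of the training script (a sketch with
| placeholder paths and hyperparameters, not my exact code;
| "ds_config.json" stands in for a DeepSpeed ZeRO config like the
| ones in the blog post above):
|
|     # sketch: train a GPT2-774M-sized model with the Huggingface
|     # Trainer, handing DeepSpeed a ZeRO config for the multi-GPU
|     # heavy lifting; launch with the deepspeed launcher, e.g.
|     # "deepspeed train.py"
|     from transformers import (DataCollatorForLanguageModeling,
|                               GPT2Config, GPT2LMHeadModel,
|                               GPT2TokenizerFast, TextDataset,
|                               Trainer, TrainingArguments)
|
|     tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
|     # 36 layers / 1280 hidden / 20 heads is roughly the 774M config
|     model = GPT2LMHeadModel(GPT2Config(n_layer=36, n_embd=1280,
|                                        n_head=20))
|
|     dataset = TextDataset(tokenizer=tokenizer,
|                           file_path="train.txt",  # placeholder
|                           block_size=1024)
|     collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
|                                                mlm=False)
|
|     args = TrainingArguments(output_dir="out",
|                              per_device_train_batch_size=4,
|                              gradient_accumulation_steps=8,
|                              fp16=True,
|                              deepspeed="ds_config.json")
|     Trainer(model=model, args=args, data_collator=collator,
|             train_dataset=dataset).train()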
|
| As for other software stack hitches: so, so many. The main one is
| just managing the different versions of CUDA. The 3090 is only
| supported starting with CUDA 11.1, but many packages and
| frameworks only support 11.0 at best. And some of the newer
| things like DeepSpeed use PyTorch extensions, which require you
| to have around the exact version of CUDA that was used to build
| PyTorch. So I've had to do a fair bit of compiling packages from
| source rather than relying on prebuilt ones.
|
| The path of least resistance here is probably to use the NVIDIA
| NGC containers, but it took NVIDIA more than a month to get them
| updated after the 3090 was released, and I find working inside
| containers for everything inconvenient anyway (I hate losing my
| bash history, and I always end up accidentally losing data or
| local changes when I exit a container).
|
| Anyway, this ended up being a bit more rambling than I intended,
| but it was helpful to write it all down, and maybe it'll help
| someone else avoid some stumbling blocks :)
| minimaxir wrote:
| As someone who maintains a package that makes it easy to both
| fine-tune GPT-2 and create your own model from scratch
| (https://github.com/minimaxir/aitextgen), this submission is a
| good run-through of the technical considerations involved in
| building a GPT-2 model.
|
| It's both substantially easier and faster than it was when OpenAI
| released their paper in 2019, thanks both to Huggingface
| Transformers and Tokenizers making the architectures more
| efficient and to other companies streamlining every part of the
| training pipeline.
|
| You don't need a TPU cluster to train a working GPT-2 model,
| although it helps (unfortunately, TPU support for PyTorch-based
| training like aitextgen is fussier). A free GPU on Colab gets you
| most of the way, especially since you can now get a T4 or a V100,
| which lets you use FP16.
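|
| For example, fine-tuning on your own text looks roughly like this
| (a sketch; the file name and step counts are illustrative):
|
|     # sketch: fine-tune the 124M GPT-2 on a plain-text file with
|     # aitextgen, then sample from it
|     from aitextgen import aitextgen
|
|     ai = aitextgen(tf_gpt2="124M", to_gpu=True)  # OpenAI weights
|     ai.train("input.txt",          # your corpus (placeholder)
|              num_steps=3000,
|              generate_every=1000)  # print samples as it trains
|     ai.generate(n=3, prompt="The meaning of life is",
|                 max_length=100)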
| bkkaggle wrote:
| Yep, I started off trying to get it to work with pytorch
| (https://github.com/bkkaggle/lm-training-research-
| project/blo...), then with pt-lightning, but the whole
| one-user-VM-per-TPU-board limitation in pytorch-xla 7-8 months
| ago made me switch over to TF.
| punnerud wrote:
| Just as Google wants you to do. Within 3-5 years you will
| probably see a steep price increase and nowhere else to go.
| bkkaggle wrote:
| Heh. I've been using jax for a couple of months and it's been a
| pretty nice replacement for both pt and tf. It feels like what an
| ML framework would look like if it were built around easy scaling
| and dev-friendliness.
| bravura wrote:
| What do you think would be necessary to generate rhyming text
| with a particular phrasing / rhythm?
|
| e.g. in the style of a particular rapper?
|
| If you just fine-tune on a corpus of their lyrics, you might miss
| the underlying poetic constraints.
|
| If there were an additional prior (a "poetry / assonance / rhyme"
| model), what is the easiest way to constrain generation to
| respect this prior?
|
| Thanks!
| drusepth wrote:
| I wrote "Stylistic Rhyme-bound Poetry Generation or: How You Too
| Can Generate Sonnets in the Style of Kanye West" [1] back in 2017
| as an easy DIY introduction to this topic. You specify the
| rhyming scheme (ABAB CDCD, etc.) and it forces end-of-line rhymes
| to match it.
|
| It uses Markov chains instead of GPT-2, but the approach should
| also work with prompt-based models like GPT-2: for lines that are
| "free" (i.e. no specific word you need to rhyme with), you can
| generate the line normally -- but for lines that need to rhyme
| with a specific word, you can just generate last-word-first,
| working backwards. For a strictly left-to-right model like GPT-2,
| you could probably just reverse your corpus word order, generate
| "reverse" lines with GPT-2 given the previous line + the word you
| need to rhyme with as the prompt, and then reverse the result
| back to "normal" in postprocessing.
|
| [1] https://festivalpeak.com/stylistic-rhyme-bound-poetry-
| genera...
|
| Some examples of the output of this approach:
|
| [2] https://medium.com/words-of-mimicry/kanye-west-
| ballade-1-a6f...
|
| [3] https://medium.com/words-of-mimicry/me-you-and-slow-sip-
| slow...
|
| I'd expect the output to be better with something like GPT-2/3,
| since Markov chains are so twentieth-century, but I was pretty
| happy with the output quality even though it often rhymed the
| same word repeatedly; you could improve that by down-weighting
| previously used words, removing them from the pool of rhyming
| words, and/or backtracking to previous lines when you find
| yourself without other words to rhyme.
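|
| To make the backwards trick concrete, here's roughly what it
| could look like with a Huggingface GPT-2 (a sketch: it assumes
| you've already fine-tuned GPT-2 on a corpus with each line's word
| order reversed, and "./gpt2-reversed" is a hypothetical local
| checkpoint, not a real model):
|
|     # sketch: generate a line that must END with rhyme_word by
|     # prompting a word-order-reversed GPT-2, then un-reversing
|     # the output
|     from transformers import GPT2LMHeadModel, GPT2TokenizerFast
|
|     tokenizer = GPT2TokenizerFast.from_pretrained("./gpt2-reversed")
|     model = GPT2LMHeadModel.from_pretrained("./gpt2-reversed")
|
|     def rhyming_line(rhyme_word, prev_line):
|         # in reversed space the new line STARTS with the rhyme
|         # word, so condition on the reversed previous line + it
|         prompt = " ".join(reversed(prev_line.split()))
|         prompt += " " + rhyme_word
|         ids = tokenizer(prompt, return_tensors="pt").input_ids
|         out = model.generate(ids, max_length=ids.shape[1] + 12,
|                              do_sample=True, top_p=0.9,
|                              pad_token_id=tokenizer.eos_token_id)
|         generated = tokenizer.decode(out[0][ids.shape[1]:])
|         # un-reverse the continuation and re-attach the rhyme
|         # word at the end of the line
|         words = list(reversed(generated.split())) + [rhyme_word]
|         return " ".join(words)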
| minimaxir wrote:
| A paper was recently released for that particular use case
| (https://github.com/markriedl/weirdai), which describes a number
| of technical caveats (and it's technically not using GPT-2).
|
| I do think it's possible to train a GPT-2-esque network to do
| something similar, albeit with some text-encoding shenanigans.
| FL33TW00D wrote:
| As far as I know, you need Colab Pro to get a V100? Did this
| change recently?
| minimaxir wrote:
| It's unclear. I've heard of people getting a V100 without Colab
| Pro, although I do use Colab Pro and get a V100 almost every
| time.
|
| As an aside, if you do get a V100, Colab Pro is by far the
| cheapest way to train an AI model ($10/mo is much, much cheaper
| than the usual $2.48+/hr on GCP!), although you need to sync
| checkpoints to external storage in case the notebook dies.
| fpgaminer wrote:
| > As an aside, if you do get a V100, Colab Pro is by far the
| cheapest way to train an AI model.
|
| But others should be aware that you get what you pay for. Google
| still rate-limited me when I used Colab Pro, and I ran into a
| myriad of other small problems. If that's all one is willing to
| spend to play with AI, 100% go for it. It's a great place to
| start. But if you're at all serious and can afford it, I think a
| local machine with a modest GPU is worth every penny.
| nsomaru wrote:
| Curious: is it better to train locally on something like an 11GB
| 2080 Ti, or to go with Colab and offload checkpoints to S3?
|
| Asking because it seems like V100 performance (or the other paid
| Colab GPU) is worth the occasional instability if you've set up
| checkpointing.
| byefruit wrote:
| Alas, only if you live in the US.
|
| Colab Pro isn't available outside the US (without breaking
| Google's terms).
| polytronic wrote:
| At 17 years of age, the author can understand academic research,
| and has the skills and dedication to work through reconstructing
| the state of the art.
|
| I can't help but feel pride and hope for the future, both the
| author's and the world's.
| fpgaminer wrote:
| I was watching an ICML presentation and was surprised by the
| presenter's (not OP, a different AI prodigy) apparent age. It
| turns out he was 17 and a second-year PhD student. I think he
| graduated from UC Davis when he was 14 or something.
|
| Some people roll wicked real-life DnD character sheets, that's
| for sure.
| deeviant wrote:
| At home, in the cloud, for tens of thousands of $$$.
| dane-pgp wrote:
| "Mom, can I have a GPT-2?"
|
| "No, we have GPT-2 at home."
|
| GPT-2 at home: [Outputs this comment]
| zirkonit wrote:
| First off -- the author has put together an amazing tutorial, and
| it's very enjoyable, so I am by no means throwing shade.
|
| But a week of TPUv3-128 is anywhere between $10k and $20k in TPU
| costs alone; saying that this is an "at home" kind of experiment
| is cheeky at best, clickbait at worst.
| nabla9 wrote:
| Many hobbies cost $10k-$20k. If you work in engineering, that's
| not far from what "at home" hobbies cost.
|
| The time that went into this project was almost certainly worth
| more than $10k.
| 6gvONxR4sf7o wrote:
| I imagine you're speaking about the cost of, e.g., setting up a
| wood shop in your garage, rather than the cost of making
| something in said wood shop. Training this seems more like the
| latter, while the comparable cost is the former.
| nabla9 wrote:
| If you train this model and then use it to do other interesting
| things, training big models is like setting up a wood shop.
| fpgaminer wrote:
| You can download a pretrained, full-size GPT-2 for $0. Training
| it from scratch would be merely for fun. If you have a specific
| application, you can fine-tune the model for far, far less cost
| ($0-$10).
|
| It's not comparable to a hobby. It's comparable to paying $10k to
| make a sandwich.
| Closi wrote:
| If your hobby is building wood furniture, a wood shop helps you
| pursue that hobby into the future. It will improve your projects
| and your enjoyment of the hobby, and the tools also hold some
| residual value.
|
| If your hobby is building AI/ML models, a one-shot trained model
| isn't really going to help you on an ongoing basis. It's an
| amazing single-shot project, but if your hobby is actually ML,
| you probably aren't going to be happy just looking at your
| completed trained model - you are going to want to train a
| bigger, better model.
|
| And if your hobby is building software, you can just download a
| pre-trained model for free.
|
| I don't think the analogy holds the other way.
| bkkaggle wrote:
| Hi, I'm glad you enjoyed it!
|
| Yeah, I totally get your point about the title--the TPU quota
| that I got was close to the equivalent of $20k--but in my
| defense, I don't have access to any compute beyond what I get
| through the TFRC or through Google Colab.
| superasn wrote:
| Yes, it's an amazing tutorial. Thank you.
|
| Speaking as a hobbyist: it used to be that, with enough
| determination, you could create just about any software if you
| kept hacking at it long enough. CPU power or cost was generally
| not the issue; your time and tenacity were.
|
| This has now unfortunately changed, and innovation in software
| (especially ML) is now largely about how deep your pockets are.
| phreeza wrote:
| I think this is quite a rose-colored view of the past. Rendering
| with many graphics techniques was out of reach for hobbyists for
| a long time, for example.
| imaginenore wrote:
| The point is, an average IT professional in the US can easily
| afford it.
___________________________________________________________________
(page generated 2021-01-23 23:00 UTC)