[HN Gopher] Replicating GPT-2 at Home
       ___________________________________________________________________
        
       Replicating GPT-2 at Home
        
       Author : bkkaggle
       Score  : 174 points
       Date   : 2021-01-23 16:52 UTC (6 hours ago)
        
 (HTM) web link (bkkaggle.github.io)
 (TXT) w3m dump (bkkaggle.github.io)
        
       | kyberias wrote:
       | How many off-the-shelf GPUs are needed to replicate GPT-2 in a
       | year?
        
         | minimaxir wrote:
         | With current improvements to training performance and
         | parallelism (e.g. DeepSpeed: https://www.deepspeed.ai ) it
         | wouldn't surprise me if creating GPT-2 small from scratch
          | becomes possible with a couple of 3080s in _days_, with
          | GPT-2 XL not taking 10x longer.
        
           | moyix wrote:
           | I agree. I've been training on 2x3090s connected via NVLink
           | and they're _really_ fast for training language models. I am
            | actually tempted to try to replicate the OP's GPT-2
           | replication using Huggingface, DeepSpeed, and OpenWebText,
           | but the GPUs are occupied right now training a GPT2-774M C
           | language model...
        
             | Jack000 wrote:
              | Does NVLink actually help? It's mostly useful for
              | transferring data between GPUs, so I assume you're using
              | pipeline parallelism or something similar?
        
             | natch wrote:
             | What software stack are you using to get your 3090s
             | working? Any hitches along the way?
        
               | moyix wrote:
                | Linux (Ubuntu 20.04) + CUDA 11.2. For the backend I use
               | PyTorch; Tensorflow has some nice optimizations (like
               | XLA, which uses LLVM to JIT optimized code for the GPU),
               | but I found it very painful to get working reliably, and
               | most of the language modeling stuff I've seen uses
               | PyTorch.
               | 
               | For the language model training itself I've been
               | experimenting with a few different things. I started off
               | with Huggingface because it's very easy to get up and
               | running, and I still use its tokenizers library to do BPE
               | training on the C source dataset (though there are still
               | some hitches there - other libraries expect slightly
               | different formats for the tokenizer model, like using
               | different ways to represent the <|endoftext|> marker).
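                | 
                | A minimal sketch of that BPE training step with the
                | tokenizers library (paths and vocab size here are
                | illustrative, not necessarily what I actually used):
                | 
                |     from tokenizers import ByteLevelBPETokenizer
                | 
                |     # train a byte-level BPE tokenizer on the C corpus
                |     tokenizer = ByteLevelBPETokenizer()
                |     tokenizer.train(
                |         files=["c_corpus.txt"],  # illustrative path
                |         vocab_size=50257,
                |         special_tokens=["<|endoftext|>"],
                |     )
                |     # writes vocab.json and merges.txt to the current dir
                |     tokenizer.save_model(".")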
               | 
               | After prototyping the C language model training at home,
               | I tried moving the training up to NYU's HPC cluster,
               | which has a bunch of 4xV100 and 4xRTX8000 nodes (mainly
               | because the sound of two powerful GPU fans running at
               | 100% gets a bit old after a while). Unfortunately I
               | discovered that with larger models the GPU-GPU
               | communication overhead can be prohibitive (most of the
               | cluster nodes only support P2P GPU communication over
               | PCIe, which is a _lot_ slower than NVLink), and
                | Huggingface's implementation actually performed _worse_
                | on multiple GPUs than on two 3090s with NVLink (I opened
                | an issue to track it here:
                | https://github.com/huggingface/transformers/issues/9371
                | ).
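                | 
                | (If you want to check what kind of GPU-GPU path a node
                | has: nvidia-smi topo -m shows the link type, and a quick
                | PyTorch check for whether P2P is possible at all looks
                | roughly like this -- purely illustrative:)
                | 
                |     import torch
                | 
                |     # whether GPU 0 and GPU 1 can do P2P transfers at
                |     # all; nvidia-smi topo -m shows NVLink vs PCIe
                |     if torch.cuda.device_count() > 1:
                |         print("P2P 0<->1:",
                |               torch.cuda.can_device_access_peer(0, 1))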
               | 
               | Currently I'm working on getting DeepSpeed running so
               | that I can hopefully get better scaling even in the
               | absence of a fast GPU-GPU interconnect. This is again a
               | little bit annoying, because it seems like every
               | framework wants a slightly different way of representing
               | the tokenizer and training data - I've had to preprocess
               | the dataset in about 4 different ways (plain text, loose
               | JSON, npy (for DeepSpeed), and a custom indexed binary
               | format for Megatron-LM). I'm also hoping to try out
               | Huggingface's recently-released DeepSpeed integration,
               | which (if it works) would be a really nice combination of
               | usability and performance:
               | https://huggingface.co/blog/zero-deepspeed-fairscale
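                | 
                | If the integration works the way that post describes,
                | usage should look roughly like this (config values are
                | placeholders, not a tuned setup, and this is a sketch,
                | not my actual training script):
                | 
                |     # train.py -- launched with: deepspeed train.py
                |     import argparse, json
                |     import torch
                |     from transformers import (GPT2Config, GPT2LMHeadModel,
                |                               Trainer, TrainingArguments)
                | 
                |     # the deepspeed launcher passes --local_rank to
                |     # each process
                |     parser = argparse.ArgumentParser()
                |     parser.add_argument("--local_rank", type=int, default=-1)
                |     local_rank = parser.parse_args().local_rank
                | 
                |     # placeholder ZeRO stage 2 config; needs tuning
                |     ds_config = {"train_micro_batch_size_per_gpu": 4,
                |                  "zero_optimization": {"stage": 2},
                |                  "fp16": {"enabled": True}}
                |     with open("ds_config.json", "w") as f:
                |         json.dump(ds_config, f)
                | 
                |     class DummyDataset(torch.utils.data.Dataset):
                |         # stand-in for the real tokenized corpus
                |         def __len__(self): return 512
                |         def __getitem__(self, i):
                |             ids = torch.randint(0, 50257, (128,))
                |             return {"input_ids": ids, "labels": ids}
                | 
                |     model = GPT2LMHeadModel(GPT2Config())  # GPT-2 small
                |     args = TrainingArguments(
                |         output_dir="out", fp16=True,
                |         per_device_train_batch_size=4,
                |         local_rank=local_rank,
                |         deepspeed="ds_config.json")
                |     Trainer(model=model, args=args,
                |             train_dataset=DummyDataset()).train()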
               | 
               | As for other software stack hitches: so, so many. The
               | main one is just managing the different versions of CUDA.
               | The 3090 is only supported starting with CUDA 11.1, but
               | many packages and frameworks only support 11.0 at best.
               | And some of the newer things like DeepSpeed use PyTorch
               | extensions, which require you to have the exact version
               | of CUDA around that was used to build PyTorch. So I've
               | had to do a fair bit of compiling packages from source
               | rather than relying on prebuilt packages.
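                | 
                | (For what it's worth, the CUDA version PyTorch was built
                | against is visible from Python, which helps when trying
                | to match the toolkit for building extensions:)
                | 
                |     import torch
                | 
                |     # CUDA / cuDNN versions PyTorch was compiled against;
                |     # extensions need a matching CUDA toolkit installed
                |     print(torch.version.cuda)
                |     print(torch.backends.cudnn.version())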
               | 
               | The path of least resistance here is probably to use the
               | NVIDIA NGC containers, but it took NVIDIA more than a
               | month to get them updated after the 3090 was released,
               | and I find working inside containers for everything
               | inconvenient anyway (I hate losing my bash history, and I
               | always accidentally end up losing data or local changes
               | when I exit a container).
               | 
               | Anyway, this ended up being a bit more rambling than I
               | intended, but it was helpful to write it all down and
               | maybe it'll help someone else avoid some stumbling blocks
               | :)
        
       | minimaxir wrote:
        | As someone who maintains a package that makes it easy to both
        | fine-tune GPT-2 and create your own model from scratch
        | (https://github.com/minimaxir/aitextgen), this submission is a
        | good run-through of the technical considerations that go into
        | building a GPT-2 model.
       | 
        | It's both substantially easier and faster than it was when
        | OpenAI released their paper in 2019, thanks both to Huggingface
        | Transformers and Tokenizers making the architectures more
        | efficient and to other companies streamlining the training
        | process and making it more efficient at every stage of the
        | pipeline.
       | 
       | You don't need a TPU cluster to train a working GPT-2 model,
       | although it helps (unfortunately TPU support on PyTorch-based
       | training like aitextgen is more fussy). A free GPU on Colab gets
        | you most of the way, especially since you can now get a T4 or a
        | V100, which lets you use FP16.
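        | 
        | For reference, the basic Colab fine-tuning loop with aitextgen
        | looks roughly like this (argument names from memory; check the
        | repo's docs and notebooks for the current API):
        | 
        |     from aitextgen import aitextgen
        | 
        |     # load the pretrained 124M GPT-2 onto the Colab GPU
        |     ai = aitextgen(tf_gpt2="124M", to_gpu=True)
        | 
        |     # fine-tune on a plain-text file; checkpoints go to
        |     # ./trained_model by default
        |     ai.train("dataset.txt", num_steps=3000, generate_every=500)
        | 
        |     ai.generate(n=3, prompt="The meaning of life is")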
        
         | bkkaggle wrote:
          | Yep, I started off trying to get it to work with PyTorch
          | (https://github.com/bkkaggle/lm-training-research-
          | project/blo...), then with pt-lightning, but the whole one-
          | user-VM-per-TPU-board limitation in pytorch-xla 7-8 months
          | ago made me switch over to TF.
        
           | punnerud wrote:
            | Just as Google wants you to do. Within 3-5 years you will
            | probably see a steep price increase and nowhere else to go.
        
             | bkkaggle wrote:
              | heh. I've been using JAX for a couple of months and it's
              | been a pretty nice replacement for both PT and TF. It
              | feels like what an ML framework would look like if it were
              | built around easy scaling and dev friendliness.
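              | 
              | The "easy scaling" part is mostly that transforms like
              | jit/grad/pmap just compose. A toy data-parallel step (a
              | sketch, not from my actual training code):
              | 
              |     import functools
              |     import jax
              |     import jax.numpy as jnp
              | 
              |     def loss(w, x, y):
              |         return jnp.mean((x @ w - y) ** 2)
              | 
              |     # one replica per local device; gradients averaged
              |     # across devices with pmean
              |     @functools.partial(jax.pmap, axis_name="dev")
              |     def step(w, x, y):
              |         g = jax.grad(loss)(w, x, y)
              |         g = jax.lax.pmean(g, axis_name="dev")
              |         return w - 0.01 * g
              | 
              |     n = jax.local_device_count()
              |     w = jnp.zeros((n, 8))      # leading axis = devices
              |     x = jnp.ones((n, 32, 8))
              |     y = jnp.ones((n, 32))
              |     w = step(w, x, y)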
        
         | bravura wrote:
         | What do you think would be necessary to generate rhyming text
         | with a particular phrasing / rhythm?
         | 
         | e.g. in the style of a particular rapper?
         | 
         | If you just fine-tune on a corpus of their lyrics, you might
         | miss the underlying poetic constraints.
         | 
         | If there were an additional prior (a "poetry / assonance /
         | rhyme" model), what is the easiest way to constrain generation
         | to respect this prior?
         | 
         | Thanks!
        
           | drusepth wrote:
           | I wrote "Stylistic Rhyme-bound Poetry Generation or: How You
           | Too Can Generate Sonnets in the Style of Kanye West" [1] back
           | in 2017 for an easy DIY introduction to this topic. You
           | specify the rhyming scheme (ABAB CDCD etc) and it forces end-
           | line rhymes around it.
           | 
           | It uses Markov chains instead of GPT-2, but the approach
           | should work with prompt-based things like GPT-2 also: for
           | lines that are "free" (e.g. no specific word you need to
           | rhyme with), you can generate the line normally -- but for
           | lines you need to rhyme with a specific word, you can just
            | generate last-word-first and work backwards. For a strictly
            | left-to-right model like GPT-2, you could probably just
           | reverse your corpus word order, generate "reverse" lines with
           | GPT-2 given the previous line + word you need to rhyme with
           | as the prompt, and then reverse it back to "normal" in
           | postprocessing.
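            | 
            | A rough sketch of that reversing trick (generate() here
            | stands in for whatever model you fine-tuned on the word-
            | reversed corpus -- the names are illustrative):
            | 
            |     def reverse_words(line: str) -> str:
            |         # "the cat sat" -> "sat cat the"
            |         return " ".join(reversed(line.split()))
            | 
            |     def rhymed_line(prev_line, rhyme_word, generate):
            |         # prompt the reversed-text model with the reversed
            |         # previous line plus the rhyme word, so the rhyme
            |         # word comes out first, then un-reverse the result
            |         prompt = reverse_words(prev_line) + "\n" + rhyme_word
            |         return reverse_words(generate(prompt))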
           | 
           | [1] https://festivalpeak.com/stylistic-rhyme-bound-poetry-
           | genera...
           | 
           | Some examples of the output of this approach:
           | 
           | [2] https://medium.com/words-of-mimicry/kanye-west-
           | ballade-1-a6f...
           | 
           | [3] https://medium.com/words-of-mimicry/me-you-and-slow-sip-
           | slow...
           | 
           | I'd expect the output to be better with something like
           | GPT-2/3, since Markov chains are so twentieth-century, but I
            | was pretty happy with the output quality even though it often
            | rhymed the same word repeatedly; you could improve it by
            | down-weighting previously used words, removing them from the pool
           | of rhyming words, and/or backtracking to previous lines when
           | you find yourself without other words to rhyme.
        
           | minimaxir wrote:
           | A paper was recently released for that particular use case
            | (https://github.com/markriedl/weirdai), which describes
           | a number of technical caveats (and it's technically not using
           | GPT-2).
           | 
           | I do think it's possible to train a GPT-2-esque network to do
           | something similar, albeit with some text encoding
           | shenanigans.
        
         | FL33TW00D wrote:
         | As far as I know to get a V100 you need Colab Pro? Did this
         | change recently?
        
           | minimaxir wrote:
            | It's unclear. I've heard of people getting the V100 without
            | Colab Pro, though I do use Colab Pro and get a V100 almost
            | every time.
           | 
            | As an aside, if you do get a V100, Colab Pro is by far the
            | cheapest way to train an AI model. ($10/mo is much, much
            | cheaper than the usual $2.48+/hr on GCP!) Although you do
            | need to sync checkpoints to external storage in case the
            | notebook dies.
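            | 
            | In Colab the simplest version of that safety net is just
            | mounting Drive and pointing your output directory at it
            | (the path is illustrative):
            | 
            |     from google.colab import drive
            | 
            |     # mount Drive so checkpoints survive the VM dying
            |     drive.mount("/content/drive")
            | 
            |     # point whatever trainer you use at a Drive folder
            |     output_dir = "/content/drive/My Drive/gpt2-checkpoints"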
        
             | fpgaminer wrote:
             | > As an aside, if you do get a V100, Colab Pro is by-far
             | the cheapest way to train an AI model.
             | 
             | But others should be aware that you get what you pay for.
             | Google still rate limited me when I used Colab Pro, and I
             | ran into a myriad of other small problems. If that's all
             | one is willing to spend to play with AI, 100% go for it.
             | It's a great place to start. But if you're at all serious
             | and can afford it, I think a local machine with a modest
             | GPU is worth every penny.
        
               | nsomaru wrote:
                | Curious: is it better to train locally on something like
                | a 2080 Ti 11GB, or go for Colab and offload checkpoints
                | to S3?
                | 
                | Asking because it seems V100 performance (or the other
                | paid Colab GPU) is worth the occasional instability if
                | you've set up checkpoints.
        
             | byefruit wrote:
             | Alas, only if you live in the US.
             | 
             | Colab Pro isn't available outside the US (without breaking
             | Google's terms).
        
       | polytronic wrote:
        | At 17 years of age, the author can understand academic research
        | and has the skills and dedication to go through the exercise of
        | reconstructing the state of the art.
       | 
        | I can't help but feel pride and hope for the future, both the
        | author's and the world's.
        
         | fpgaminer wrote:
         | I was watching an ICML presentation and was surprised by the
          | presenter's (not OP, a different AI prodigy) apparent age.
          | Well, it turns out he was 17 and a 2nd-year PhD student. I
          | think he graduated from UC Davis when he was 14 or something.
         | 
          | Some people roll wicked real-life DnD character sheets, that's
         | for sure.
        
       | deeviant wrote:
       | At home, in the cloud, for tens of thousands of $$$.
        
         | dane-pgp wrote:
         | "Mom, can I have a GPT-2?"
         | 
         | "No, we have GPT-2 at home."
         | 
         | GPT-2 at home: [Outputs this comment]
        
       | zirkonit wrote:
        | First off -- the author has written an amazing tutorial, and
        | it's very enjoyable, so I am by no means throwing shade.
       | 
       | But a week of TPUv3-128 is anywhere between $10k and $20k in TPU
       | costs alone; saying that this is an "at home" kind of experiment
       | is cheeky at best, clickbait at worst.
        
         | nabla9 wrote:
          | Many hobbies cost $10k-$20k. If you work in engineering,
          | that's not far from what "at home" hobbies cost.
         | 
         | The time that went into this project was almost certainly worth
         | more than $10k.
        
           | 6gvONxR4sf7o wrote:
           | I imagine you're speaking about the cost of e.g. setting up a
           | wood shop in your garage, rather than the cost of making
           | something in said wood shop. Training this seems more like
           | the latter, while the comparable cost is the former.
        
             | nabla9 wrote:
             | If you train this model and then use it to do other
              | interesting things, then training big models is like
              | setting up a wood shop.
        
               | fpgaminer wrote:
                | You can download a pretrained, full-size GPT-2 for $0.
                | Training it from scratch would be merely for fun. If you
                | have a specific application, you can fine-tune the model
                | for far, far less ($0-$10).
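                | 
                | e.g. with Huggingface Transformers, grabbing the full
                | 1.5B-parameter model is just this (it's a multi-GB
                | download, but it costs nothing):
                | 
                |     from transformers import (GPT2LMHeadModel,
                |                               GPT2TokenizerFast)
                | 
                |     # downloads the GPT-2 XL weights for free
                |     tok = GPT2TokenizerFast.from_pretrained("gpt2-xl")
                |     model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
                | 
                |     ids = tok("GPT-2 at home means",
                |               return_tensors="pt").input_ids
                |     out = model.generate(ids, max_length=30)
                |     print(tok.decode(out[0]))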
               | 
               | It's not comparable to a hobby. It's comparable to paying
               | $10k to make a sandwich.
        
               | Closi wrote:
               | If your hobby is building wood furniture, a wood shop
                | helps you pursue that hobby into the future. It will
                | improve your projects and add to your enjoyment of the
                | hobby. The
               | tools also hold some sort of residual value.
               | 
               | If your hobby is building AI/ML models, a one-shot
                | trained model isn't really going to help you on an
                | ongoing basis. It's an amazing single-shot project, but
               | if your hobby is actually ML then you probably aren't
               | going to be happy just looking at your completed trained
               | model - you are going to want to train a bigger, better
               | model.
               | 
               | And if your hobby is building software, you can just
               | download a pre-trained model for free.
               | 
               | I don't think the analogy holds the other way.
        
         | bkkaggle wrote:
         | Hi, I love that you enjoyed it!
         | 
          | Yeah, I totally get your point about the title -- the TPU
          | quota that I got was roughly the equivalent of $20k -- but in
          | my defense, I don't have any access to compute beyond what I
          | get through the TFRC or through Google Colab.
        
           | superasn wrote:
           | Yes it's an amazing tutorial. Thank you.
           | 
            | Speaking as a hobbyist: it used to be that if you had enough
            | determination you could create just about any software by
            | hacking at it long enough. CPU power or cost was generally
            | not an issue; your time and tenacity were.
           | 
            | This has unfortunately changed, and innovation in software
            | (especially ML) is now largely about how deep your pockets
            | are.
        
             | phreeza wrote:
              | I think this is quite a rose-colored view of the past.
              | Many graphics rendering techniques were out of reach for
              | hobbyists for a long time, for example.
        
         | imaginenore wrote:
         | The point is, an average IT professional in the US can easily
         | afford it.
        
       ___________________________________________________________________
       (page generated 2021-01-23 23:00 UTC)