[HN Gopher] Talking to myself: how I trained GPT2-1.5b for rubbe...
       ___________________________________________________________________
        
       Talking to myself: how I trained GPT2-1.5b for rubber ducking using
       my chat data
        
       Author : Tenoke
       Score  : 155 points
       Date   : 2020-01-23 16:09 UTC (6 hours ago)
        
 (HTM) web link (www.svilentodorov.xyz)
 (TXT) w3m dump (www.svilentodorov.xyz)
        
       | Tenoke wrote:
       | The post includes a link to a Colab where you can achieve the
       | same for free.
       | 
       | Warning though - it took me ~2 months of training (on and off) to
       | get it there.
        
         | sillysaurusx wrote:
         | I can't believe that someone actually used my TPU fork of gpt-2
         | to train 1.5B for months. That was the goal when I made it, but
         | I'm shocked someone actually put in the legwork to do it.
         | 
         | Well done!
         | 
         | What were some of the Colab pain points you ran into? Sometimes
         | Colab unmounts the drive folder for me, or fails to upload any
         | data until the runtime is reset. But those cases have been
         | pretty rare.
         | 
         | Did you have to micromanage disk space much? Google drive gives
         | lots of space, but it goes by pretty fast when each snapshot is
         | 5.6GB.
         | 
         | (Anything I can do to make this process easier? Feature
         | requests / fix requests are always welcome.)
        
           | Tenoke wrote:
           | Thanks again for making it possible!
           | 
           | >What were some of the Colab pain points you ran into?
           | 
            | You've thankfully added fixes for some of the big ones - like
            | how you can't just straight-up delete a file because it gets
            | sent to the Drive's Trash. Emptying it out is a nice approach.
           | 
            | Some of the big annoyances were having to keep the Colab tab
            | open on a machine at all times, dealing with the leftover
            | small files, and Drive adding encoding changes to files, which
            | often made it hard to pull changes even after a git stash and
            | reset --hard. There were also occasional (though not that
            | often overall) complete stops for no reason - not even an
            | error - and mounting Drive takes you out of the notebook to
            | authenticate for no real reason. Different lib versions
            | between their GPU and TPU runtimes, too. Nothing too big,
            | really - just minor annoyances.
           | 
           | >Did you have to micromanage disk space much? Google drive
           | gives lots of space, but it goes by pretty fast when each
           | snapshot is 5.6GB.
           | 
           | Yes, so I bit the bullet and just paid a few $ for Google One
           | to save myself the trouble after a few weeks of dealing with
           | it.
           | 
           | >Anything I can do to make this process easier? Feature
           | requests / fix requests are always welcome
           | 
           | Add a better README. That would probably be the highest value
           | change you can make to the repo.
        
         | bhl wrote:
         | How'd you deal with continuously training with Google Colab?
          | I've noticed there are sometimes I/O errors when loading data
          | from large directories, and runtime disconnects after a few
          | hours that force me to reauthorize Drive access manually.
        
           | Tenoke wrote:
            | Always having it open in a tab in a browser is a big one.
            | Working mostly from Drive, and not letting the Colab
            | instance's disk get nearly full, also helps. Make sure not to
            | overwrite the same files too many times - use different
            | filenames when writing, because there are hidden quotas for
            | "downloading/uploading" a file which you can hit. I still got
            | disconnects occasionally, but not often near the end.
           | 
            | They might've also made it a bit more stable at some point,
            | or I might just have gotten better at avoiding the Colab
            | pitfalls - not sure.
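            | 
            | On the filenames point, something along these lines does the
            | trick (a rough sketch, not my exact setup - the paths and
            | naming scheme here are just illustrative):
            | 
            |     import os, shutil, time
            | 
            |     DRIVE_DIR = "/content/drive/My Drive/gpt2-ckpts"
            |     LOCAL_CKPT = "checkpoint/run1"
            | 
            |     def save_snapshot(step):
            |         # Copy each snapshot under a fresh name instead of
            |         # overwriting the same files, to avoid Drive's
            |         # hidden per-file download/upload quotas.
            |         name = "model-%d-%d" % (step, int(time.time()))
            |         shutil.copytree(LOCAL_CKPT,
            |                         os.path.join(DRIVE_DIR, name))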
        
       | andai wrote:
       | My friend and I both trained GPT2 on our chat logs. It's mostly
       | just hilarious seeing what comes out of it, but I've actually
       | gotten real insight out of "hearing myself talk" -- it's similar
       | _enough_ to my personality that it shows me my interests, bad
       | habits etc. And we can ask each other questions, or write the
       | first half of an answer and see what comes out. It can be pretty
       | weird, but we've actually gotten some great advice out of it too.
       | (When you train it on your own text, it still keeps its "wisdom"
       | from the original model.)
       | 
       | If anyone wants to try, I used this colab thing (I don't even
       | need a GPU! Blows my mind that this is free)
       | 
       | https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRN...
       | 
       | If you use Colab it uploads your data to Google's servers. In
       | this case, they already had our chats (WhatsApp backup to Drive).
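        | 
        | I can't vouch for exactly what that notebook does under the
        | hood, but if you'd rather run things locally in Jupyter, the
        | same kind of fine-tune looks roughly like this with the
        | gpt_2_simple package (a sketch - the file name and step count
        | are just placeholders):
        | 
        |     import gpt_2_simple as gpt2
        | 
        |     model_name = "124M"  # smallest model, easiest on VRAM
        |     gpt2.download_gpt2(model_name=model_name)
        | 
        |     sess = gpt2.start_tf_sess()
        |     gpt2.finetune(sess,
        |                   dataset="chat_export.txt",  # your chat log
        |                   model_name=model_name,
        |                   steps=1000)
        | 
        |     gpt2.generate(sess, prefix="Me:", length=100)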
        
         | prophesi wrote:
         | I tried asking this in the Show HN thread on that exact colab
         | project, but how difficult would it be to set it up in your own
         | local Jupyter notebook if you're okay using your own GPU?
         | 
         | Edit: Ah, I see in another thread
         | (https://news.ycombinator.com/item?id=22129978) that your GPU
          | needs 11GB+ of VRAM to train the model, which my 1080 certainly
         | doesn't have. A friend of mine works at https://spell.run which
         | offers free trials for anyone interested in an alternative to
         | Google. I may give it a shot this weekend.
        
           | andai wrote:
           | https://www.gwern.net/GPT-2#training
           | 
           | My friend said he got it running on 8GB VRAM. But the first
           | time he ran it, I think it wasn't even using his GPU (it took
           | days instead of hours to train though).
        
       | SeanFerree wrote:
       | Love it!
        
       | supernintendo wrote:
        | This is so fun. A question for you (or anyone else familiar with
        | this topic): what hardware would you recommend for someone just
        | getting into training GPT2 models? Would a Radeon RX 580 be
        | enough?
        
         | minimaxir wrote:
         | You cannot train any GPT-2 models with an AMD GPU. Nvidia's
         | CUDA is still the de facto toolkit.
         | 
         | Either use Colab (free), or a preemptible GPU instance on GCE
         | w/ the Deep Learning VM image (relatively cheap). Using
         | consumer GPUs is a recipe for frustration.
        
           | Tenoke wrote:
           | >You cannot train any GPT-2 models with an AMD GPU.
           | 
            | It seems like you can. I know of at least one person who has
            | fine-tuned 1.5B on a 16GB AMD card. I think u/sillysaurusx
            | had some part in it, but apparently translating the code from
            | CUDA was fairly easy.
        
       | sroussey wrote:
       | I want to train on my MacBook. What are the options?
        
         | ReverseCold wrote:
         | One of many GPU cloud providers (paperspace, lambda, etc). If
         | you want to do it for free you can use Google Colab. It won't
         | be fun to train this on a MacBook directly.
        
         | Tenoke wrote:
         | I include the link to the Colab, which means it's trained for
         | free on Google's machines, and you just access it from your
         | browser.
         | 
         | Of course, you might not want to have sensitive data on
         | Google's machines for one reason or another, in which case
         | you'd have to buy an external GPU, or better yet a whole other
         | machine.
        
           | minimaxir wrote:
           | Training the smallest GPT-2 model uses about 11-12GB of GPU
           | VRAM; consumer GPUs cap out at about 8GB.
           | 
           | GPT-2 1.5B will _definitely_ not train on a consumer GPU.
        
             | cyorir wrote:
             | Note that on the extreme end of consumer GPUs, there is the
             | 2080 Ti which comes with 11GB.
        
             | Tenoke wrote:
             | You can't train the full thing, but you can freeze
             | everything except the transformer layers (which is what
             | shawwwn and gwern do anyway even though they do have the
             | memory). You also need gradient checkpointing of course.
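              | 
              | Roughly, the freezing boils down to something like this
              | (a sketch in TF1 style - the variable names assume the
              | layout of the released GPT-2 checkpoints, and loss is
              | the LM loss coming from the model graph):
              | 
              |     import re
              |     import tensorflow as tf
              | 
              |     # Keep only the transformer blocks
              |     # ("model/h0/...", "model/h1/...") trainable; the
              |     # embeddings and final layer norm stay frozen.
              |     all_vars = tf.trainable_variables()
              |     train_vars = [v for v in all_vars
              |                   if re.search(r"model/h\d+/", v.name)]
              | 
              |     opt = tf.train.AdamOptimizer(learning_rate=1e-5)
              |     # Swap tf.gradients for a memory-saving /
              |     # checkpointed version to fit the bigger models.
              |     grads = tf.gradients(loss, train_vars)
              |     train_op = opt.apply_gradients(
              |         zip(grads, train_vars))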
        
               | sroussey wrote:
               | Can anything be done on a mobile device yet?
        
               | Tenoke wrote:
                | Yes, there are a lot of models designed to work okay on
                | mobile, though you'd typically train in the cloud and
                | only run the trained model on the phone. Alternatively,
                | you can train across many phones, which brings a lot of
                | extra challenges but is definitely possible.
               | 
                | Google's very new Reformer[0] would likely be your best
                | bet if you want something truly cutting-edge but have
                | less compute, even as little as a phone's. As far as I
                | know, it hasn't been used on phones yet (again, it's
                | very new), but I bet it can be done.
               | 
               | 0. https://ai.googleblog.com/2020/01/reformer-efficient-
               | transfo...
        
               | sroussey wrote:
               | Interesting! Thank you for the link.
               | 
                | I don't mind training on a desktop and using it on both
                | desktop and mobile. We kind of already have that problem,
                | since we parse Google data for a given Android phone, but
                | the phone doesn't have the memory or compute for the
                | amount of data it has generated over the years - the
                | user will background the app too quickly. So we ask the
                | desktop app to do it, process there, and sync the
                | results back.
        
           | sroussey wrote:
           | Yeah, I don't want to upload.
           | 
           | I would really like to have my app learn the user's speaking
           | style from their data and be able to write out diary entries
           | each day in their own "voice".
        
         | [deleted]
        
       | drcode wrote:
       | ...the moment where he jokes about "turning it on and off again"
       | and his GPT2 doppelganger laughs...
        
       | minimaxir wrote:
       | From anecdotal testing, using the 774M/1.5B GPT-2 models for
       | anything less than _hundreds of megabytes of input data_ will
       | result in _worse_ generation quality than using the smaller 124M
       | /355M models.
       | 
       | The addiction to the larger GPT-2 models is IMO a trap.
        
         | Tenoke wrote:
          | It's definitely not the case for me. I have models trained on
          | the same 14MB dataset (though I needed to tweak more for the
          | 1.5B).
          | 
          | 1.5B outperforms the smaller models here if trained long
          | enough - in this case 1-2 months, as I was doing it all for
          | free on Colab.
          | 
          | One of the big things was batching - it seems like nobody
          | really tries to do larger batches with the biggest models, and
          | without batching, with so little data, the model was getting
          | stuck.
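          | 
          | In practice on Colab this meant accumulating gradients over
          | micro-batches - conceptually just something like this (a
          | sketch in TF1 style; loss, train_vars and opt are assumed to
          | come from the model and optimizer setup):
          | 
          |     import tensorflow as tf
          | 
          |     ACCUM_STEPS = 8  # illustrative micro-batch count
          | 
          |     accum = [tf.Variable(tf.zeros_like(v), trainable=False)
          |              for v in train_vars]
          |     zero_ops = [a.assign(tf.zeros_like(a)) for a in accum]
          |     grads = tf.gradients(loss, train_vars)
          |     accum_ops = [a.assign_add(g)
          |                  for a, g in zip(accum, grads)]
          |     apply_op = opt.apply_gradients(
          |         [(a / ACCUM_STEPS, v)
          |          for a, v in zip(accum, train_vars)])
          | 
          |     # Per update: run zero_ops once, then accum_ops on
          |     # ACCUM_STEPS micro-batches, then apply_op - giving an
          |     # effective batch ACCUM_STEPS times larger.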
        
           | MasterScrat wrote:
            | You trained (fine-tuned) GPT2 for 1-2 months on 14MB of data?
            | 
            | I don't understand how this doesn't massively overfit. How
            | much of those 1-2 months was the model actually training?
        
             | Tenoke wrote:
              | I trained for maybe ~12 hours a day, and some days,
              | especially around Christmas, I didn't train at all. I also
              | lost a lot of days trying out different stuff, or when the
              | weights didn't save to Drive before the Colab timed out.
              | 
              | Having said that, I was training the full model with an
              | accumulated batch size for a while, so it was taking
              | >10min per step. I've also been using pretty low learning
              | rates for most of the latter stages.
              | 
              | Overall the model is currently at ~11k steps and the loss
              | can actually go down further, but after playing with
              | different checkpoints last week, the best one didn't seem
              | to be the newest one, so I left it at that one.
        
         | marmuel wrote:
         | Yep, I can fully confirm. Best results are _by far_ with
         | smaller models (such as openai-gpt).
        
         | sillysaurusx wrote:
         | AI dungeon was trained on 50mb. You might be overfitting. Don't
         | train for too long. You want to transfer knowledge, not replace
         | knowledge.
        
           | brokensegue wrote:
           | Are we talking about training from scratch or fine tuning?
        
             | sillysaurusx wrote:
             | Fine tuning. For training from scratch, you want a dataset
             | of at least 20GB gathered from all corners of the internet.
             | I think OpenAI used around 160GB.
             | 
             | Even if you can't process that much data, merely having it
             | available forces the model to learn a diverse variety of
             | knowledge.
             | 
             | The difficulty of training from scratch (and generating a
             | quality model) vs the difficulty of fine tuning is like the
             | difficulty of becoming fluent in emacs vs using notepad.
             | It's doable, but quality results take focused effort.
             | 
             | It's fun! Definitely within reach of lots of people who
             | wouldn't normally consider themselves data scientists / ML
             | engineers. (I'm one of 'em.)
        
       | mycall wrote:
       | > predict the next word in 40GB of Internet text
       | 
       | This could do wonders for lip reading correction.
        
         | britmob wrote:
          | OpenAI trained the initial 1.5B model on ~160GB of text, so
          | I'm sure it's already going to give amazing results.
        
       | siavosh wrote:
        | These tinkering use cases of GPT2 (including the dungeon game)
        | are amazing to see. As the model improves, it makes me think of
        | essentially everyone having access to a conversational Einstein,
        | Lincoln, etc. - instant friends/advisors from history.
        
       | kqr wrote:
       | This is like a personalised version of Oblique Strategies.
       | Exciting!
        
       | gambler wrote:
       | Can't wait until chat bots trained on someone's messages are used
       | as "evidence" of what that person thinks. It's blatantly obvious
       | that the crowd here would accept this as valid analysis if the
       | whole thing is peppered with appropriate buzzwords.
        
       | fapjacks wrote:
       | There have been a number of posts over the last few days like
       | this about giving (more) of your (sensitive) data to Google. Lots
       | of comments in the threads about exporting and uploading messages
       | from e.g. WhatsApp and Telegram, and a surprising lack of concern
       | about it.
        
       | fredley wrote:
       | Straight out of the _Black Mirror_ episode _Be Right Back_ [0]
       | which is 7 years old.
       | 
        | [SPOILERS] In the episode, the main character uses a service to
        | reconstruct a chat bot (and eventually a lifelike avatar) built
        | from her dead partner's social media history. Eventually she
        | becomes frustrated by its lack of depth (since it's only trained
        | on social media data, it falls into a sort of uncanny valley of
        | comprehension and personality), but she can't part with it,
        | confining it to the attic of her home.
       | 
       | [0]: https://en.wikipedia.org/wiki/Be_Right_Back
        
       | unnouinceput wrote:
       | Quote: "The conversations aren't ideal ..."
       | 
        | Hi Tenoke, you've got it wrong. It will never be ideal, no
        | matter what. I think the opposite: those examples are actually
        | quite ideal. You see yourself from a different perspective, the
        | same way everybody reacts to hearing their own voice - you sound
        | different to yourself than to the people around you. You just
        | "heard" your AI, as crude as you think it is, for the first
        | time. Thank you for this; don't mind if I grab everything you
        | did and do it for myself as well. This is going to be fun!
        
       | mirimir wrote:
       | Hey, I just talk to myself ;)
       | 
       | Sometimes I use different voices, for emphasis.
       | 
       | I actually learned that in a course. The context was having a
       | completion conversation with someone who had died. But it works
       | in other contexts too.
        
       | blazespin wrote:
        | I would be curious to know, when we write, how much of it is
        | self-attention and how much is our forebrain actually trying to
        | make sense. My guess is that the more tired/rushed/burned out
        | you are, the higher the percentage of self-attention.
       | 
       | Sometimes watching the news, it seems like 90% of what they say
       | when they are 'vamping' is just self-attention.
       | 
       | Has anyone posted any GPT / Hacker News generated text yet?
       | Wisdom of the crowds, indeed. It'd be interesting to post using
       | it with light editing, especially something that uses upvotes for
       | training.
       | 
        | One of the things I was thinking about was training on your
        | favorite novel, so you could have a sort of conversation with it
        | / ask it questions - a kind of interactive Cliff's Notes.
        | However, as I looked into it I realized it was still too much of
        | a Markov-chain-like thing to be functionally useful. Fun idea
        | though.
       | 
        | The real win in all of this, of course, is autocompletion in
        | different mediums. Code completion demos are pretty wild -
        | https://tabnine.com/blog/deep/ Come to think of it, you could
        | probably use it for writing academic papers as well, assuming
        | you know the content well.
       | 
       | Self-Attention and Human/Computer interaction is a very brave new
       | world. I don't think people really yet know the potential for
       | seismic shift here.
        
         | minimaxir wrote:
         | I have a very large dump of GPT-2 generated Hacker News titles
         | here: https://github.com/minimaxir/hacker-news-gpt-2
         | 
          | That's just with the smallest 124M model though; on short-form
         | content especially, I'm not convinced of the value of larger
         | models.
        
         | leod wrote:
         | I've trained a Transformer encoder-decoder model (this was
         | slightly before GPT2 came out) to generate HN comments from
         | titles. There is a demo running at https://hncynic.leod.org
        
           | rahimnathwani wrote:
           | This is cool. If you were to cache the results and generate a
           | unique URL for each, people could easily share the funniest
           | ones.
        
             | leod wrote:
             | Thanks! I actually planned to make results shareable at the
             | start, but, knowing the internet, I did not like the idea
             | of being held responsible for whatever content (say
             | offensive or even illegal things) people would put into the
             | titles.
        
           | CDSlice wrote:
            | It doesn't seem very accurate; there isn't nearly enough
            | Electron hate whenever it's in the title.
           | 
           | This is pure gold though:
           | 
           | > How does one make a web app using a standard framework?
           | I've never used it, but it sounds like someone has been able
           | to put together something like a Web app with only one app.
           | 
           | Edit: This is even better.
           | 
           | > Rewriting a Linux kernel in Rust, by hand, is definitely
           | the right thing to do as a beginner/intermediate programmer.
        
             | leod wrote:
             | Ha! In the model's defense, its training data [1] ends in
             | 2017 -- not sure if hatred for Electron was as prevalent
             | back then.
             | 
             | [1] https://archive.org/details/14566367HackerNewsCommentsA
             | ndSto...
        
             | jandrese wrote:
             | > Rewriting a Linux kernel in Rust, by hand, is definitely
             | the right thing to do as a beginner/intermediate
             | programmer.
             | 
             | Absolute perfection.
        
             | nekopa wrote:
             | My favorite, even sounds like it would work:
             | 
             | Title: My Emacs Productivity Tricks/Hacks
             | 
             | hncynic 1 minute ago
             | 
             | I used this for some time and never looked back.
             | 
             | In my .emacs.d file, the arrow keys, a key with a cursor
             | keys (which are the key bindings for the .emacs.d file
             | above) and then a shortcut to switch to the command that
             | makes use of those.
             | 
             | But I now have a full screen keyboard and mouse.
             | 
             | Here's another way to do it:
             | 
             | M-x { C-c }
             | 
             | You go in the current directory, move up the left arrow
             | key, press escape and hit the backspace key.
        
       | leblancfg wrote:
       | > Fun fact - there is a sentence in this post written entirely by
       | the GPT version of Me. I wonder how easy it is to spot.
       | 
       | ...I couldn't spot it. Anyone? Eerie...
        
         | jerf wrote:
         | My best guess is "Additionally it sometimes talks about things
         | that aren't really True - like the back pain in Example 1, and
         | if you play with the different parameters (top_k/top_p and
         | temperature mainly) you can force it to go on long tirades
         | which eventually become nonsensical."
         | 
         | True shouldn't be capitalized like that (influence from sample
         | Python code or another language that uses "True"?), and Example
         | 1 doesn't discuss back pain. I don't know enough about GPT or
         | whatever other possible models may be getting discussed to know
         | whether "top_k/top_p" make sense, though temperature would seem
         | to.
        
       | qnxub wrote:
       | Is this the best way to create a chatbot with my personality? I
        | feel like I would want to fine-tune some things so it gives
        | real responses about my preferences, hobbies, etc.
       | 
       | My use case is preserving my personality for loved ones after I
       | die.
        
       ___________________________________________________________________
       (page generated 2020-01-23 23:00 UTC)