[HN Gopher] Talking to myself: how I trained GPT2-1.5b for rubbe... ___________________________________________________________________ Talking to myself: how I trained GPT2-1.5b for rubber ducking using my chat data Author : Tenoke Score : 155 points Date : 2020-01-23 16:09 UTC (6 hours ago) (HTM) web link (www.svilentodorov.xyz) (TXT) w3m dump (www.svilentodorov.xyz) | Tenoke wrote: | The post includes a link to a Colab where you can achieve the | same for free. | | Warning though - it took me ~2 months of training (on and off) to | get it there. | sillysaurusx wrote: | I can't believe that someone actually used my TPU fork of gpt-2 | to train 1.5B for months. That was the goal when I made it, but | I'm shocked someone actually put in the legwork to do it. | | Well done! | | What were some of the Colab pain points you ran into? Sometimes | Colab unmounts the drive folder for me, or fails to upload any | data until the runtime is reset. But those cases have been | pretty rare. | | Did you have to micromanage disk space much? Google Drive gives | lots of space, but it goes by pretty fast when each snapshot is | 5.6GB. | | (Anything I can do to make this process easier? Feature | requests / fix requests are always welcome.) | Tenoke wrote: | Thanks again for making it possible! | | >What were some of the Colab pain points you ran into? | | You've thankfully added fixes for some of the big ones - like | how you can't just straight delete a file, because it sends it | to Drive's Trash. Emptying that out is a nice approach. | | Some of the big annoyances were having to keep the Colab tab | open on a machine at all times. Dealing with the leftover | small files. Drive adding encoding changes to files, thus | often making it hard to pull changes even if I git stash and | reset --hard. Occasional (though not that often overall) | complete stops for no reason - not even an error. Mounting | Drive takes you out of the notebook to auth for no real | reason.
Different lib versions between their GPU and TPU | runtimes. Nothing too big, really - just minor annoyances. | | >Did you have to micromanage disk space much? Google drive | gives lots of space, but it goes by pretty fast when each | snapshot is 5.6GB. | | Yes, so I bit the bullet and just paid a few $ for Google One | to save myself the trouble after a few weeks of dealing with | it. | | >Anything I can do to make this process easier? Feature | requests / fix requests are always welcome | | Add a better README. That would probably be the highest-value | change you can make to the repo. | bhl wrote: | How'd you deal with continuously training with Google Colab? | I've noticed there are sometimes I/O errors when loading data | from large directories, and runtime disconnects after a few | hours that force me to reauthorize Drive access manually. | Tenoke wrote: | Always having it open in a tab in a browser is a big one. | Working mostly from Drive and not being almost out of space | on the Colab's disk also helps. Make sure not to write over | the same files too many times - use different filenames | when writing, as there are hidden quotas for | "downloading/uploading" a file which you can hit. I still got | disconnects occasionally, but not often near the end. | | They might've also made it a bit more stable at some point, | or I might have learned better how to avoid the Colab | pitfalls - not sure. | andai wrote: | My friend and I both trained GPT2 on our chat logs. It's mostly | just hilarious seeing what comes out of it, but I've actually | gotten real insight out of "hearing myself talk" -- it's similar | _enough_ to my personality that it shows me my interests, bad | habits etc. And we can ask each other questions, or write the | first half of an answer and see what comes out. It can be pretty | weird, but we've actually gotten some great advice out of it too. | (When you train it on your own text, it still keeps its "wisdom" | from the original model.)
| | If anyone wants to try, I used this colab thing (I don't even | need a GPU! Blows my mind that this is free) | | https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRN... | | If you use Colab it uploads your data to Google's servers. In | this case, they already had our chats (WhatsApp backup to Drive). | prophesi wrote: | I tried asking this in the Show HN thread on that exact colab | project, but how difficult would it be to set it up in your own | local Jupyter notebook if you're okay using your own GPU? | | Edit: Ah, I see in another thread | (https://news.ycombinator.com/item?id=22129978) that your GPU | needs 11GB+ of VRAM to train the data, which my 1080 certainly | doesn't have. A friend of mine works at https://spell.run which | offers free trials for anyone interested in an alternative to | Google. I may give it a shot this weekend. | andai wrote: | https://www.gwern.net/GPT-2#training | | My friend said he got it running on 8GB VRAM. But the first | time he ran it, I think it wasn't even using his GPU (it took | days instead of hours to train though). | SeanFerree wrote: | Love it! | supernintendo wrote: | This is so fun. A question for you (or anyone else familiar with | this topic): what hardware would you recommend for someone just | getting into training GPT2 models? Would a Radeon RX 580 be | enough? | minimaxir wrote: | You cannot train any GPT-2 models with an AMD GPU. Nvidia's | CUDA is still the de facto toolkit. | | Either use Colab (free), or a preemptible GPU instance on GCE | w/ the Deep Learning VM image (relatively cheap). Using | consumer GPUs is a recipe for frustration. | Tenoke wrote: | >You cannot train any GPT-2 models with an AMD GPU. | | It seems like you can. I know of at least one person who has | finetuned 1.5b on a 16GB AMD. I think u/sillysaurusx had | some part in it, but apparently translating the code from | CUDA was fairly easy. | sroussey wrote: | I want to train on my MacBook. What are the options?
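Tenoke's advice upthread - write snapshots under different filenames because Drive has hidden per-file "downloading/uploading" quotas - is easy to sketch. The helper below is purely illustrative (the paths, the timestamped naming scheme, and the `keep` pruning policy are invented, not taken from the Colab):

```python
# Illustrative sketch: copy each checkpoint to Drive under a fresh
# timestamped name so no single file accumulates enough traffic to
# trip the hidden per-file quotas, then prune old snapshots so the
# ~5.6GB copies don't fill Drive.
import os
import shutil
import time

def save_checkpoint(src_path, drive_dir, keep=3):
    """Copy a local checkpoint to drive_dir under a unique name and
    keep only the `keep` newest snapshots."""
    os.makedirs(drive_dir, exist_ok=True)
    name = "model-%d.ckpt" % time.time_ns()   # unique, sorts by age
    shutil.copy(src_path, os.path.join(drive_dir, name))
    # Prune the oldest snapshots. On a mounted Drive, deletion goes to
    # Drive's Trash and still counts against storage until emptied.
    snapshots = sorted(f for f in os.listdir(drive_dir) if f.endswith(".ckpt"))
    for old in snapshots[:-keep]:
        os.remove(os.path.join(drive_dir, old))
    return name
```

Because every snapshot lands under a fresh name, the same Drive file is never rewritten, which is the behavior the quota advice is aimed at.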
| ReverseCold wrote: | One of many GPU cloud providers (Paperspace, Lambda, etc). If | you want to do it for free you can use Google Colab. It won't | be fun to train this on a MacBook directly. | Tenoke wrote: | I include the link to the Colab, which means it's trained for | free on Google's machines, and you just access it from your | browser. | | Of course, you might not want to have sensitive data on | Google's machines for one reason or another, in which case | you'd have to buy an external GPU, or better yet a whole other | machine. | minimaxir wrote: | Training the smallest GPT-2 model uses about 11-12GB of GPU | VRAM; consumer GPUs cap out at about 8GB. | | GPT-2 1.5B will _definitely_ not train on a consumer GPU. | cyorir wrote: | Note that on the extreme end of consumer GPUs, there is the | 2080 Ti which comes with 11GB. | Tenoke wrote: | You can't train the full thing, but you can freeze | everything except the transformer layers (which is what | shawwwn and gwern do anyway, even though they do have the | memory). You also need gradient checkpointing, of course. | sroussey wrote: | Can anything be done on a mobile device yet? | Tenoke wrote: | Yes, there are a lot of models designed to work okay on | mobile. Though you'd typically train in the cloud and | only use the trained model on the phone. Alternatively, | you can train over many phones, which brings a lot of | extra challenges but is definitely possible. | | Google's very new Reformer[0] would likely be your best | bet if you want something truly cutting-edge but | have less compute, even as little as a mobile's. As far | as I know, it hasn't been used on phones yet (again, it's | very new) but I bet it can be done. | | 0. https://ai.googleblog.com/2020/01/reformer-efficient- | transfo... | sroussey wrote: | Interesting! Thank you for the link. | | I don't mind training on a desktop and using it on both | desktop and mobile.
We kinda already have that problem | since we parse Google data for a given Android phone, but | it doesn't have the memory or compute for the amount of | data the phone has generated over the years. The user | will background the app too quickly. So we need to ask | the desktop app to do it, process there, and sync results | back. | sroussey wrote: | Yeah, I don't want to upload. | | I would really like to have my app learn the user's speaking | style from their data and be able to write out diary entries | each day in their own "voice". | [deleted] | drcode wrote: | ...the moment where he jokes about "turning it on and off again" | and his GPT2 doppelganger laughs... | minimaxir wrote: | From anecdotal testing, using the 774M/1.5B GPT-2 models for | anything less than _hundreds of megabytes of input data_ will | result in _worse_ generation quality than using the smaller 124M | /355M models. | | The addiction to the larger GPT-2 models is IMO a trap. | Tenoke wrote: | It's definitely not the case for me. I have models trained on | the same dataset, which is 14MB (though I needed to tweak more | for the 1.5b). | | 1.5b outperforms it here if trained long enough - in this case | 1-2 months, as I was doing it all for free on Colab. | | One of the big things was batching - it seems like nobody | really tries to do larger batches with the biggest models, and | without batching, while having little data, the model was | getting stuck. | MasterScrat wrote: | You trained (finetuned) GPT2 for 1-2 months on 14MB of data? | | I don't understand how this doesn't massively overfit. How | long of these 1-2 months was the model actually training? | Tenoke wrote: | I trained for maybe ~12 hours a day; some days, especially | around Christmas, I didn't. I also lost a lot of days when | trying out different stuff or when the weights didn't save | to Drive before the Colab timed out.
| | Having said that, I was training the full model with an | accumulated batch size for a while, so it was taking > 10min | per step. I've also been using pretty low learning rates | for most of the latter stages. | | Overall the model is currently at ~11k steps and the loss | can actually go down further, but after playing with | different checkpoints last week, the best one didn't seem to | be the newest one, so I left it at that one. | marmuel wrote: | Yep, I can fully confirm. Best results are _by far_ with | smaller models (such as openai-gpt). | sillysaurusx wrote: | AI Dungeon was trained on 50MB. You might be overfitting. Don't | train for too long. You want to transfer knowledge, not replace | knowledge. | brokensegue wrote: | Are we talking about training from scratch or fine tuning? | sillysaurusx wrote: | Fine tuning. For training from scratch, you want a dataset | of at least 20GB gathered from all corners of the internet. | I think OpenAI used around 160GB. | | Even if you can't process that much data, merely having it | available forces the model to learn a diverse variety of | knowledge. | | The difficulty of training from scratch (and generating a | quality model) vs the difficulty of fine tuning is like the | difficulty of becoming fluent in emacs vs using notepad. | It's doable, but quality results take focused effort. | | It's fun! Definitely within reach of lots of people who | wouldn't normally consider themselves data scientists / ML | engineers. (I'm one of 'em.) | mycall wrote: | > predict the next word in 40GB of Internet text | | This could do wonders for lip reading correction. | britmob wrote: | OpenAI trained the initial 1.5B model on ~160GB of text, so I'm | sure it's already going to give amazing results. | siavosh wrote: | These tinkering use cases of GPT2 (including the dungeon game) are | amazing to see.
As the model improves, it makes me think of | essentially everyone having access to a conversational Einstein, | Lincoln, etc... instant friends/advisors from history. | kqr wrote: | This is like a personalised version of Oblique Strategies. | Exciting! | gambler wrote: | Can't wait until chat bots trained on someone's messages are used | as "evidence" of what that person thinks. It's blatantly obvious | that the crowd here would accept this as valid analysis if the | whole thing is peppered with appropriate buzzwords. | fapjacks wrote: | There have been a number of posts over the last few days like | this about giving (more) of your (sensitive) data to Google. Lots | of comments in the threads about exporting and uploading messages | from e.g. WhatsApp and Telegram, and a surprising lack of concern | about it. | fredley wrote: | Straight out of the _Black Mirror_ episode _Be Right Back_ [0], | which is 7 years old. | | [SPOILERS] In the episode, the main character uses a | service to reconstruct a chat bot (and eventually a lifelike | avatar) built from her dead partner's social media history. | Eventually she becomes frustrated by the lack of depth (since | it's only trained on social media data, it falls into a sort of | uncanny valley of comprehension and personality), but can't part | with it, confining it to the attic of her home. | | [0]: https://en.wikipedia.org/wiki/Be_Right_Back | unnouinceput wrote: | Quote: "The conversations aren't ideal ..." | | Hi Tenoke, you got it wrong. It will never be ideal, no matter | what. And I think the opposite: those examples are actually quite | ideal. You see yourself from a different perspective, in the same | way everybody reacts to hearing their own voice - you sound | different to yourself than what the people around you hear. You | just "heard" your AI, as crude as you think it is, for the 1st | time. Thank you for this; don't mind if I grab everything you did | and do it for myself as well. This is going to be fun!
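The "accumulated batch size" Tenoke describes upthread - each optimizer step taking >10min because gradients from many micro-batches are combined before one update - boils down to a short loop. A toy, dependency-free sketch; the one-weight least-squares model, learning rate, and step counts are all invented for illustration:

```python
# Toy gradient accumulation: sum gradients over several micro-batches,
# then take a single optimizer step with their average, simulating a
# large effective batch on hardware that can't hold one.
def grad(w, x, y):
    # d/dw of the squared error 0.5 * (w*x - y)**2 for a one-weight model
    return (w * x - y) * x

def train(data, accumulation_steps=4, lr=0.1, epochs=50):
    w = 0.0
    acc, n = 0.0, 0
    for _ in range(epochs):
        for x, y in data:
            acc += grad(w, x, y)   # accumulate instead of stepping
            n += 1
            if n == accumulation_steps:
                w -= lr * acc / n  # one update with the averaged gradient
                acc, n = 0.0, 0
    return w
```

On data drawn from y = 2x, train() converges to w close to 2 while only computing one example's gradient at a time: each step is slower, but the update reflects the whole accumulated batch.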
| mirimir wrote: | Hey, I just talk to myself ;) | | Sometimes I use different voices, for emphasis. | | I actually learned that in a course. The context was having a | completion conversation with someone who had died. But it works | in other contexts too. | blazespin wrote: | I would be curious to know, when we write, how much of it | is self-attention and how much of it is our forebrain actually | trying to make sense. My guess is that the more tired / rushed / | burned out you are, the more the % of self-attention increases. | | Sometimes watching the news, it seems like 90% of what they say | when they are 'vamping' is just self-attention. | | Has anyone posted any GPT / Hacker News generated text yet? | Wisdom of the crowds, indeed. It'd be interesting to post using | it with light editing, especially something that uses upvotes for | training. | | One of the things I was thinking about was training on your | favorite novel, so you could have a sort of conversation with it | / ask it questions. A kind of interactive CliffsNotes. However, | as I looked into it I realized it was still too much of a | Markov-chain-like thing to be functionally useful. Fun idea though. | | The real win, in all of this, of course is auto-completion in | different mediums. Code completion demos are pretty wild - | https://tabnine.com/blog/deep/ Come to think of it, you could | probably use it for writing academic papers as well, assuming you | know the content well. | | Self-attention and human/computer interaction is a very brave new | world. I don't think people really yet know the potential for | seismic shift here. | minimaxir wrote: | I have a very large dump of GPT-2 generated Hacker News titles | here: https://github.com/minimaxir/hacker-news-gpt-2 | | That's just with the smallest 124M model though; on short-form | content especially, I'm not convinced of the value of larger | models.
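The sampling parameters that come up in the thread - temperature, top_k, and top_p - are what turn a model's next-token scores into actual text. A minimal, dependency-free sketch of the standard filtering; the toy logits below are invented, and real implementations operate on large tensors rather than Python lists:

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample an index from `logits` after applying the usual knobs:
    temperature rescales the scores, top_k keeps only the k most
    likely tokens, and top_p (nucleus sampling) keeps the smallest
    set of tokens whose cumulative probability reaches p."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numeric stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]    # softmax
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k > 0:
        order = order[:top_k]            # keep the k highest-probability tokens
    if top_p < 1.0:
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:            # nucleus reached
                break
        order = kept
    total = sum(probs[i] for i in order) # renormalize over the survivors
    r = rng.random() * total
    for i in order:
        r -= probs[i]
        if r <= 0:
            return i
    return order[-1]
```

Low temperature or a small top_k makes output near-deterministic; loosening them trades coherence for variety, which fits the thread's observation that permissive settings can produce long tirades that eventually become nonsensical.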
| leod wrote: | I've trained a Transformer encoder-decoder model (this was | slightly before GPT2 came out) to generate HN comments from | titles. There is a demo running at https://hncynic.leod.org | rahimnathwani wrote: | This is cool. If you were to cache the results and generate a | unique URL for each, people could easily share the funniest | ones. | leod wrote: | Thanks! I actually planned to make results shareable at the | start, but, knowing the internet, I did not like the idea | of being held responsible for whatever content (say | offensive or even illegal things) people would put into the | titles. | CDSlice wrote: | It doesn't seem very accurate; there isn't close to enough | Electron hate whenever it is in the title. | | This is pure gold though: | | > How does one make a web app using a standard framework? | I've never used it, but it sounds like someone has been able | to put together something like a Web app with only one app. | | Edit: This is even better. | | > Rewriting a Linux kernel in Rust, by hand, is definitely | the right thing to do as a beginner/intermediate programmer. | leod wrote: | Ha! In the model's defense, its training data [1] ends in | 2017 -- not sure if hatred for Electron was as prevalent | back then. | | [1] https://archive.org/details/14566367HackerNewsCommentsA | ndSto... | jandrese wrote: | > Rewriting a Linux kernel in Rust, by hand, is definitely | the right thing to do as a beginner/intermediate | programmer. | | Absolute perfection. | nekopa wrote: | My favorite, even sounds like it would work: | | Title: My Emacs Productivity Tricks/Hacks | | hncynic 1 minute ago | | I used this for some time and never looked back. | | In my .emacs.d file, the arrow keys, a key with a cursor | keys (which are the key bindings for the .emacs.d file | above) and then a shortcut to switch to the command that | makes use of those. | | But I now have a full screen keyboard and mouse.
| | Here's another way to do it: | | M-x { C-c } | | You go in the current directory, move up the left arrow | key, press escape and hit the backspace key. | leblancfg wrote: | > Fun fact - there is a sentence in this post written entirely by | the GPT version of Me. I wonder how easy it is to spot. | | ...I couldn't spot it. Anyone? Eerie... | jerf wrote: | My best guess is "Additionally it sometimes talks about things | that aren't really True - like the back pain in Example 1, and | if you play with the different parameters (top_k/top_p and | temperature mainly) you can force it to go on long tirades | which eventually become nonsensical." | | True shouldn't be capitalized like that (influence from sample | Python code or another language that uses "True"?), and Example | 1 doesn't discuss back pain. I don't know enough about GPT or | whatever other possible models may be getting discussed to know | whether "top_k/top_p" make sense, though temperature would seem | to. | qnxub wrote: | Is this the best way to create a chatbot with my personality? I | feel like I would want to fine tune some things so it is giving | real responses about my preferences, hobbies, etc. | | My use case is preserving my personality for loved ones after I | die. ___________________________________________________________________ (page generated 2020-01-23 23:00 UTC)