[HN Gopher] Fast implementation of DeepMind's AlphaZero algorith... ___________________________________________________________________ Fast implementation of DeepMind's AlphaZero algorithm in Julia Author : metalwhale Score : 247 points Date : 2020-06-22 12:04 UTC (10 hours ago) | tromp wrote: | The implementation includes Connect Four as an example | application. While the standard board size of 7x6 is indeed | solved, as they note, and in fact all sizes up to 8x8 are [1], | they could have picked 9x8 or 9x9, which are currently unsolved. | The latter is the new standard size on Little Golem, which | upgraded from 8x8 when that was solved. | | [1] https://tromp.github.io/c4/c4.html | | [2] http://www.littlegolem.net/jsp/games/gamedetail.jsp?gtid=fir | | [3] | http://www.littlegolem.net/jsp/forum/topic2.jsp?forum=80&top... | jonath_laurent wrote: | I completely agree with you. Let me just add two remarks. | First, although picking 9x9 boards does indeed make connect | four intractable for brute-force search, I would be surprised | if it made things much more difficult for AlphaZero, which relies | on the generalization capabilities of the network anyway. | Second, using a solved game for the tutorial is a feature, not | a bug: it allows precise benchmarking of the resulting agent, | as the ground truth is known. | tromp wrote: | I did not see an evaluation of how close to perfection the | agent becomes. Did you compute any sort of error rate (by | finding moves that turn a won position into a non-won one or | a drawn position into a lost one)? And how does this error rate | drop over time as learning advances? That would indeed be | very interesting to see. | jonath_laurent wrote: | Such an evaluation is available in the tutorial: | https://jonathan-laurent.github.io/AlphaZero.jl/dev/tutorial...
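The error-rate evaluation tromp proposes above can be sketched as follows. This is a rough Python illustration, not AlphaZero.jl code; `solve` and `play` are hypothetical stand-ins for a perfect connect-four solver and the game rules:

```python
# Sketch of the error-rate benchmark discussed above: count agent moves
# that degrade a position's game-theoretic value (e.g. turn a won position
# into a drawn or lost one). `solve` and `play` are hypothetical stand-ins.

def move_error(state, move, solve, play):
    """Return True if `move` worsens the game-theoretic value of `state`.

    `solve(s)` is assumed to return +1 (win for the player to move),
    0 (draw) or -1 (loss), as a perfect solver would.
    """
    before = solve(state)
    # The solver reports the value from the opponent's side after our move,
    # so negate it to compare from our perspective.
    after = -solve(play(state, move))
    return after < before

def error_rate(agent, positions, solve, play):
    """Fraction of benchmark positions on which the agent blunders."""
    blunders = sum(move_error(s, agent(s), solve, play) for s in positions)
    return blunders / len(positions)
```

Tracking this rate across training checkpoints gives exactly the learning curve asked about above.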
| | Admittedly, the connect four agent is still far from | perfect, but there is a lot of room for improvement, as I | have done very little hyperparameter tuning so far. | vishvananda wrote: | My team did an implementation of AlphaZero connect four a | couple of years ago. Our findings are in a series of blog | posts starting at https://medium.com/oracledevs/lessons-from-implementing-alph.... | We didn't manage to get to a perfect policy either, but got pretty close. You can | play against some versions of the network here: | https://azfour.com | jonath_laurent wrote: | Your series of blog articles has been an important source | of inspiration in writing AlphaZero.jl, and I cite it | frequently in the documentation. Thanks to you and your | team! | dnautics wrote: | That's really cool and I didn't think of that. I just wanted | clarification: that means you train the agent _without_ the | deterministic solution, and your "validation/test" (I'm not | sure what those phases are called in unsupervised learning) | sets are done with the deterministic solution. | jonath_laurent wrote: | Yes, the agent is trained without access to the | deterministic solution. | mtgp1000 wrote: | I don't know anything about Julia... how hard would this be to | port to Python or a C-style language? | | Edit: I was mainly asking because I was curious about the | relative expressiveness of Julia... | ViralBShah wrote: | I was going through this project over the weekend. And while I | can't recall where exactly in the docs I read this, I am quite | sure the author mentioned that there are various Python | projects, but they are quite slow. Other implementations such as | Leela Chess Zero have a lot of C++ and are difficult to follow. | | In fact, one of the things we want to do is maximize the | performance of the Julia implementation. We hope to co-develop | the compiler and ML stack to address these issues as they come | up.
| newswasboring wrote: | I don't know if you saw it here, but a similar point is made | in the README section "Why should I care about this | implementation". [1] | | [1] https://github.com/jonathan-laurent/AlphaZero.jl#why-should-... | doublesCs wrote: | Truly, truly, thank you for your work <3 | newswasboring wrote: | Not sure if you are thanking jonath_laurent (original author | of the package under discussion) or ViralBShah (co-creator of | Julia). But I concur on both counts :D | doublesCs wrote: | Viral. I thanked Jonath in a different post :-P | | When I see projects like this (I mean especially Julia, | but also people sharing their work on packages like this), | I feel very fortunate that elements of the free software | movement are still alive. | adamnemecek wrote: | Why would you? Julia shines in this use case. | instance wrote: | There are already implementations out there in Python. [1] | | The point of this project is to be a very fast alternative to | those implementations while being more accessible than a C++ | implementation. | | [1] https://github.com/suragnair/alpha-zero-general | jonath_laurent wrote: | Author here: I am happy to answer any question you may have about | AlphaZero.jl. :-) | MaxBarraclough wrote: | Hi, thanks for this great project. | | Connect Four was used as a demonstration. I presume this is | because it's much easier/cheaper to train a Connect Four AI, | compared to Go? | jonath_laurent wrote: | Yes. Go 19x19 would be completely intractable on a single | machine (one comment cites a $25 million cost estimate in | computing power to train AlphaGo Zero). A more reasonable | target would be Go 9x9, but even this would be an extreme | challenge on a single machine. | | There is an Oracle blog article series about training a | close-to-perfect Connect Four player using AlphaZero. Even | here, they had to rely on multiple GPUs.
| | You have to keep in mind that AlphaZero is an extremely | sample-inefficient learning technique, even for simple | problems. Rather, the strengths of this algorithm are that 1) | it is pretty generic and 2) it can leverage huge amounts of | computation. | doublesCs wrote: | No questions. Just wanted to thank you for sharing. People like | you make the world better one tiny bit at a time. | jonath_laurent wrote: | Thanks for your kind message. | vishvananda wrote: | Nice work on this! I was behind the implementation at Oracle | which you referenced in the tutorial. I still keep tabs on the | lc0 crowd, which seems to be pushing into new ideas. Did you | pull anything else from the Leela crowd besides prior | temperature? It looks like you may also have tried a WLD output | head? | jonath_laurent wrote: | What do you mean by WLD output head? | | So far, the main idea I have pulled from the Lc0 crowd is | indeed the prior temperature. The next thing I am planning | to add is the possibility to batch inference requests across | game simulations instead of relying on asynchronous MCTS. In | your blog series, you anticipate the problem of the virtual | loss introducing some exploration bias in the search but | ultimately conclude that it does not change much: | | [Citation from your blog series]: "Technically, virtual loss | adds some degree of exploration to game playouts, as it | forces move selection down paths that MCTS may not naturally | be inclined to visit, but we never measured any detrimental | (or beneficial) effect due to its use." | | Interestingly, it seems that the Lc0 team had a different | experience here. I myself ran some tests, and going from 32 to | 4 workers (for 600 MCTS simulations per turn) on my connect-four | agent results in a significant increase in performance. | This may be due to the fact that I use a much smaller neural | network than yours, which is ultimately not as strong.
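The virtual-loss mechanism discussed above can be sketched roughly as follows, in Python for illustration. The node fields, the PUCT constant, and the function names are illustrative assumptions, not AlphaZero.jl's or Lc0's actual code:

```python
import math

# Minimal sketch of "virtual loss" in asynchronous MCTS. While one worker is
# descending the tree awaiting a (batched) network evaluation, it pretends
# its path lost, which steers concurrent workers toward other branches.

class Node:
    def __init__(self, prior=1.0):
        self.visits = 0
        self.total_value = 0.0
        self.virtual_losses = 0
        self.prior = prior
        self.children = {}

def uct_score(parent, child, c_puct=1.5):
    # Pending virtual losses count as visits with value -1 until reverted.
    n = child.visits + child.virtual_losses
    q = (child.total_value - child.virtual_losses) / n if n > 0 else 0.0
    u = c_puct * child.prior * math.sqrt(parent.visits + 1) / (1 + n)
    return q + u

def add_virtual_loss(path):
    # Applied to every node on the worker's selected path before evaluation.
    for node in path:
        node.virtual_losses += 1

def revert_virtual_loss(path, value):
    # Called once the network evaluation comes back: undo the fake loss
    # and record the real visit with the evaluated value.
    for node in path:
        node.virtual_losses -= 1
        node.visits += 1
        node.total_value += value
```

The more workers run concurrently, the more paths carry a pending virtual loss at any moment, which is the exploration bias the comment above refers to.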
| | Related to this, there is a question I have wanted to ask you | since I found your blog article series: did you run | experiments with smaller networks, and what were the results? | What is the smallest architecture you tried and how did it | perform? | vishvananda wrote: | The lc0 group has switched the result prediction to predict | win, loss, and draw probabilities instead of just win/loss. | Some information can be found in | https://lczero.org/blog/2020/04/wdl-head/ | vishvananda wrote: | We did a lot of our early experimentation with small | networks. I don't think we went any smaller than 5 layers | of 64 filters, as we mentioned here: | https://medium.com/oracledevs/lessons-from-alpha-zero-part-5... | jonath_laurent wrote: | And what were the results of these experiments? What | error rate can you reach with the smallest network | architecture you tried, for example? | vishvananda wrote: | Unfortunately I don't remember the exact numbers, but I | think it was a couple of percentage points worse than we | were able to get with the large models. | patagurbon wrote: | Do you have any thoughts about multi-GPU training? I haven't | seen many options for Flux previously, but didn't dig very | much. | jonath_laurent wrote: | Multi-GPU support is definitely on the TODO list. | However, I am currently limited by the state of CUDA.jl on | this, as it does not have a device-aware memory pool yet. | | I am also looking forward to CUDA.jl supporting f16 and int8 | computations, which may enable another big speedup. | tbenst wrote: | First of all, this is very cool. Dunno if the author is on here, but | I'm curious why both Flux and Knet are used rather than just one | of them (Flux seems the most Julianic?). | | Also, is this really faster than PyTorch/TF? Last time I | benchmarked Flux for non-trivial networks, the speed was quite | good with small models but memory usage was ~5x higher than | PyTorch, and I couldn't fit my models on the GPU with Flux.
For | large models, I had to compromise on batch size in Julia, | although maybe with Zygote.jl the memory issues have been | resolved? | jonath_laurent wrote: | Author here. AlphaZero.jl indeed supports both Flux and Knet, | and users can choose whichever framework they want to use. | | As far as I understand, Flux and Knet have different strengths. | I think Knet is a bit more stable and mature for large-scale | deep learning, but Flux shines for "scientific ML" use cases | where low AD overhead is crucial. | jonath_laurent wrote: | I suspect Flux/Knet are still slightly slower and less memory | efficient than PyTorch/TF, although things are moving very fast | here! | | This is not that relevant to understanding AlphaZero.jl's speed, | though. The reason it is much faster than Python | implementations is that tree search is also a bottleneck, | and Julia shines here! | tbenst wrote: | Ah, I hadn't appreciated this. Thanks for making & sharing | your code! | ViralBShah wrote: | While some issues may already be addressed and others are being | addressed, what would really help us is if people filed issues | when they don't find performance to be adequate. If you still have | the code handy, please do open some issues. | dunefox wrote: | > 5x higher than pytorch, and I couldn't fit my models on the | GPU for flux. For large models, I had to compromise on batch | size in Julia | | I had the exact same experience. While I like Julia and Flux, I | can't use it in this state for my models. | dklend122 wrote: | Would you mind opening corresponding issues on the repo? That | would help guide the ongoing compiler work. | metalwhale wrote: | Disclaimer: I'm not the author. Just want to share this awesome | project. | likeaj6 wrote: | This is awesome! I worked on a similar project in the past for | the game Hex. | | I did a writeup about it here: | https://notes.jasonljin.com/projects/2018/05/20/Training-Alp...
| | https://github.com/likeaj6/alphazero-hex | jonath_laurent wrote: | Actually, I came across your blog article when I was reading about | AlphaZero and found it useful! | FiberBundle wrote: | Does anybody know how long it would take to train an AlphaZero Go | version using one GPU? In [1] they claim that it took 13 hours | until the model was able to beat the original AlphaGo version, | but they don't state what hardware they used. | | [1] https://deepmind.com/blog/article/alphazero-shedding-new-lig... | wrsh07 wrote: | That was with one or more TPU pods, IIRC. | | https://cloud.google.com/tpu/docs/system-architecture | arijun wrote: | I can't find it now, but IIRC there was a blog post on HN about | a month ago that estimated their training costs at $25 million, | using many TPU pods. | cgreerrun wrote: | Here was the guesstimate: https://www.yuzeh.com/data/agz-cost.html | jonath_laurent wrote: | I agree with the quoted numbers. As I mentioned in another | comment, you have to keep in mind that AlphaZero is an | extremely sample-inefficient learning technique, even for | simple problems. However, it has two major strengths: 1) it is | pretty generic and 2) it can leverage huge amounts of computing | power. | newswasboring wrote: | From an offline chat with the original author: see the ELF | OpenGo paper[1], which describes an open implementation of | AlphaGo Zero developed by Facebook AI: | | "First, we train a superhuman model for ELF OpenGo. After | running our AlphaZero-style training software on 2,000 GPUs for | 9 days, our 20-block model has achieved superhuman performance | that is arguably comparable to the 20-block models described in | Silver et al. (2017) and Silver et al. (2018)."
| | [1]: https://arxiv.org/pdf/1902.04522.pdf | cgreerrun wrote: | I've been working on a Python implementation that uses Gradient | Boosted Decision Trees (LightGBM/Treelite) instead of a neural | network for the value/policy models: | | https://github.com/cgreer/alpha-zero-boosted | | It's mostly to understand how AlphaZero & friends work. I'm also | curious about how well a GBDT could do, and whether there are | self-play techniques that can accelerate training. | | The nice thing about a GBDT is that, unlike with an NN, you | can do thousands of value/policy lookups per second on a single | core. So it should be cheaper to scale self-play and run a lot of | self-play experiments (assuming what you learn from self-play | with the GBDT model transfers to the more powerful NN in these | environments). | | If you're curious about accelerating self-play training, check | out David Wu's work (https://arxiv.org/pdf/1902.10565.pdf). He's | the creator of KataGo. I implemented his "Playout Cap | Randomization" technique in my implementation above and, sure | enough, it's much more efficient: https://imgur.com/a/epaKtDY. It | seems like it's still early days in terms of how efficient self-play | training can be. | jorgemf wrote: | How good is your AI so far? | psandersen wrote: | This is really interesting, thanks for sharing! | | I've been thinking about extensions to decision tree models | that could get the benefits of NNs, and it seems like there are | a few ideas floating around. | | For example, Probabilistic Random Forests have some really | interesting properties for noisy datasets, e.g. "The PRF | accuracy decreased by less than 5% for a dataset with as many | as 45% misclassified objects, compared to a clean dataset." - | https://arxiv.org/abs/1811.05994 | | PRFs might be a natural fit for RL, especially for methods using | Monte Carlo tree search.
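A toy Python sketch of the probabilistic-split idea behind PRFs: replace the hard test (x < threshold) with a logistic routing probability, so the tree's output becomes a smooth function of the threshold. The function names and the temperature knob are illustrative, not the PRF paper's formulation:

```python
import math

# Toy illustration of a "soft split": a sample is routed left with a
# logistic probability instead of a hard threshold, so routing varies
# smoothly with the threshold and could in principle be trained by
# gradient descent. A sketch, not the PRF paper's actual method.

def soft_split(x, threshold, temperature=1.0):
    """Probability of routing a sample with feature value x to the left."""
    return 1.0 / (1.0 + math.exp((x - threshold) / temperature))

def soft_tree_value(x, threshold, left_value, right_value, temperature=1.0):
    """Expected prediction of a depth-1 soft tree (a 'stump')."""
    p_left = soft_split(x, threshold, temperature)
    return p_left * left_value + (1 - p_left) * right_value
```

As the temperature goes to zero, the soft split recovers the hard split x < threshold.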
| | Speculating here, as I'm not adequately familiar with stochastic | calculus, but intuitively it seems like probabilistic decision | trees could be made differentiable, since the hard decision | threshold in a tree could be turned continuous (i.e., every | split is a logistic regression), which might enable some really | interesting applications. I personally dream of being able to | cleanly integrate decision trees and tools from NNs in | something like Pyro for a fully Bayesian model. | credit_guy wrote: | There appears to be a LogitBoost tree that does what you say, | if I understand you correctly. | | [1] https://en.wikipedia.org/wiki/LogitBoost | psandersen wrote: | Thanks for the reference. From a quick read, it seems | LogitBoost is a booster for ensembling models under a | logistic loss. | | I meant that the splitting point of a node in the decision | trees that make up a random forest is itself a random | variable that follows a distribution. Because it's a smooth | function (i.e., the probability of splitting is 50% at that | point and rises/falls smoothly), it should in principle be | differentiable*, so that the whole model can be trained by | SGD and/or fit into an end-to-end learning pipeline with | convolutional layers, etc. | | *Where I'm hazy is how a smooth probability function | can be differentiable when sampling. I'm brainstorming in the | open here; will do some reading on stochastic neural | networks. | algo_trader wrote: | It's great when you can indeed iterate without 100s of GPU-hours. | | Are there any papers/comparisons/tradeoffs on when GBDT | predictive power plateaus compared to an NN? | | EDIT: with self-play you can trade off a CPU budget between | the GBDT depth, the NN depth, and the rollout depth, which is | super interesting. | jonath_laurent wrote: | This is very interesting! If your experiments work out, I would | be interested in adding "Gradient Boosted Decision Trees" | support to AlphaZero.jl.
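For reference, the "Playout Cap Randomization" technique cgreerrun mentions above can be sketched like this in Python. `run_mcts`, the budget sizes, and the 25% probability are hypothetical placeholders; see David Wu's paper for the actual scheme:

```python
import random

# Sketch of Playout Cap Randomization: only a randomly chosen fraction of
# moves get the full MCTS budget and contribute policy training targets;
# the rest use a cheap search, making self-play games much faster.
# `run_mcts` is a hypothetical search function, not from any real library.

def choose_move(state, run_mcts, full_budget=600, cheap_budget=100,
                p_full=0.25, rng=random):
    """Return (move, record_for_training) under playout cap randomization."""
    if rng.random() < p_full:
        # Full-budget search: its visit distribution makes a good
        # policy target, so this move is recorded for training.
        return run_mcts(state, simulations=full_budget), True
    # Cheap search: play the move but don't use it as a training target.
    return run_mcts(state, simulations=cheap_budget), False
```

The win it buys: value targets still come from every game's outcome, while the expensive full searches that produce policy targets run only on a fraction of the moves.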
| | I saw the work on KataGo and implementing "Playout Cap | Randomization" is indeed on my TODO list. ___________________________________________________________________ (page generated 2020-06-22 23:00 UTC)