[HN Gopher] Fast implementation of DeepMind's AlphaZero algorith...
       ___________________________________________________________________
        
       Fast implementation of DeepMind's AlphaZero algorithm in Julia
        
       Author : metalwhale
       Score  : 247 points
       Date   : 2020-06-22 12:04 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | tromp wrote:
       | The implementation includes Connect Four as an example
       | application. While the standard board size of 7x6 is indeed
       | solved, as they note, and in fact all sizes up to 8x8 are [1],
       | they could have picked 9x8 or 9x9 which are currently unsolved.
       | The latter is the new standard size on Little Golem which
       | upgraded from 8x8 when that was solved.
       | 
       | [1] https://tromp.github.io/c4/c4.html
       | 
        | [2] http://www.littlegolem.net/jsp/games/gamedetail.jsp?gtid=fir
       | 
       | [3]
       | http://www.littlegolem.net/jsp/forum/topic2.jsp?forum=80&top...
        
         | jonath_laurent wrote:
          | I completely agree with you. Let me just add two remarks.
          | First, although picking a 9x9 board does make Connect Four
          | intractable for brute-force search, I would be surprised if
          | it made the game much more difficult for AlphaZero, which
          | relies on the generalization capabilities of the network
          | anyway. Second, using a solved game for the tutorial is a
          | feature, not a bug: it allows precise benchmarking of the
          | resulting agent, as a ground truth is known.
        
           | tromp wrote:
            | I did not see an evaluation of how close to perfection the
            | agent gets. Did you compute any sort of error rate (by
            | finding moves that turn a won position into a non-won one,
            | or a drawn position into a lost one)? And did you measure
            | how this error rate drops as learning advances? That would
            | be very interesting to see.
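The metric tromp proposes can be sketched as follows. This is illustrative Python, not part of AlphaZero.jl; `solver` stands in for a hypothetical perfect-play oracle (e.g. a Connect Four tablebase) and `position.play` for a hypothetical game API.

```python
# Sketch of the proposed error-rate evaluation: count moves that worsen
# the game-theoretic value of a position. `solver(position)` is assumed
# to return +1 (win), 0 (draw), or -1 (loss) for the player to move.

def move_error(solver, position, move):
    """True if `move` worsens the position's game-theoretic value."""
    before = solver(position)             # value for the side to move
    after = -solver(position.play(move))  # value flips for the opponent
    return after < before                 # e.g. won -> drawn/lost, drawn -> lost

def error_rate(solver, agent_move, positions):
    """Fraction of positions on which the agent's chosen move is an error."""
    errors = sum(move_error(solver, p, agent_move(p)) for p in positions)
    return errors / len(positions)
```

Plotting this rate over training iterations would show how quickly the agent approaches perfect play.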
        
             | jonath_laurent wrote:
             | Such an evaluation is available in the tutorial:
             | https://jonathan-
             | laurent.github.io/AlphaZero.jl/dev/tutorial...
             | 
              | Admittedly, the connect four agent is still far from
              | perfect, but there is a lot of room for improvement, as I
              | have done very little hyperparameter tuning so far.
        
             | vishvananda wrote:
              | My team did an implementation of AlphaZero Connect Four
              | a couple of years ago. Our findings are in a series of blog
             | posts starting at https://medium.com/oracledevs/lessons-
             | from-implementing-alph.... We didn't manage to get to
             | perfection either on policy, but got pretty close. You can
             | play against some versions of the network here:
             | https://azfour.com
        
               | jonath_laurent wrote:
               | Your series of blog articles has been an important source
               | of inspiration in writing AlphaZero.jl and I cite it
               | frequently in the documentation. Thanks to you and your
               | team!
        
           | dnautics wrote:
            | That's really cool, and I hadn't thought of that. I just
            | wanted clarification: that means you train the agent
            | _without_ the deterministic solution, and your
            | "validation/test" evaluation (I'm not sure what those
            | phases are called in unsupervised learning) is done
            | against the deterministic solution?
        
             | jonath_laurent wrote:
             | Yes, the agent is trained without access to the
             | deterministic solution.
        
       | mtgp1000 wrote:
        | I don't know anything about Julia... How hard would this be to
        | port to Python or a C-style language?
       | 
        | Edit: I was mainly asking because I was curious about the
        | relative expressiveness of Julia...
        
         | ViralBShah wrote:
          | I was going through this project over the weekend. While I
          | can't recall exactly where in the docs I read this, I am
          | quite sure the author mentioned that there are various
          | Python projects, but they are quite slow. Other
          | implementations, such as Leela Chess Zero, have a lot of C++
          | and are difficult to follow.
         | 
         | In fact, one of the things we want to do is maximize the
         | performance of the Julia implementation. We hope to co-develop
         | the compiler and ML stack to address these issues as they come
         | up.
        
           | newswasboring wrote:
           | I don't know if you saw it here, but a similar point is made
           | in the readme section "Why should I care about this
           | implementation".[1]
           | 
           | https://github.com/jonathan-laurent/AlphaZero.jl#why-
           | should-...
        
           | doublesCs wrote:
           | Truly truly thank you for your work <3
        
             | newswasboring wrote:
                | Not sure if you are thanking jonath_laurent (the
                | original author of the package under discussion) or
                | ViralBShah (co-creator of Julia). But I concur on both
                | counts :D
        
               | doublesCs wrote:
               | Viral. I thanked Jonath in a different post :-P
               | 
               | When I see projects like this (I mean especially Julia,
               | but also people sharing their work on packages like this)
               | I feel very fortunate that elements of the free software
               | movement are still alive.
        
         | adamnemecek wrote:
         | Why would you? Julia shines in this use case.
        
         | instance wrote:
         | There are already implementations out there in Python. [1]
         | 
         | The point of that project is to be a very fast alternative to
         | those implementations while being more accessible than a C++
         | implementation.
         | 
         | [1] https://github.com/suragnair/alpha-zero-general
        
       | jonath_laurent wrote:
       | Author here: I am happy to answer any question you may have about
       | AlphaZero.jl. :-)
        
         | MaxBarraclough wrote:
         | Hi, thanks for this great project.
         | 
         | Connect Four was used as a demonstration. I presume this is
         | because it's much easier/cheaper to train a Connect Four AI,
         | compared to Go?
        
           | jonath_laurent wrote:
            | Yes. Go 19x19 would be completely intractable on a single
            | machine (one comment cites a $25 million estimate for the
            | computing power needed to train AlphaGo Zero). A more
            | reasonable target would be Go 9x9, but even this would be
            | an extreme challenge on a single machine.
           | 
           | There is an Oracle blog article series about training a
           | close-to-perfect Connect Four player using AlphaZero. Even
           | here, they had to rely on multiple GPUs.
           | 
            | You have to keep in mind that AlphaZero is an extremely
            | sample-inefficient learning technique, even for simple
            | problems. Rather, the strengths of this algorithm are that
            | 1) it is pretty generic and 2) it can leverage huge
            | amounts of computation.
        
         | doublesCs wrote:
         | No questions. Just wanted to thank you for sharing. People like
         | you make the world better one tiny bit at a time.
        
           | jonath_laurent wrote:
           | Thanks for your kind message.
        
         | vishvananda wrote:
          | Nice work on this! I was behind the implementation at Oracle
          | which you referenced in the tutorial. I still keep tabs on
          | the lc0 crowd, which seems to be pushing into new ideas. Did
          | you pull anything else from the Leela crowd besides prior
          | temperature? It looks like maybe you also tried a WLD output
          | head?
        
           | jonath_laurent wrote:
           | What do you mean by WLD output head?
           | 
            | So far, the main idea I have pulled from the Lc0 crowd is
            | indeed to have a prior temperature. The next thing I am
            | planning to add is the possibility to batch inference
            | requests across game simulations instead of relying on
            | asynchronous MCTS. In your blog series, you anticipate the
            | problem of the virtual loss introducing some exploration
            | bias in the search but ultimately conclude that it does
            | not change much:
            | 
            | [Citation from your blog series]: "Technically, virtual loss
            | adds some degree of exploration to game playouts, as it
            | forces move selection down paths that MCTS may not naturally
            | be inclined to visit, but we never measured any detrimental
            | (or beneficial) effect due to its use."
            | 
            | Interestingly, it seems that the Lc0 team had a different
            | experience here. I myself ran some tests, and going from 32
            | to 4 workers (for 600 MCTS simulations per turn) on my
            | connect-four agent results in a significant increase in
            | performance. This may be due to the fact that I use a much
            | smaller neural network than yours, which is ultimately not
            | as strong.
           | 
           | Related to this, there is a question I have wanted to ask you
           | since I found your blog article series: did you make
           | experiments with smaller networks and what were the results?
           | What is the smallest architecture you tried and how did it
           | perform?
        
             | vishvananda wrote:
             | The lc0 group has switched the result prediction to predict
             | win, loss, and draw probabilities instead of just win/loss.
             | Some information can be found in
             | https://lczero.org/blog/2020/04/wdl-head/
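A minimal sketch of the WDL idea in Python (illustrative only: in Lc0 the head is a network layer producing three logits; this just shows how win/draw/loss probabilities collapse back to the usual scalar value used by the search):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def wdl_to_value(logits):
    """Collapse (win, draw, loss) logits to the usual scalar value in [-1, 1]."""
    p_win, p_draw, p_loss = softmax(logits)
    return p_win - p_loss  # a draw contributes 0 to the expectation
```

The benefit is that the search can distinguish "certain draw" from "equal chances of winning and losing", which a single scalar cannot express.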
        
             | vishvananda wrote:
              | We did a lot of our early experimentation with small
              | networks. I don't think we went any smaller than 5 layers
              | of 64 filters, as we mentioned here:
             | https://medium.com/oracledevs/lessons-from-alpha-zero-
             | part-5...
        
               | jonath_laurent wrote:
                | And what were the results of these experiments? For
                | example, what error rate could you reach with the
                | smallest network architecture you tried?
        
               | vishvananda wrote:
               | Unfortunately I don't remember the exact numbers, but I
               | think it was a couple percentage points worse than we
               | were able to get with the large models.
        
         | patagurbon wrote:
         | Do you have any thoughts about multi GPU training? I haven't
         | seen many options for Flux previously, but didn't dig very
         | much.
        
           | jonath_laurent wrote:
            | Multi-GPU support is definitely on the TODO list. However,
            | I am currently limited by the state of CUDA.jl here, as it
            | does not have a device-aware memory pool yet.
            | 
            | I am also looking forward to CUDA.jl supporting f16 and int8
            | computations, which may enable another big speedup.
        
       | tbenst wrote:
        | First of all, this is very cool. Dunno if the author is on
        | here, but I'm curious why both Flux and Knet are used rather
        | than just one of them (Flux seems the most Julianic?).
       | 
       | Also, is this really faster than PyTorch/TF? Last time I
       | benchmarked Flux for non-trivial networks, the speed was quite
       | good with small models but memory usage was ~5x higher than
       | pytorch, and I couldn't fit my models on the GPU for flux. For
       | large models, I had to compromise on batch size in Julia,
       | although maybe with Zygote.jl the memory issues have been
       | resolved?
        
         | jonath_laurent wrote:
          | Author here. AlphaZero.jl indeed supports both Flux and
          | Knet, and users can choose whichever framework they want to
          | use.
          | 
          | As far as I understand, Flux and Knet have different
          | strengths. I think Knet is a bit more stable and mature for
          | large-scale deep learning, but Flux shines for
          | "scientific ML" use cases where low AD overhead is crucial.
        
         | jonath_laurent wrote:
          | I suspect Flux/Knet are still slightly slower and less
          | memory efficient than PyTorch/TF, although things are moving
          | very fast here!
          | 
          | This is not relevant to understanding AlphaZero.jl's speed,
          | though. The reason it is much faster than Python
          | implementations is that tree search is also a bottleneck,
          | and Julia shines here!
        
           | tbenst wrote:
           | Ah, I hadn't appreciated this. Thanks for making & sharing
           | your code!
        
         | ViralBShah wrote:
          | While some of these issues may already be addressed and
          | others are being addressed, what would really help us is if
          | people filed issues when they find performance to be
          | inadequate. If you still have the code handy, please do open
          | some issues.
        
         | dunefox wrote:
         | > 5x higher than pytorch, and I couldn't fit my models on the
         | GPU for flux. For large models, I had to compromise on batch
         | size in Julia
         | 
         | I had the exact same experience. While I like Julia and Flux I
         | can't use it in this state for my models.
        
           | dklend122 wrote:
           | Would you mind opening corresponding issues on the repo? That
           | would help guide the ongoing compiler work.
        
       | metalwhale wrote:
       | Disclaimer: I'm not the author. Just want to share this awesome
       | project.
        
       | likeaj6 wrote:
       | This is awesome! I worked on a similar project in the past for
       | the game Hex
       | 
       | Did a writeup here about it:
       | https://notes.jasonljin.com/projects/2018/05/20/Training-Alp...
       | 
       | https://github.com/likeaj6/alphazero-hex
        
         | jonath_laurent wrote:
          | Actually, I came across your blog article when I was reading
          | about AlphaZero, and I found it useful!
        
       | FiberBundle wrote:
        | Does anybody know how long it would take to train an AlphaZero
        | Go version using one GPU? In [1] they claim that it took 13
        | hours until the model was able to beat the original AlphaGo
        | version, but they don't state what hardware they used.
       | 
       | [1] https://deepmind.com/blog/article/alphazero-shedding-new-
       | lig...
        
         | wrsh07 wrote:
          | That was with at least one TPU pod, iirc.
         | 
         | https://cloud.google.com/tpu/docs/system-architecture
        
         | arijun wrote:
         | I can't find it now but iirc there was a blog post on HN about
         | a month ago that estimated their training costs at $25 million,
         | using many TPU pods.
        
           | cgreerrun wrote:
            | Here was the guesstimate: https://www.yuzeh.com/data/agz-
           | cost.html
        
         | jonath_laurent wrote:
         | I agree with the quoted numbers. As I mentioned in another
         | comment, you have to keep in mind that AlphaZero is an
         | extremely sample-inefficient learning technique, even for
         | simple problems. However, it has two major strengths: 1) it is
         | pretty generic and 2) it can leverage huge amounts of computing
         | power.
        
         | newswasboring wrote:
          | From an offline chat with the original author: the ELF
          | OpenGo paper [1], which describes an open implementation of
          | AlphaGo Zero developed by Facebook AI, reports:
         | 
         | "First, we train a superhuman model for ELF OpenGo. Af-ter
         | running our AlphaZero-style training software on 2,000GPUs for
         | 9 days, our 20-block model has achieved super-human performance
         | that is arguably comparable to the 20-block models described in
         | Silver et al. (2017) and Silveret al. (2018)."
         | 
         | [1]: https://arxiv.org/pdf/1902.04522.pdf
        
       | cgreerrun wrote:
       | I've been working on a Python implementation that uses Gradient
       | Boosted Decision Trees (LightGBM/Treelite) instead of using a
       | neural network for the value/policy models:
       | 
       | https://github.com/cgreer/alpha-zero-boosted
       | 
        | It's mostly to understand how AlphaZero & friends work. I'm
        | also curious about how well a GBDT can do, and whether there
        | are self-play techniques that can accelerate training.
        | 
        | The nice thing about a GBDT is that, unlike with a NN, you can
        | do thousands of value/policy lookups per second on a single
        | core. So it should be cheaper to scale self-play and run a lot
        | of self-play experiments (assuming the lessons learned from
        | self-play with the GBDT model transfer to the more-powerful NN
        | in these environments).
       | 
       | If you're curious about accelerating self-play training, check
       | out David Wu's work (https://arxiv.org/pdf/1902.10565.pdf). He's
       | the creator of KataGo. I implemented his "Playout Cap
       | Randomization" technique in my implementation above and, sure
       | enough, it's much more efficient: https://imgur.com/a/epaKtDY. It
       | seems like it's still early days in terms of how efficient self-
       | play training is.
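The "Playout Cap Randomization" idea mentioned above can be sketched as follows. This is an illustrative Python sketch of the scheme as described in the comment (and in Wu's KataGo paper), not code from alpha-zero-boosted; `run_mcts` and the constants are hypothetical stand-ins.

```python
import random

# Most self-play moves use a cheap, capped search (good enough to play
# games out quickly), while a random fraction get a full-strength search,
# and only those positions are recorded as policy-training targets.

FULL_PLAYOUTS = 600   # full-strength search budget (hypothetical values)
FAST_PLAYOUTS = 100   # capped search budget
FULL_PROB = 0.25      # fraction of moves that get the full search

def self_play_move(position, run_mcts, rng=random):
    """Choose a move during self-play; flag whether to record it for training."""
    if rng.random() < FULL_PROB:
        move, visit_dist = run_mcts(position, playouts=FULL_PLAYOUTS)
        record_for_training = True   # full searches give clean policy targets
    else:
        move, visit_dist = run_mcts(position, playouts=FAST_PLAYOUTS)
        record_for_training = False  # capped searches are too noisy to train on
    return move, visit_dist, record_for_training
```

The efficiency gain comes from decoupling the two jobs of self-play: generating many game outcomes (cheap searches suffice) and generating high-quality policy targets (only a sample of moves needs the expensive search).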
        
         | jorgemf wrote:
         | how good is your AI so far?
        
         | psandersen wrote:
         | This is really interesting, thanks for sharing!
         | 
         | I've been thinking about extensions to decision tree models
         | that could get the benefits of NNs and it seems like there are
         | a few ideas floating around.
         | 
            | For example, Probabilistic Random Forests have some really
         | interesting properties for noisy datasets, e.g. "The PRF
            | accuracy decreased by less than 5% for a dataset with as many
         | as 45% misclassified objects, compared to a clean dataset." -
         | https://arxiv.org/abs/1811.05994
         | 
            | PRFs might be a natural fit for RL, especially methods
            | using Monte Carlo tree search.
         | 
            | Speculating here, as I'm not adequately familiar with
            | stochastic calculus, but intuitively it seems like
            | probabilistic decision trees could be made differentiable,
            | since the hard decision threshold in a tree could be made
            | continuous (i.e. every split is a logistic regression),
            | which might enable some really interesting applications. I
            | personally dream of being able to cleanly integrate
            | decision trees and tools from NNs in something like Pyro
            | for a fully Bayesian model.
        
           | credit_guy wrote:
           | There appears to be a LogitBoost tree that does what you say,
           | if I understand you correctly.
           | 
           | [1] https://en.wikipedia.org/wiki/LogitBoost
        
             | psandersen wrote:
              | Thanks for the reference. From a quick read, it seems
              | LogitBoost is a booster for ensembling models under a
              | logistic loss.
              | 
              | I meant that the splitting point at a node in the
              | decision trees that make up a random forest is itself a
              | random variable that follows a distribution. Because it's
              | a smooth function (i.e. the probability of splitting is
              | 50% at the point and rises/falls smoothly), it should in
              | principle be differentiable*, so that the whole model can
              | be trained by SGD and/or fit into an end-to-end learning
              | pipeline with convolutional layers etc.
             | 
              | *Where I'm hazy is how a smooth probability function can
              | be differentiable when sampling. I'm brainstorming in
              | the open here; I will do some reading on stochastic
              | neural networks.
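The "soft split" being described can be sketched as follows. This is an illustrative Python sketch, not a full soft-decision-tree implementation: a node's hard threshold test is replaced by a logistic routing probability, making the routing (and hence the tree's output) a smooth function of both the input and the threshold.

```python
import math

def hard_route(x, threshold):
    """Classic tree split: a non-differentiable step function."""
    return 1.0 if x > threshold else 0.0

def soft_route(x, threshold, temperature=1.0):
    """Probability of taking the right branch: a sigmoid of the margin."""
    return 1.0 / (1.0 + math.exp(-(x - threshold) / temperature))

def soft_tree_value(x, threshold, left_value, right_value, temperature=1.0):
    """Expected leaf value under probabilistic routing.

    Smooth in both x and threshold, so gradients with respect to the
    threshold exist everywhere, unlike with hard_route.
    """
    p = soft_route(x, threshold, temperature)
    return p * right_value + (1.0 - p) * left_value
```

As the temperature goes to zero, `soft_route` approaches the hard step, so the usual tree is recovered as a limit; at higher temperatures the model behaves like a stack of logistic regressions, which is what makes SGD training conceivable.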
        
         | algo_trader wrote:
          | It's great when you can indeed iterate without hundreds of
          | GPU-hours.
         | 
         | Are there any papers/comparisons/tradeoffs on when GBDT
         | predictive-power plateaus compared to a NN?
         | 
          | EDIT: with self-play you can trade off a CPU budget between
          | the GBDT depth, the NN depth, and the rollout depth, which
          | is super interesting.
        
         | jonath_laurent wrote:
         | This is very interesting! If your experiments work out, I would
         | be interested in adding "Gradient Boosted Decision Trees"
         | support to AlphaZero.jl.
         | 
         | I saw the work on KataGo and implementing "Playout Cap
         | Randomization" is indeed on my TODO list.
        
       ___________________________________________________________________
       (page generated 2020-06-22 23:00 UTC)