[HN Gopher] Mastering Real-Time Strategy Games with Deep RL: Mer...
       ___________________________________________________________________
        
       Mastering Real-Time Strategy Games with Deep RL: Mere Mortal
       Edition
        
       Author : cwinter
       Score  : 84 points
       Date   : 2021-03-24 15:05 UTC (7 hours ago)
        
 (HTM) web link (clemenswinter.com)
 (TXT) w3m dump (clemenswinter.com)
        
       | FartyMcFarter wrote:
       | > This trend has culminated in the defeat of top human players in
       | the complex real-time strategy (RTS) games of DoTA 2 [1] and
       | StarCraft II [2] in 2019.
       | 
       | Not quite:
       | 
       | - OpenAI's DoTA 2 system wasn't playing the full game. I think
       | the final version could play 17 of the 117 heroes, and the
       | opposing human players were also restricted to playing this
       | subset of the game.
       | 
        | - DeepMind's StarCraft II system reached a level "above 99.8%
        | of officially ranked human players", so it isn't trivial to
        | argue that this amounts to defeating top players.
        
         | kevinwang wrote:
          | > OpenAI's DoTA 2 system wasn't playing the full game. I think
         | the final version could play 17 of the 117 heroes, and the
         | opposing human players were also restricted to playing this
         | subset of the game.
         | 
         | The bigger issue in my eyes was that while OpenAI 5 defeated
         | the world champion team OG, when they let anyone in the world
         | fight it, some ingenious players figured out a pretty robust
          | method to consistently exploit and defeat the bot. As I haven't
          | heard any buzz about OpenAI 5 since then, I think it was more
          | or less unsuccessful, unless they can show that their training
          | method produces unexploitable bots (instead of bots that are
          | really good against certain strategies).
        
           | Gunax wrote:
            | They can train the bots based on those games though, right?
            | Seems more like a flaw in the training data than in the
            | principle.
           | 
            | I am not sure whether the training is done live--that is,
            | does the algorithm learn from each game against a real,
            | live player? Or do they just train the model offline, then
            | allow players to play against the static model?
        
             | kevinwang wrote:
              | > They can train the bots based on those games though,
              | right? Seems more like a flaw in the training data than in
              | the principle.
             | 
             | I guess you could phrase it that way, but that's
             | essentially the problem statement for developing a strategy
             | for an imperfect-information game. So I would say it is a
             | flaw in the principle if their final output is exploitable.
        
             | CyberRage wrote:
              | Training requires millions of games; playing against humans
              | is only for evaluation purposes, not for training.
              | 
              | In both cases it was indeed a static model, but more recent
              | work called MuZero also plans at decision time with a
              | learned model and achieves great results in board games and
              | Atari.
        
             | ZephyrBlu wrote:
             | Assuming the OpenAI model is similar to DeepMind's
             | AlphaStar model, the model is static.
             | 
              | And a few games of being exploited are nowhere near enough
              | data for the AI to be re-trained.
        
             | jrumbut wrote:
              | Sibling posters have pointed out technical issues, but I'd
              | like to point out that while chasing after each successive
              | exploit might let them stay on top of the StarCraft
              | rankings, it changes the accomplishment from "we made a
              | model that understands StarCraft better than almost any
              | human" to "we made a model that memorized a selection of
              | very good strategies and tactics."
             | 
             | When it was first demonstrated, it really looked like it
             | was doing very smart things (while also taking advantage of
             | the fact that it doesn't have attention lapses and hand
             | fatigue) and reacting well to different strategies on a
             | level that was freaky to me.
        
         | taberiand wrote:
         | I think this is kind of pedantic. They built an AI agent to
         | take pixel data as input, and provide mouse movements and
         | clicks as output, and rather than just flail around like a baby
          | it actually played the games with sophisticated competence.
         | This to me is such an incredible achievement that I have no
         | doubt that it could be enhanced to defeat top players
         | consistently and easily.
         | 
          | As another commenter remarks, there are holes to plug in terms
          | of exploitable behaviours that are locked into the model, but
          | I'm confident they will find a general method of preventing
          | this too; on the other hand, it's not like humans aren't
          | susceptible to similar exploits by competitors in situations
          | where they decide to cease innovating/learning.
        
           | TaupeRanger wrote:
            | Not at all. It is a computer; of course it beats humans at
            | optimization problems and speed. Not remotely surprising or
            | interesting, any more than a calculator doing arithmetic
            | faster than a human.
        
           | ZephyrBlu wrote:
           | > _As another commenter remarks, there are holes to plug in
           | terms of exploitable behaviours that are locked into the
            | model, but this too I'm confident they will find a general
           | method of preventing_
           | 
           | The problem is, I don't think there is a "general method of
           | [prevention]" because that's not how neural networks work.
           | 
            | It's not easy to fix things like this because you can't just
            | say "yeah, just don't do that dumb thing anymore"; the
            | network has to be re-trained to learn to counter the exploit.
           | 
           | The way DeepMind tried to get around this is by having a
           | league of AIs playing against each other which try to exploit
           | each other and expose their weaknesses. It worked pretty damn
           | well, but people still found ways to exploit the AI.
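            | 
            | (For the curious, a minimal sketch of that league idea --
            | not DeepMind's actual code, just the mechanism: a pool of
            | frozen past snapshots plus dedicated exploiters that train
            | only against the current main agent.)
            | 
            |     # Hedged sketch of league-style self-play, not
            |     # DeepMind's implementation. "Agents" are any
            |     # trainable policy objects.
            |     import copy
            |     import random
            | 
            |     class League:
            |         def __init__(self, main_agent, exploiters):
            |             self.main = main_agent
            |             # Exploiters train only vs. the main agent,
            |             # hunting for its weaknesses.
            |             self.exploiters = exploiters
            |             # Frozen snapshots of past main agents.
            |             self.frozen = [copy.deepcopy(main_agent)]
            | 
            |         def opponent_for_main(self):
            |             # Mix past selves (robustness to old tricks)
            |             # with exploiters (pressure on current flaws).
            |             return random.choice(self.frozen +
            |                                  self.exploiters)
            | 
            |         def snapshot(self):
            |             # Freeze the current main agent so later
            |             # versions must keep beating everything
            |             # earlier versions could do.
            |             self.frozen.append(copy.deepcopy(self.main))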
        
         | wnevets wrote:
         | > OpenAI's DoTA 2 system wasn't playing the full game. I think
         | the final version could play 17 of the 117 heroes
         | 
          | Limiting the number of playable heroes in DoTA2 really isn't
          | important when it comes to evaluating the skill of the AI. Most
          | real players trying hard to win already play with a limited
          | hero pool dictated by the current patch version.
        
           | alach11 wrote:
            | It is important. They removed a lot of heroes with
            | complicated mechanics that could have been much harder for
            | the AI to play against.
        
             | wnevets wrote:
              | As someone who has played HoN & DoTA2 for over a decade,
              | I'm telling you it isn't important _when_ evaluating the
              | ability of an AI to actually play the game.
             | 
              | Drafting can be massive in deciding the outcome of games,
              | even at lower skill levels. Opening up the entire hero
              | pool just means you're largely evaluating the ability to
              | draft in the current patch rather than actual playing
              | ability.
        
         | porphyra wrote:
          | At BlizzCon 2019, AlphaStar beat Serral, undeniably a top
          | player (although he didn't get to use his own keyboard and
          | settings or get to prepare). Serral was able to beat the Terran
          | agent though.
         | 
         | https://www.youtube.com/watch?v=nbiVbd_CEIA
        
           | ZephyrBlu wrote:
            | People also cheesed the shit out of the bot and won, though.
            | None of these AIs have proved to be robust to exploitation
            | yet.
        
             | CyberRage wrote:
              | StarCraft is built in such a way that you can't create a
              | perfect, 100% winrate agent.
              | 
              | Since there is hidden information, you could always miss a
              | corner of the map where the enemy has hidden some units,
              | and lose the game.
              | 
              | Is AlphaStar "perfect"? No. Is it better than 99.9% of all
              | humans? Absolutely.
              | 
              | You don't need to create a perfect agent in most cases;
              | self-driving is a classic example.
              | 
              | If you were to deploy an agent that drives better than 95%
              | of humans, the effects would be huge.
              | 
              | It would still fail in some scenarios where professional
              | drivers wouldn't, but that doesn't really matter because
              | most people are not professional drivers.
        
               | Dylan16807 wrote:
               | > StarCraft is built in such a way that you can't create
               | a perfect, 100% winrate agent.
               | 
               | The worry isn't about perfect winrate, it's about finding
               | strategies that can _consistently_ cause the AI to lose
               | over and over.
               | 
               | In a cooperative environment, a high percentage is great.
               | 
                | In a competitive environment, that 0.1% of scenarios
                | where it's really weak will suddenly become the majority
                | of games it faces.
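                | 
                | (A toy illustration with made-up numbers: a bot that
                | loses to just 1 of 1000 strategies looks near-perfect
                | against randomly chosen opponents, but adversarial
                | opponents simply keep playing the one losing strategy.)
                | 
                |     # Hypothetical numbers, just to show how the
                |     # opponent distribution shifts.
                |     n = 1000
                |     win_vs = [1.0] * (n - 1) + [0.0]  # one weakness
                |     uniform = sum(win_vs) / n  # 0.999 vs. random picks
                |     adversarial = min(win_vs)  # 0.0 once it's found
                |     print(uniform, adversarial)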
        
               | [deleted]
        
               | ZephyrBlu wrote:
                | I know the bot will never have a 100% winrate, but I
                | think it shouldn't be able to be exploited (i.e.
                | repeatedly beaten using the same strategy).
                | 
                | Let me give you an example [0]. When AlphaStar was
                | playing on the ladder, a player in Diamond league
                | (~70-80th percentile) beat AlphaStar easily using mass
                | Ravens. If you're not aware of the strategy, it's a
                | turtle strategy where the player masses air units, and
                | it's generally considered terrible.
               | 
               | But AlphaStar was confused by the strategy, and so it
               | lost by a large margin.
               | 
               | Deploying an AI which can be exploited like this is
               | asking for trouble.
               | 
               | [0] https://www.reddit.com/r/starcraft/comments/cgzieq/al
               | phastar...
        
               | neatze wrote:
                | By any chance, do you know the replay ID of this game?
        
               | CyberRage wrote:
               | https://www.youtube.com/watch?v=Di-yRj6TIK8
        
               | CyberRage wrote:
                | But that could be fixed technically. DeepMind's goal was
                | not to create an "unexploitable" agent but to prove that
                | ML algorithms can cope with complex, dynamic environments
                | such as StarCraft.
                | 
                | It seems weird to you, but the same agent probably wins
                | against GMs most of the time. Humans have weaknesses
                | too.
                | 
                | The AI simply leans on its strengths, just like humans do.
        
               | ZephyrBlu wrote:
                | > _It seems weird to you, but the same agent probably
                | wins against GMs most of the time. Humans have weaknesses
                | too_
               | 
               | This is the whole problem though. AlphaStar beats GMs but
               | can lose to weird strategies.
               | 
                | On the other hand, GMs will almost never lose (most
                | likely a >99% winrate) to a Diamond player, no matter how
                | weird their strategies are.
               | 
               | The AI has strengths, but it also has _glaring_
               | weaknesses. Imagine if you had an AI flying a plane and
               | 99% of the time it was far better than a human pilot but
               | 1% of the time it crashed and killed everyone. I would
               | not fly on that plane.
               | 
               | Maybe a bunch more training data and time would solve
               | this type of problem, but I'm skeptical.
        
               | CyberRage wrote:
                | You're beautifully showing a side of human nature that
                | can be problematic, in my opinion.
                | 
                | First off, no human player achieves a 99% winrate against
                | Diamond players. There are many cheeses where one misstep
                | loses you the game. GMs can lose to Diamond players.
                | 
                | Now for the main part. You're saying, and I'm rephrasing
                | here:
                | 
                | Even if the AI is statistically better than humans,
                | because it has some weaknesses I'm going to prefer the
                | human.
                | 
                | But still, at the end of the day, the AI does a better
                | job on average and will be safer to use than human
                | pilots!
                | 
                | We already heavily rely on software/algorithms for our
                | most important things. All modern vehicles use electronic
                | systems that monitor/manage several key components, and
                | the stock market is heavily driven by bots.
                | 
                | If AI can do a significantly better job than a human, I
                | would choose the AI, even if it behaves strangely in that
                | 0.1% of cases. Humans are not as reliable as you think.
        
               | ZephyrBlu wrote:
                | > _First off, no human player achieves a 99% winrate
                | against Diamond players. There are many cheeses where one
                | misstep loses you the game. GMs can lose to Diamond
                | players_
               | 
               | They definitely would. You underestimate the difference
               | in skill. Top players almost always beat other GM players
               | and maintain very high winrates in top GM.
               | 
               | See for yourself: https://www.nephest.com/sc2/?season=46&
               | queue=LOTV_1V1&team-t...
               | 
               | > _But still at the end of the day, the AI does a better
               | job on average and will be safer to use than human
               | pilots!_
               | 
               | I agree, but only if that 1% or 0.1% or whatever is not
               | exploitable by someone malicious.
        
               | CyberRage wrote:
                | The link includes players with vastly lower winrates and
                | players with high winrates but over an extremely low
                | number of games.
                | 
                | We need a sufficient sample to claim a 99% winrate. Even
                | highly ranked players with 200 games (which is still a
                | low number, since a single loss can massively affect the
                | result) are not even close to an 80% winrate, and with
                | enough games it would probably be even lower.
                | 
                | Maintaining a 99% winrate is extremely hard, as you can
                | only lose a single game out of 100. People get tired, try
                | new stuff, simply don't pay attention, or just get caught
                | off guard by a new thing.
                | 
                | As for "malicious exploitation", it does pose a risk in
                | some environments, but the question then becomes exactly
                | the same:
                | 
                | Is the AI less exploitable than the average person?
                | 
                | If so, it doesn't matter.
        
               | ZephyrBlu wrote:
               | > _Is the AI less exploitable than the average person?_
               | 
               | People are generally not exploitable in the same way an
               | AI is because we can subjectively assess situations and
               | learn on the fly.
               | 
               | This is a good example of why I think your argument
               | doesn't hold water:
               | https://twitter.com/nikitabier/status/1372726911105855488
               | 
               | On the 99% winrate, I feel like you're either being
               | purposefully obtuse or have no experience with
               | competitive games.
               | 
                | The majority of the winrates are >70%, but even 60% is
                | insane for a competitive game, _especially_ at the very
                | highest level. It is ridiculously hard to maintain a
                | winrate this high even over 30 games.
               | 
                | You seem to be thinking about this from a statistical
                | perspective (i.e. moar samples) without realizing that
                | this is baked into MMR (you're matched with opponents as
                | close to your skill level as possible). These players
                | _have_ to maintain high winrates just to _stay_ at this
                | MMR, because they can earn as little as literally 0 MMR
                | for a win and lose up to 60 MMR for a loss.
               | 
               | These players are also around 3000 MMR higher than
               | Diamond players. Using the Elo model [0], this equates to
               | a 99.998% winrate.
               | 
               | 100 games in a row is also not feasible. That's ~20 hours
               | of playtime assuming 12min games.
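                | 
                | (For reference, this is the standard Elo expected-score
                | formula. The 400-point scale below is the classic chess
                | constant; the linked analysis presumably fits its own
                | scale to SC2 MMR, which is how it arrives at 99.998%
                | rather than the even more extreme number the chess
                | constant gives for a 3000-point gap.)
                | 
                |     # Standard Elo expected score. The scale constant
                |     # is the chess default; SC2's MMR curve differs.
                |     def elo_win_prob(rating_diff, scale=400.0):
                |         return 1 / (1 + 10 ** (-rating_diff / scale))
                | 
                |     print(elo_win_prob(3000))  # ~0.99999997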
               | 
               | [0] https://www.reddit.com/r/starcraft/comments/7fc30w/7_
               | orders_...
        
               | neatze wrote:
               | It is false that AlphaStar learns like humans do.
        
         | exdsq wrote:
          | The difference between the top 0.2% and the top 500 players is
          | huge too.
        
       | andyljones wrote:
       | In case anyone misses the links, this is twinned with two other
       | superb posts - one about general lessons the author learned over
       | the course of the project
       | 
       | https://clemenswinter.com/2021/03/24/my-reinforcement-learni...
       | 
        | and one about the history of the project
       | 
       | https://clemenswinter.com/2021/03/24/conjuring-a-codecraft-m...
        
       | Yenrabbit wrote:
       | This is an excellent project with a great write-up. Most articles
        | this long would lose me, but this is engaging and clear, a joy to
       | read. And I'm in awe of the amount of work that has gone into
       | every aspect of this.
       | 
       | >Seeing as my policies are currently the world's best CodeCraft
       | players, we'll just have to take their word for it for the time
       | being.
       | 
       | I really hope this inspires some competition! How long until
       | there is a leaderboard? :)
        
         | shmageggy wrote:
         | Agreed, this is better than the vast majority of machine
         | learning papers that actually get published. The ablation
         | section is particularly nice. It is really a major failing of
          | the field that, in most papers, it's entirely unclear which
          | aspects of the model (or which particular hacks) are really
          | carrying the weight.
        
       | mindfulplay wrote:
        | This is a fantastic project and a great blog! As games start to
        | include RL, it will be a lot of fun and could spawn a whole new
        | generation of interesting games (especially if games are made
        | with an RL-first mindset as opposed to using RL later on to beat
        | human beings).
       | 
       | Do you have recommendations to learn more about RL? Is CodeCraft
       | a game?
        
         | cwinter wrote:
          | Thank you for the kind words! I am also quite excited about the
          | new points in game design space that RL will unlock and am
          | planning to write another blog post on that topic.
         | 
         | I quite like https://karpathy.github.io/2016/05/31/rl/ as an
         | introduction to some of the ideas behind modern RL. Beyond
         | that, I just recently found out about
         | https://github.com/andyljones/reinforcement-learning-discord...
         | which lists a lot of other high-quality resources.
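          | 
          | If it helps, here is a minimal sketch of the policy-gradient
          | idea at the heart of that post -- not code from the post
          | itself, just REINFORCE on a toy two-armed bandit so it runs
          | with nothing but numpy:
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     payout = np.array([0.2, 0.8])  # arm 1 pays more often
          |     logits = np.zeros(2)           # policy parameters
          |     lr = 0.1
          | 
          |     for step in range(2000):
          |         # softmax policy over the two arms
          |         probs = np.exp(logits - logits.max())
          |         probs /= probs.sum()
          |         action = rng.choice(2, p=probs)
          |         reward = float(rng.random() < payout[action])
          |         # REINFORCE: move log pi(action) in proportion
          |         # to the reward received
          |         grad = -probs
          |         grad[action] += 1.0  # d log pi(a) / d logits
          |         logits += lr * reward * grad
          | 
          |     print(probs)  # ends up strongly favoring arm 1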
         | 
         | CodeCraft is a programming game which you can "play" by writing
         | a Scala/Java program that controls the game units. It's not
         | actively developed anymore but still functional:
         | http://codecraftgame.org/
        
       | CyberRage wrote:
        | Interesting blog post.
        | 
        | I found some similarities with what occurred with DeepMind's
        | AlphaStar AI.
        | 
        | One of the weaknesses that seems to manifest in this piece too is
        | the handling of unfamiliar scenarios.
        | 
        | The AI is very confused once it experiences something that was
        | rarely seen in its training data. Destroyer's big drones confused
        | the bot quite a bit.
        | 
        | DeepMind addressed this by intentionally creating agents that
        | introduce different/bizarre strategies (which they called
        | exploiters) in order to develop robustness against such
        | strategies.
        
         | cwinter wrote:
         | The bot has actually never seen Destroyer's big drones during
         | training even once, so I found it somewhat surprising that it
         | even works as well as it does!
         | 
         | Completely agree that adding something like the "League" used
         | by AlphaStar would be one of the top priorities if you wanted
         | to push this project further. I don't think CodeCraft is
         | sufficiently complex to really allow for several very distinct
         | strategies in the same way as StarCraft II, but I would still
         | expect training against a larger pool of more diverse agents to
         | increase robustness quite a bit.
        
           | CyberRage wrote:
            | What amazes me at the end of the day is that brute-forcing
            | seems to do much better than I initially thought it would.
            | 
            | Trying random stuff just sounds stupid, but with enough
            | compute and data, I guess it can overpower smart creatures
            | like us.
            | 
            | I agree that CodeCraft is vastly simpler than StarCraft, but
            | the idea is the same: just try random stuff (sometimes with
            | better logic behind it) until something works, and then
            | optimize it to perfection.
        
       ___________________________________________________________________
       (page generated 2021-03-24 23:00 UTC)