[HN Gopher] Mastering Real-Time Strategy Games with Deep RL: Mer...
___________________________________________________________________
Mastering Real-Time Strategy Games with Deep RL: Mere Mortal Edition

Author : cwinter
Score : 84 points
Date : 2021-03-24 15:05 UTC (7 hours ago)

(HTM) web link (clemenswinter.com)
(TXT) w3m dump (clemenswinter.com)

| FartyMcFarter wrote:
| > This trend has culminated in the defeat of top human players in
| the complex real-time strategy (RTS) games of DoTA 2 [1] and
| StarCraft II [2] in 2019.
|
| Not quite:
|
| - OpenAI's DoTA 2 system wasn't playing the full game. I think
| the final version could play 17 of the 117 heroes, and the
| opposing human players were also restricted to playing this
| subset of the game.
|
| - DeepMind's StarCraft II system reached a level "above 99.8% of
| officially ranked human players", so it isn't trivial to argue
| that this amounts to defeating top players.
| kevinwang wrote:
| > OpenAI's DoTA 2 system wasn't playing the full game. I think
| the final version could play 17 of the 117 heroes, and the
| opposing human players were also restricted to playing this
| subset of the game.
|
| The bigger issue in my eyes was that while OpenAI Five defeated
| the world champion team OG, when they let anyone in the world
| fight it, some ingenious players figured out a pretty robust
| method to consistently exploit and defeat the bot. As I haven't
| heard any buzz about OpenAI Five since then, I think it was more
| or less unsuccessful, unless they can show that their training
| method produces unexploitable bots (instead of bots that are
| really good against certain strategies).
| Gunax wrote:
| They can train the bots based on those games though,
| right? Seems more like a flaw in the training data than the
| principle.
|
| I am not sure if the training is done live or not -- that is,
| does the algorithm learn based on each game against a real,
| live player?
Or do they just train the model offline, then
| allow players to play against the static model?
| kevinwang wrote:
| > They can train the bots based on those games though,
| right? Seems more like a flaw in the training data than the
| principle.
|
| I guess you could phrase it that way, but that's
| essentially the problem statement for developing a strategy
| for an imperfect-information game. So I would say it is a
| flaw in the principle if their final output is exploitable.
| CyberRage wrote:
| Training requires millions of games. Playing against humans
| is only for evaluation purposes, not for training.
|
| In both cases it was indeed a static model, but more recent
| work, called MuZero, is not static and achieves
| great results in board games and Atari.
| ZephyrBlu wrote:
| Assuming the OpenAI model is similar to DeepMind's
| AlphaStar model, the model is static.
|
| And a few games of being exploited is nowhere near enough
| data for the AI to be re-trained.
| jrumbut wrote:
| Sibling posters have pointed out technical issues, but I'd
| like to point out that while chasing after each successive
| exploit might let them stay on top of the StarCraft
| rankings, it changes the accomplishment from "we made a
| model that understands StarCraft better than almost any
| human" to "we made a model that memorized a selection of
| very good strategies and tactics."
|
| When it was first demonstrated, it really looked like it
| was doing very smart things (while also taking advantage of
| the fact that it doesn't have attention lapses and hand
| fatigue) and reacting well to different strategies, on a
| level that was freaky to me.
| taberiand wrote:
| I think this is kind of pedantic. They built an AI agent to
| take pixel data as input and provide mouse movements and
| clicks as output, and rather than just flail around like a
| baby it actually played the games with sophisticated
| competency.
| This to me is such an incredible achievement that I have no
| doubt it could be enhanced to defeat top players
| consistently and easily.
|
| As another commenter remarks, there are holes to plug in terms
| of exploitable behaviours that are locked into the model, but
| this too I'm confident they will find a general method of
| preventing; on the other hand, it's not like humans aren't
| susceptible to similar exploits by competitors in situations
| where they decide to cease innovation/learning.
| TaupeRanger wrote:
| Not at all. It is a computer; of course it beats humans at
| optimization problems and speed. Not remotely surprising or
| interesting, any more than a calculator doing arithmetic
| faster than a human.
| ZephyrBlu wrote:
| > _As another commenter remarks, there are holes to plug in
| terms of exploitable behaviours that are locked into the
| model, but this too I'm confident they will find a general
| method of preventing_
|
| The problem is, I don't think there is a "general method of
| [prevention]", because that's not how neural networks work.
|
| It's not easy to fix things like this because you can't just
| say "yeah, just don't do that dumb thing anymore"; the network
| has to be re-trained to learn to counter the exploit.
|
| The way DeepMind tried to get around this is by having a
| league of AIs playing against each other which try to exploit
| each other and expose their weaknesses. It worked pretty damn
| well, but people still found ways to exploit the AI.
| wnevets wrote:
| > OpenAI's DoTA 2 system wasn't playing the full game. I think
| the final version could play 17 of the 117 heroes
|
| Limiting the number of playable heroes in DoTA 2 really isn't
| important when it comes to evaluating the skill of the AI. Most
| real players trying hard to win already play with a limited
| hero pool dictated by the current patch version.
| alach11 wrote:
| It is important.
They removed a lot of heroes with
| complicated mechanics that could have been much harder for
| the AI to play against.
| wnevets wrote:
| As someone who has played HoN & DoTA 2 for over a decade, I'm
| telling you it isn't important _when_ evaluating the
| ability of an AI to actually play the game.
|
| Drafting can be massive in deciding the outcome of games,
| even at lower skill levels. Opening up the entire hero
| pool just means you're largely evaluating the ability to
| draft in the current patch more than actual playing
| ability.
| porphyra wrote:
| At BlizzCon 2019, AlphaStar beat Serral, undeniably a top
| player (although he didn't get to use his own keyboard and
| settings or get to prepare). Serral was able to beat the Terran
| agent though.
|
| https://www.youtube.com/watch?v=nbiVbd_CEIA
| ZephyrBlu wrote:
| People also cheesed the shit out of the bot and won, though.
| None of these AIs have proved to be robust to exploitation yet.
| CyberRage wrote:
| StarCraft is built in such a way that you can't create a
| perfect, 100% winrate agent.
|
| Since there is hidden information, you could always miss a
| corner of the map where the enemy has hidden some units, and
| you lose the game.
|
| Is AlphaStar "perfect"? No. Is it better than 99.9% of all
| humans? Absolutely.
|
| You don't need to create a perfect agent in most cases;
| self-driving is a classic example.
|
| If you were to deploy an agent that drives 95% better than
| all humans, the effects would be huge.
|
| It would still fail in some scenarios where professional
| drivers wouldn't, but it doesn't really matter because most
| people are not professional drivers.
| Dylan16807 wrote:
| > StarCraft is built in such a way that you can't create
| a perfect, 100% winrate agent.
|
| The worry isn't about perfect winrate, it's about finding
| strategies that can _consistently_ cause the AI to lose
| over and over.
|
| In a cooperative environment, a high percentage is great.
|
| In a competitive environment, that 0.1% of scenarios where
| it's really weak will suddenly become the majority of
| games it faces.
| [deleted]
| ZephyrBlu wrote:
| I know the bot will never have a 100% winrate, but I think
| it shouldn't be able to be exploited (i.e. repeatedly
| beaten using the same strategy).
|
| Let me give you an example [0]. When AlphaStar was
| playing on the ladder, a player in Diamond league
| (~70-80th percentile) beat AlphaStar easily using mass
| Ravens. If you're not aware of the strategy, it's a
| turtle strategy where the player masses air units, and it
| is generally terrible.
|
| But AlphaStar was confused by the strategy, and so it
| lost by a large margin.
|
| Deploying an AI which can be exploited like this is
| asking for trouble.
|
| [0] https://www.reddit.com/r/starcraft/comments/cgzieq/alphastar...
| neatze wrote:
| By any chance, do you know the replay ID of this game?
| CyberRage wrote:
| https://www.youtube.com/watch?v=Di-yRj6TIK8
| CyberRage wrote:
| But that could be fixed technically. DeepMind's goal was
| not to create an "unexploitable" agent but to prove that
| ML algorithms can cope with complex, dynamic environments
| such as StarCraft.
|
| It may seem weird to you, but the same agent probably wins
| against GMs most of the time. Humans have weaknesses
| too.
|
| The AI simply leans on its strengths, just like humans do.
| ZephyrBlu wrote:
| > _It may seem weird to you, but the same agent probably wins
| against GMs most of the time. Humans have weaknesses
| too_
|
| This is the whole problem though. AlphaStar beats GMs but
| can lose to weird strategies.
|
| On the other hand, GMs will almost never lose (most
| likely >99% winrate) to a Diamond player, no matter how
| weird their strategies are.
|
| The AI has strengths, but it also has _glaring_
| weaknesses. Imagine if you had an AI flying a plane and
| 99% of the time it was far better than a human pilot, but
| 1% of the time it crashed and killed everyone.
I would
| not fly on that plane.
|
| Maybe a bunch more training data and time would solve
| this type of problem, but I'm skeptical.
| CyberRage wrote:
| You're beautifully showing a side of human nature which can
| be problematic, in my opinion.
|
| First off, no human player achieves a 99% winrate against
| Diamond players. There are many cheeses; one misstep
| and you lose. GMs can lose to Diamond players.
|
| Now for the main part, you're saying, and I'm rephrasing
| here:
|
| Even if the AI is statistically better than humans,
| because it has some weaknesses I'm going to prefer the
| human.
|
| But still, at the end of the day, the AI does a better job
| on average and will be safer to use than human pilots!
|
| We already heavily rely on software/algorithms for our
| most important things. All modern vehicles use electronic
| systems that monitor/manage several key components, and the
| stock market is heavily managed by bots.
|
| If AI can do a significantly better job than a human, I
| would choose the AI, even if it behaves strangely in that
| 0.1% of cases. Humans are not as reliable as you think.
| ZephyrBlu wrote:
| > _First off, no human player achieves a 99% winrate against
| Diamond players. There are many cheeses; one misstep
| and you lose. GMs can lose to Diamond players_
|
| They definitely would. You underestimate the difference
| in skill. Top players almost always beat other GM players
| and maintain very high winrates in top GM.
|
| See for yourself: https://www.nephest.com/sc2/?season=46&
| queue=LOTV_1V1&team-t...
|
| > _But still, at the end of the day, the AI does a better
| job on average and will be safer to use than human
| pilots!_
|
| I agree, but only if that 1% or 0.1% or whatever is not
| exploitable by someone malicious.
| CyberRage wrote:
| The link includes players with vastly lower winrates and
| players with high winrates but over an extremely low number
| of games.
|
| We need a sufficient number of games to claim a 99% winrate.
| Even highly ranked players with 200 games (which is still
| a low number, since a single loss can massively affect the
| results) are not even close to an 80% winrate, and with
| enough games it would probably be even lower.
|
| Maintaining a 99% winrate is extremely hard, as you can only
| lose a single game out of 100. People get tired, try new
| stuff, simply don't pay attention, or just get caught off
| guard by a new thing.
|
| As for "malicious exploitation", it does pose a risk in
| some environments, but the question then becomes exactly
| the same.
|
| Is the AI less exploitable than the average person?
|
| If so, it doesn't matter.
| ZephyrBlu wrote:
| > _Is the AI less exploitable than the average person?_
|
| People are generally not exploitable in the same way an
| AI is, because we can subjectively assess situations and
| learn on the fly.
|
| This is a good example of why I think your argument
| doesn't hold water:
| https://twitter.com/nikitabier/status/1372726911105855488
|
| On the 99% winrate, I feel like you're either being
| purposefully obtuse or have no experience with
| competitive games.
|
| The majority of the winrates are >70%, but even 60% is
| insane for a competitive game, _especially_ at the very
| highest level. It is ridiculously hard to maintain a
| winrate this high, even over 30 games.
|
| You seem to be thinking about this from a statistical
| perspective (i.e. moar samples) without realizing that
| this is baked into MMR (you're matched with opponents as
| close to your skill level as possible). These players
| _have_ to maintain high winrates just to _stay_ at this
| MMR, because they can earn as little as literally 0 MMR for
| a win and lose up to 60 MMR for a loss.
|
| These players are also around 3000 MMR higher than
| Diamond players. Using the Elo model [0], this equates to
| a 99.998% winrate.
|
| 100 games in a row is also not feasible.
That's ~20 hours
| of playtime, assuming 12-minute games.
|
| [0] https://www.reddit.com/r/starcraft/comments/7fc30w/7_
| orders_...
| neatze wrote:
| It is false that AlphaStar learns like humans do.
| exdsq wrote:
| The difference between the top 0.2% and the top 500 players is
| huge too.
| andyljones wrote:
| In case anyone misses the links, this is twinned with two other
| superb posts - one about general lessons the author learned over
| the course of the project
|
| https://clemenswinter.com/2021/03/24/my-reinforcement-learni...
|
| and one about the history of the project
|
| https://clemenswinter.com/2021/03/24/conjuring-a-codecraft-m...
| Yenrabbit wrote:
| This is an excellent project with a great write-up. Most articles
| this long would lose me, but this one is engaging and clear, a
| joy to read. And I'm in awe of the amount of work that has gone
| into every aspect of this.
|
| > Seeing as my policies are currently the world's best CodeCraft
| players, we'll just have to take their word for it for the time
| being.
|
| I really hope this inspires some competition! How long until
| there is a leaderboard? :)
| shmageggy wrote:
| Agreed, this is better than the vast majority of machine
| learning papers that actually get published. The ablation
| section is particularly nice. It is really a major failing of
| the field that in most papers it's entirely unclear which
| aspects of the model (or which particular hacks) are really
| carrying the weight.
| mindfulplay wrote:
| This is a fantastic project and a great blog! As games start to
| include RL, it will be a lot of fun and could spawn a whole new
| generation of interesting games (especially if games are made
| with an RL-first mindset as opposed to using RL later on to beat
| human beings).
|
| Do you have recommendations for learning more about RL? Is
| CodeCraft a game?
| cwinter wrote:
| Thank you for the kind words!
I am also quite excited about the
| new points in game design space that RL will unlock, and am
| planning to write another blog post on that topic.
|
| I quite like https://karpathy.github.io/2016/05/31/rl/ as an
| introduction to some of the ideas behind modern RL. Beyond
| that, I just recently found out about
| https://github.com/andyljones/reinforcement-learning-discord...
| which lists a lot of other high-quality resources.
|
| CodeCraft is a programming game which you can "play" by writing
| a Scala/Java program that controls the game units. It's not
| actively developed anymore but still functional:
| http://codecraftgame.org/
| CyberRage wrote:
| Interesting blog post.
|
| I found some similarities with what occurred with DeepMind's
| AlphaStar AI.
|
| One of the weaknesses that seems to manifest in this piece too
| is the handling of unfamiliar scenarios.
|
| The AI is very confused once it experiences something that was
| rarely seen in its training data. Destroyer's big drones
| confused the bot quite a bit.
|
| DeepMind solved it by intentionally creating agents that
| introduce different/bizarre strategies (which they called
| exploiters) in order to develop robustness against such
| strategies.
| cwinter wrote:
| The bot has actually never seen Destroyer's big drones during
| training even once, so I found it somewhat surprising that it
| even works as well as it does!
|
| Completely agree that adding something like the "League" used
| by AlphaStar would be one of the top priorities if you wanted
| to push this project further. I don't think CodeCraft is
| sufficiently complex to really allow for several very distinct
| strategies in the same way as StarCraft II, but I would still
| expect training against a larger pool of more diverse agents to
| increase robustness quite a bit.
| CyberRage wrote:
| What amazes me at the end of the day is that brute-forcing
| seems to do much better than I initially thought it would.
|
| Trying random stuff just sounds stupid, but with enough
| compute and data, I guess it can overpower smart creatures
| like us.
|
| I agree that CodeCraft is vastly simpler than StarCraft, but
| the idea is the same: just try random stuff (sometimes with
| better logic behind it) until something works, and then
| optimize it to perfection.
___________________________________________________________________
(page generated 2021-03-24 23:00 UTC)