2021-04-05
------------------------------------------------------------------
Here's a quote, somewhat edited, from the 80,000 Hours Podcast with Brian Christian, published 5th of March 2021.
------------------------------------------------------------------

They announced in 2015 that they had this single agent architecture that could play 60 different Atari games, the majority of them at superhuman level. However, if you scroll down the figure from their Nature paper, you notice that at the very bottom of the list is this game called Montezuma's Revenge, where DQN scored a grand total of zero points. So what was it about this particular game? Montezuma's Revenge has what's called sparse reward. You have to do a lot of things right in order to get the first explicit points of the game: you have to climb this ladder, run across this conveyor belt, jump a chasm, swing on a rope, dodge an enemy, climb another ladder and jump up, all to get this little key that gives you a hundred measly points.

So the question was, how do you overcome this problem of sparsity? And the funny thing is, humans have no problem playing this game. It's actually a very unfun game, but we have no problem knowing what to do. Part of it is just that we have this kind of intrinsic motivation: we want to know what happens if we get our character over to this part of the screen, or we interact with this object, or we go into a new room. What's over there? All of that is at least as motivating to a human player as the points in the corner of the screen.

And so this group from DeepMind decided to borrow explicitly from the cognitive science of what we know about babies and say, OK, we know that babies exhibit this thing called preferential looking: from just two weeks old, if you follow their eye movements, they will prefer to look at something they've never seen before.
So they created this supplementary reward system for their agent that would give it the equivalent of in-game points just for seeing an image on the screen that it had never seen before. And suddenly now you have enough of a little breadcrumb trail to know you're on the right track.

This tendency can also lead to perverse outcomes, though. There's a group from Berkeley and OpenAI that put a TV screen inside one of the maze environments that their agent was trying to escape, and the TV screen totally hijacks this novelty reward. It's very striking, because you can watch videos of their agent navigating this maze, and as soon as it swings around such that the TV is in view, it's just paralysed, just glued to the TV screen.
------------------------------------------------------------------
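The novelty bonus described in the quote, and the way a TV of ever-changing images hijacks it, can be sketched in a few lines. This is a toy count-based version (the actual DeepMind work used pseudo-counts from a learned density model over screen images, not exact counts, and the class and names here are my own illustration): each observation pays an intrinsic bonus that shrinks the more often it has been seen, so a screen that is different on every step never stops paying.

```python
from collections import defaultdict
import math
import random

class NoveltyBonus:
    """Toy count-based novelty reward: the n-th visit to an
    observation pays an intrinsic bonus of 1/sqrt(n)."""
    def __init__(self):
        self.counts = defaultdict(int)

    def bonus(self, obs):
        self.counts[obs] += 1
        return 1.0 / math.sqrt(self.counts[obs])

nb = NoveltyBonus()

# Revisiting the same screen 100 times: the bonus decays toward zero,
# so the agent eventually moves on to find something new.
same = [nb.bonus("room_1") for _ in range(100)]

# A "TV" showing random static is a brand-new observation every step,
# so it pays the full bonus forever -- the agent gets glued to it.
rng = random.Random(0)
tv = [nb.bonus(("tv", rng.random())) for _ in range(100)]

print(f"100 visits to one room: total bonus {sum(same):.2f}")
print(f"100 frames of TV static: total bonus {sum(tv):.2f}")
```

The total bonus for staring at the static grows linearly with time, while the bonus for any fixed part of the maze is bounded, which is exactly the hijacking the quote describes.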