2021-04-05
------------------------------------------------------------------
Here's a quote, somewhat edited, from the 80,000 Hours Podcast with Brian Christian, published 5th of March 2021.
------------------------------------------------------------------

They announced in 2015 that they had this single agent architecture that could play 60 different Atari games, the majority of them at superhuman level. However, if you scroll down the figure from their Nature paper, you notice that at the very bottom of the list is this game called Montezuma's Revenge, where DQN scored a grand total of zero points. So what was it about this particular game? Montezuma's Revenge has what's called sparse reward. You have to do a lot of things right in order to get the first explicit points of the game: you have to climb this ladder, run across this conveyor belt, jump a chasm, swing on a rope, dodge an enemy, climb another ladder and jump up, all to get this little key that gives you a hundred measly points.

So the question was, how do you overcome this problem of sparsity? And the funny thing is, humans have no problem playing this game. It's actually a very unfun game, but we have no problem knowing what to do. Part of it is just that we have this kind of intrinsic motivation: we want to know what happens if we get our character over to this part of the screen, or we interact with this object, or we go into a new room. What's over there? All of that is at least as motivating to a human player as the points in the corner of the screen.

And so this group from DeepMind decided to borrow explicitly from the cognitive science of what we know about babies and say, OK, we know that babies exhibit this thing called preferential looking: from just two weeks old, if you follow their eye movements, they will prefer to look at something they've never seen before.
So they created this supplementary reward system for their agent that would give it the equivalent of in-game points just for seeing an image on the screen that it had never seen before. And suddenly now you have enough of a little breadcrumb trail to know you're on the right track.

This tendency can also lead to perverse outcomes, though. There's a group from Berkeley and OpenAI that put a TV screen inside one of the maze environments that their agent was trying to escape, and the TV screen totally hijacks this novelty reward. It's very striking, because you can watch videos of their agent navigating this maze, and as soon as it swings around such that the TV is in view, it's just paralysed, just glued to the TV screen.
------------------------------------------------------------------
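The novelty bonus described in the quote, and the way a TV of ever-changing images hijacks it, can be sketched in a few lines. This is a toy count-based version (the actual DeepMind work used pseudo-counts from a learned density model over screen images, not exact counts, and the class and names here are my own illustration): each observation pays an intrinsic bonus that shrinks the more often it has been seen, so a screen that is different on every step never stops paying.

```python
from collections import defaultdict
import math
import random

class NoveltyBonus:
    """Toy count-based novelty reward: the n-th visit to an
    observation pays an intrinsic bonus of 1/sqrt(n)."""
    def __init__(self):
        self.counts = defaultdict(int)

    def bonus(self, obs):
        self.counts[obs] += 1
        return 1.0 / math.sqrt(self.counts[obs])

nb = NoveltyBonus()

# Revisiting the same screen 100 times: the bonus decays toward zero,
# so the agent eventually moves on to find something new.
same = [nb.bonus("room_1") for _ in range(100)]

# A "TV" showing random static is a brand-new observation every step,
# so it pays the full bonus forever -- the agent gets glued to it.
rng = random.Random(0)
tv = [nb.bonus(("tv", rng.random())) for _ in range(100)]

print(f"100 visits to one room: total bonus {sum(same):.2f}")
print(f"100 frames of TV static: total bonus {sum(tv):.2f}")
```

The total bonus for staring at the static grows linearly with time, while the bonus for any fixed part of the maze is bounded, which is exactly the hijacking the quote describes.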