“Liking” as an early and editable draft of long-run affective value

Peter Dayan (MPI for Biological Cybernetics, Tübingen; University of Tübingen)

Date: 2022-01

Psychological and neural distinctions between the technical concepts of “liking” and “wanting” pose important problems for motivated choice for goods. Why could we “want” something that we do not “like,” or “like” something but be unwilling to exert effort to acquire it? Here, we suggest a framework for answering these questions through the medium of reinforcement learning. We consider “liking” to provide immediate, but preliminary and ultimately cancellable, information about the true, long-run worth of a good. Such initial estimates, viewed through the lens of what is known as potential-based shaping, help solve the temporally complex learning problems faced by animals.

Introduction

Berridge and his colleagues [1–4] have long argued that there is a critical difference between “liking” and “wanting.” The scare quotes are copied from papers such as Morales and Berridge’s [1] to distinguish the more precise quantities that these authors have in mind from the arguably more blurry everyday meanings of these terms, or from the subjective reports that humans can provide upon verbal request. This distinction has been studied in greatest detail in the case of comestibles such as food and liquid; however, as we will see later, it applies more generally. Crudely, “liking” concerns the hedonic value of a good such as a food, whereas “wanting” refers to the motivational force that the good can exert in terms of reorganising the behaviour of the agent in its direction (be that by largely Pavlovian mechanisms, as in incentive sensitization [5,6], or also by instrumental means [7,8]). “Liking,” which, for comestibles in animals, is typically assessed using characteristic orofacial reactions [9–11], is associated with activity in what is reported to be a relatively fragile network of subareas in the gustatory and insular cortex, the ventral striatum, and the ventral pallidum; it is broadly unaffected by dopaminergic manipulations but is modulated by opioids. By contrast, “wanting” arises from the robust dopaminergic systems connecting midbrain, striatum, and beyond.

It might seem obvious that, in untechnical terms, liking and wanting should be umbilically connected, so that we like what we want, and vice versa. It is therefore surprising that this is apparently not always the case—it is often reported in the context of addiction that drugs that are keenly “wanted” (to a significantly detrimental extent) no longer generate substantial hedonic “liking” [5]. Furthermore, neuroeconomists have delineated an even wider range of utilities [12,13] whose mutual divergence can lead to anomalies. Thus, along with hedonic and decision utility, which are close to “liking” and “wanting,” respectively, are predicted utility (how much the outcome is expected to be “liked”) and remembered utility (what one remembers about how a good was previously “liked”)—and one could imagine “wanting” versions of these latter two utilities also.
The area of food reward casts these issues in rather stark relief [14,15]. Thus, recent evidence is not consistent with the idea that overconsumption and obesity (putatively consequences of over-“wanting”) are increasing because of the devilishly clever “liking”-based hedonic packaging of relatively deleterious foods with sweet and fat tastes and textures [16–18]. Instead, careful experiments dissociating the oral sensory experience of foods from their gastric consequences [19–22] suggest that it is the postingestive assessment by the gut of what it receives that is important for the (over)consumption. The substrate of this, involving projections via the vagus nerve that end up in the dopamine system [23–25], is quite consistent with a role in “wanting.”

Why then indeed should we have both “liking” and “wanting”? In this essay, we argue that “liking” systems play the role of what is known as potential-based shaping [26] in the context of reinforcement learning (RL; [27]). “Liking” provides a preliminary, editable, draft version of the long-run worth of a good [28]. By providing an early guess at a late true value, this can help with the notorious temporal credit assignment problem in RL [27], which stems from the fact that, in most interesting domains, agents and animals alike have to wait for a long period of time and/or make whole sequences of choices before finding out, much later, whether these were appropriate. These preliminary, draft, hedonic values thus steer animals towards what will normally be appropriate choices—making learning operate more effectively.

RL borrowed the term “shaping” from psychology [29–31] to encompass a number of methods for improving the speed and reliability of learning—just like the effect we are arguing for here. One class of methods systematically adds quantities to “true” underlying rewards; however, as with many methods that manipulate utilities, unintended consequences are rife. Potential-based shaping was suggested by Ng and colleagues [26] as a variant that is guaranteed not to have such consequences and indeed is equivalent to a typically optimistic initialization of the estimation of values [32].

In the case of victuals: for survival, animals actually care about the nutritive value of foods (which is why nutritive values underpin “wanting”)—this is the long-run worth. However, it takes time for the digestive system to process these foods to determine their underlying value, making it difficult to criticise or reward the actions that led to them in the first place. This is exactly the temporal credit assignment problem. Instead, exteroceptive sensory input from the mouth and nose (and even the visual system) underpins a guess at this true value—providing immediate hedonic feedback for the choice of action. Usually, this guess is good, and so the two systems harmonise seamlessly. Given disharmony, it is the nutritive value that should determine ultimate choice, as described above. Thus, even though the orofacial “liking” responses might themselves not be manipulated by “wanting” system substrates such as dopamine, it is by activating dopamine systems in particular patterns that hedonic value can act appropriately.
We first describe conventional model-free methods for prediction in RL, and the role of potential-based shaping in this. We then use the case of flavour–nutrient conditioning to suggest how the systems concerned might interact. Finally, in the discussion, we touch on some more speculative suggestions about the underlying source of utility in the context of homeostatic RL [33] and discuss a version of the same argument, but now for aesthetic value [34].

Model-free RL

In the main part of this essay, we concentrate on Pavlovian conditioning [35]—the case in which predictions about future, potentially valuable outcomes lead to automatic actions such as approach, engagement, and even licking (whether or not those actions are actually useful for acquiring those outcomes; [36]). Thus, we focus on problems of evaluation and save consideration of the choice between actions for later.

We consider a Markov prediction problem in a terminating, episodic case with no temporal discounting. Here, there are connected, nonterminal states s, a special terminating state s*, a transition matrix T among just the nonterminal states, with T_{ss'} = P(s_{t+1} = s' | s_t = s) and the remaining probability 1 − Σ_{s'} T_{ss'} being assigned to the terminating state, and rewards r_s associated with state s (which we will assume to be deterministic for convenience; we also write the vector r for the rewards over all states), with r_{s*} = 0. Then, if we write V_s for the long-run value of state s (the value of s* is 0), and the vector V for all the values, we have

V_s = Σ_{s'} T_{ss'} (r_{s'} + V_{s'})        (1)

V = (I − T)^{−1} T r        (2)

by writing the recursion directly (and noting that T excludes the terminating state, which means that (I − T) is invertible).

The simplest form of temporal difference (TD) learning [27,37] attempts to learn the values V_s from stochastic trajectories s_1, s_2, s_3, …, s* generated by sampling from T. TD accomplishes this by constructing a prediction error from the sampled difference between the right and left sides of Eq 1,

δ_t = r_{s_{t+1}} + V_{s_{t+1}} − V_{s_t}        (3)

and applying

V_{s_t} ← V_{s_t} + α δ_t        (4)

where α is the learning rate. There is substantial evidence that the phasic activity of at least some dopamine neurons in the ventral tegmental area (VTA) of the midbrain, and the release of dopamine in target regions such as the nucleus accumbens, reports the TD prediction error δ_t of Eq 3 [7,38–42]. In cases of instrumental conditioning, when actions must also be chosen, the prediction error δ_t can also be used to criticise a choice (in an architecture called the actor-critic; [43]). The idea is that actions that lead either to unexpectedly good rewards (judged by r_{s_{t+1}}) or to unexpectedly good states (judged by large predicted long-run future rewards, V_{s_{t+1}}) should be more likely to be chosen in the future. This can be measured by δ_t.
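As a concrete illustration (not part of the original analysis), here is a minimal Python sketch of tabular TD(0) in the spirit of Eqs 3 and 4, applied to a chain problem like that of Fig 1. All names, constants, and implementation choices (run_episode, T_CHAIN, P_MORSEL, and so on) are our own assumptions; the exact values from Eq 2 are computed alongside as a check, and the optional ϕ argument anticipates the shaping term introduced below.

import numpy as np

rng = np.random.default_rng(0)

# Chain problem in the spirit of Fig 1 (all details here are illustrative):
# s0 --(p = 0.3)--> s1 -> s2 -> ... -> sT (reward 1 on entering sT) -> s*
# s0 --(0.7)------> s*
T_CHAIN = 10        # index of the rewarded state sT
P_MORSEL = 0.3      # probability of the transition s0 -> s1
ALPHA = 0.1         # learning rate
N_TRIALS = 1000
TERMINAL = -1       # sentinel for the terminating state s*

def run_episode(V, phi=None, alpha=ALPHA):
    """One episode of tabular TD(0) (Eqs 3 and 4). V is an array of value
    estimates indexed by state 0..T_CHAIN; phi, if given, is a shaping
    function of the same shape, added in difference form (used later)."""
    s = 0
    while s != TERMINAL:
        if s == 0:
            s_next = 1 if rng.random() < P_MORSEL else TERMINAL
        elif s < T_CHAIN:
            s_next = s + 1
        else:                                        # leaving sT
            s_next = TERMINAL
        r_next = 1.0 if s_next == T_CHAIN else 0.0   # reward on entering sT
        v_next = 0.0 if s_next == TERMINAL else V[s_next]
        delta = r_next + v_next - V[s]               # Eq 3
        if phi is not None:                          # shaping term (see below)
            phi_next = 0.0 if s_next == TERMINAL else phi[s_next]
            delta += phi_next - phi[s]
        V[s] += alpha * delta                        # Eq 4
        s = s_next

# Exact values from Eq 2, V = (I - T)^{-1} T r, as a check.
n = T_CHAIN + 1
T_mat = np.zeros((n, n))
T_mat[0, 1] = P_MORSEL                 # the remaining 0.7 goes to s*
for s in range(1, T_CHAIN):
    T_mat[s, s + 1] = 1.0              # sT itself goes to s* with certainty
r = np.zeros(n)
r[T_CHAIN] = 1.0
V_exact = np.linalg.solve(np.eye(n) - T_mat, T_mat @ r)

V = np.zeros(n)
for _ in range(N_TRIALS):
    run_episode(V)
print("TD estimate of V_{s0}:", round(V[0], 3), "  exact:", round(V_exact[0], 3))

Run as is, the TD estimate of V_{s0} approaches the exact value of 0.3 only slowly and noisily, which is the sloth discussed next.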
Although TD learning is powerful, offering various guarantees of convergence when the learning rate α satisfies suitable conditions, it has the problem of being sometimes slow. To illustrate this, we consider a case related to the one that we will consider later in flavour–nutrient conditioning. Fig 1A shows a case in which, from a start state s = s0, there is a high probability (1 − p = 0.7) transition directly to the terminal state s*, and a low probability (p = 0.3) transition to state s = s1, which is associated with an observation (later modelling the oral sensation of a morsel of food) and which initiates a sequence of T states leading to a rewarding outcome r_T = 1 (later modelling the gut’s evaluation of this morsel) and then the terminal state s*. Fig 1B depicts the course of learning of the value structure associated with selected states, applying Eqs 3 and 4. The upper plot depicts the average value (across 1,000 simulations) for all nonterminal states as a function of learning trial. As expected for this sort of complete serial compound stimulus representation [44,45], in which every time step following the morsel of food is separately individuated, the value of the reward available at sT apparently propagates backwards to s1. The further propagation to s0 is then affected by the stochasticity at that single state. The lower plot shows the evolution of V_{s0} for one single run; the slow rise and stochastic fluctuations are evident.

Fig 1. TD-based Markov prediction. (A) Simple Markov prediction problem with a tasty morsel provided at t = 1 (s = s1) with probability p = 0.3, which leads to a digestive reward of r_T = 1 at time T. (B) Evolution of the values V_s for the application of TD learning to the case that T = 10. Upper plot: average over 1,000 simulations (here, and in later figures, we label state si by just its index i); lower plot: single simulation showing V_{s0}. (C) Evolution of the TD prediction error δ_t over the same trials. Upper plot: average over 1,000 simulations; lower plots: single simulation showing δ_0 for a transition to s = s1 (above) or to s = s* (below). Here, α = 0.1. TD, temporal difference. https://doi.org/10.1371/journal.pbio.3001476.g001

Fig 1C shows the prediction errors that occasion the learning of the values shown in Fig 1B. For convenience, in the single example beneath, we have separated the case that the transition from s0 is to s1, and ultimately to the actual reward at sT (upper), from the case that the transition is to s*, and thus no reward (lower). Given that the average value of V_{s0} lies between the worths of these two outcomes, the former transition is associated with a positive prediction error; the latter with a negative one. Note that at the end of learning, the only prediction error arises at time t = 0, because of the stochasticity associated with the transition to s1 versus s*. At all other states, predictions are deterministically correct. Again, with the complete serial compound stimulus representation, over the course of learning, the prediction error apparently moves backwards in time during the trial—a phenomenon that has been empirically rather elusive, at least until very recently [46].

The most salient characteristic of the learning in this case is its sloth—apparent in the averages and the single instance. There are two reasons for this. First, p is low, which means that the agent usually fails to sample s1 and the cascade leading to the reward. The second is that the learning rate α = 0.1 is rather modest. Increasing α leads to faster learning, but also to greater fluctuations in the values and prediction errors. In particular, in this simple case, it would be possible to speed up learning somewhat by using a temporally extended representation of the stimulus [45,47] or an eligibility trace (the λ in TD(λ); [37]), as sketched below. However, in general circumstances, these can be associated with substantial variability or noise—particularly for long gaps such as that between ingestion and digestion—and so would not be a panacea in our case. Sophisticated modern models of conditioning that provide a substantially more neurobiologically faithful account of learning in temporally extended cases (e.g., [48]) also currently concentrate on relatively modest time gaps.
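For illustration only, the following hypothetical sketch shows how an eligibility trace could be added to the same chain problem; it reuses rng, T_CHAIN, P_MORSEL, and TERMINAL from the earlier sketch and implements standard accumulating-trace TD(λ), in which every recently visited state shares in each prediction error.

import numpy as np

# Reuses rng, T_CHAIN, P_MORSEL and TERMINAL from the earlier sketch.
def run_episode_td_lambda(V, lam=0.9, alpha=0.1):
    """One episode of TD(lambda) with accumulating eligibility traces.
    Each prediction error updates every recently visited state at once,
    so credit can reach back towards s0 within a single episode."""
    e = np.zeros_like(V)                 # eligibility traces
    s = 0
    while s != TERMINAL:
        if s == 0:
            s_next = 1 if rng.random() < P_MORSEL else TERMINAL
        elif s < T_CHAIN:
            s_next = s + 1
        else:
            s_next = TERMINAL
        r_next = 1.0 if s_next == T_CHAIN else 0.0
        v_next = 0.0 if s_next == TERMINAL else V[s_next]
        delta = r_next + v_next - V[s]   # same prediction error as Eq 3
        e[s] += 1.0                      # mark the current state as eligible
        V += alpha * delta * e           # all traced states are updated
        e *= lam                         # traces decay (no discounting here)
        s = s_next

With λ near 1, the reward at the end of the chain updates every state visited in that episode, at the price of greater variance, which is the variability alluded to above.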
Potential-based shaping

Shaping was originally suggested in the context of policy learning as a way of leading subjects through a sequence of steps in order to facilitate learning of good performance in a particular task [30]. The idea is to provide a set of intermediate (typically state- and/or action-dependent) rewards that are different from those specified by the task itself, in order to provide an easier path for animals to learn appropriate final behaviour. The benefit of this has also been recognised in RL (e.g., [26,49]), also leading to ideas about intrinsic rewards [50], by contrast with the extrinsic rewards that are determined by the task. The benefits of such intermediate rewards come on top of those of improved representations such as those mentioned above.

Citing entertaining examples such as the microcircling bicycle of Randløv and Alstrøm [49], Ng and colleagues [26] observed that manipulating the reward structure (r_s in our terms) can have unintended consequences—skewing predictions (and, in instrumental cases, choices) away from their optimal values for the original task. They therefore suggested a scheme called potential-based shaping, which could steer learning but with a guarantee of no asymptotic effect. This involves adding a function of state ϕ_s, in difference form, to TD error terms such as that in Eq 3, making it

δ_t = r_{s_{t+1}} + ϕ_{s_{t+1}} − ϕ_{s_t} + V_{s_{t+1}} − V_{s_t}        (5)

The name potential-based shaping comes from the fact that the net effect of ϕ, summed around cycles of states, is 0, because it appears in difference form—thus, it satisfies the same sort of no-curl condition as a conventional potential function. This means that it does not distort the values ascribed to states at the asymptote of learning, when the predictions have converged. However, the idea is that the shaping function provides a hint about the values of states—being large for states that are associated with large long-run reward. Thus, a transition from a state s_t = s to s_{t+1} = s′ when ϕ_s is low and ϕ_{s′} is high will provide immediate positive error information, allowing the value V_s for state s to be increased even if V_{s′} has not yet been learned and so is still 0. In an instrumental conditioning case, the resulting high value of δ_t will also be useful information that the action just taken, which led to this reward and transition, is also unexpectedly good and so is worth favouring (as a form of conditioned reinforcement; [51]).

For the Markov prediction problem of Fig 1, the appropriate shaping function associated with the morsel of food is rather straightforward—it should be ϕ_s = 1 for s = s1…sT−1, and ϕ_s = 0 otherwise. The reason is that ingestion of the morsel, with its sweet taste (at s1), predicts the benefit of digestion (at sT) for all those future times. Formally, the hedonic value is generated by the difference ϕ_{s_{t+1}} − ϕ_{s_t} in Eq 5. Fig 2 shows the course of learning in the Markov prediction problem of Fig 1, given this perfect shaping function (shown in Fig 2A). It is apparent that acquisition of the correct value for V_{s0} is greatly accelerated, as is the advent of the correct set of prediction errors (which are immediately zero for s ≠ s0). This shows the benefit of shaping. The agent can learn quickly that the state giving access to the morsel of food is itself appetitive. Furthermore, in a more complex problem in which there is a choice between actions, one of which provides access to s0, this action could also be learned as being worth 0.3 units of reward.
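Continuing the earlier illustrative sketch (and reusing its run_episode, T_CHAIN, and N_TRIALS; the details remain our own assumptions rather than code from the paper), the perfect shaping function just described can be passed directly into the TD update of Eq 5:

import numpy as np

# Reuses run_episode, T_CHAIN and N_TRIALS from the first sketch.
# "Perfect" shaping: phi = 1 for s1..s_{T-1}, 0 at s0 and sT (cf. Fig 2A).
phi = np.zeros(T_CHAIN + 1)
phi[1:T_CHAIN] = 1.0

V_shaped = np.zeros(T_CHAIN + 1)
for _ in range(N_TRIALS):
    run_episode(V_shaped, phi=phi)

# With this phi, the prediction errors inside the chain cancel exactly, so the
# learned V stays near 0 for s1..sT, while V at s0 heads quickly towards
# p = 0.3; the total prediction from any state is V + phi.
print("learned V_{s0}:", round(V_shaped[0], 3))
print("learned V for s1..sT:", np.round(V_shaped[1:], 3))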
Fig 2. TD-based Markov prediction with perfect shaping. (A) The ideal shaping function ϕ (blue circles) is 1 after acquisition of the food (at s1) until the reward arrives (red cross at sT). (B) Evolution of the values V_s for the application of TD learning to the case that T = 10. Upper plot: average over 1,000 simulations; lower plot: single simulation showing V_{s0}. (C) Evolution of the TD prediction error δ_t over the same trials. Upper plot: average over 1,000 simulations; lower plots: single simulation showing δ_0 for a transition to s = s1 (above) or to s = s* (below). Here, α = 0.1. TD, temporal difference. https://doi.org/10.1371/journal.pbio.3001476.g002

Note also an important difference between Figs 1B and 2B—namely that, at the end of learning, V_s = 0 for s = sτ, τ ≥ 1, in the latter, but not the former. The reason for this is that the prediction error is 0 for t ≠ 0, because of the perfection of the shaping function—implying that there is nothing to learn for the states that lie between ingestion and digestion. Thus, Fig 2C shows that there is no prediction error within a trial either (and so no backward propagation thereof), except just at the start state. In fact, the total prediction of the long-run reward from a state is V_s + ϕ_s. It has thus also been observed that a perfect substitute for this sort of potential-based shaping is to initialize V_s = ϕ_s, and then use standard TD learning, as in Eqs 3 and 4 [32]. However, although this is technically correct, it is not suitable for our purposes of flavour–nutrient conditioning, since it does not respect a separation between taste processing and conditioning mechanisms.

If the shaping function ϕ_s is not perfect, then the course of learning will be at least partially disrupted. Fig 3 shows a case in which the shaping function decays linearly from its value of 1 at s1, as if the prediction from the taste system associated with the future digestive benefit cannot last as long as the time that the gut takes to process the food morsel. Furthermore, as a very abstract model of the time the digestive system might take to process the food, the same total reward is spread over five time steps.

Fig 3. TD-based Markov prediction with a partial shaping function. (A) A suboptimal shaping function ϕ that decreases from 1 to 0 linearly after acquisition of the food (at s1), and with reward spread over five time steps (red crosses; note the extension of the state space to T = 15). (B) Evolution of the values V_s for the application of TD learning to this case. Upper plot: average over 1,000 simulations; lower plot: single simulation showing V_{s0}. (C) Evolution of the TD prediction error δ_t over the same trials. Upper plot: average over 1,000 simulations; lower plots: single simulation showing δ_0 for a transition to s = s1 (above) or to s = s* (below). Here, α = 0.1. TD, temporal difference. https://doi.org/10.1371/journal.pbio.3001476.g003

In this case, the prediction is acquired very quickly at first, but then temporarily and modestly decreases (between around trials 200 and 400 in the example) before recovering. The suppression arises because δ_t < 0 for t = 1…T−1 on early learning trials (since ϕ_{s_t} is decaying linearly over these times), and this negative prediction error propagates backwards to influence V_{s0}. Later, the positive prediction error that starts out associated with the digestive report of the nutritive value (i.e., r_T = 1) itself propagates back to overwhelm the suppression. Furthermore, over the course of learning, the asymptotic value V_s comes exactly to compensate for the inadequacy of the shaping function, such that V_s + ϕ_s is the long-run reward from state s.
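As a final illustration, an imperfect, decaying ϕ can be dropped into the same sketch (again reusing run_episode, T_CHAIN, and N_TRIALS from above). For brevity, this keeps the single end-of-chain reward rather than Fig 3's five-step spread and extended chain, so only the qualitative pattern of an early dip and recovery, together with the compensation V_s + ϕ_s, should be expected:

import numpy as np

# Reuses run_episode, T_CHAIN and N_TRIALS from the first sketch.
# Imperfect shaping: phi falls linearly from 1 at s1 towards 0 at sT.
phi_decay = np.zeros(T_CHAIN + 1)
phi_decay[1:T_CHAIN] = np.linspace(1.0, 0.0, T_CHAIN)[:-1]

V_partial = np.zeros(T_CHAIN + 1)
history = []                                  # track V at s0 across trials
for _ in range(N_TRIALS):
    run_episode(V_partial, phi=phi_decay)
    history.append(V_partial[0])

# Early on, the decaying phi yields negative prediction errors inside the
# chain, so V at s0 rises quickly and then dips before the end-of-chain
# reward propagates back; asymptotically V_s compensates for the imperfect
# phi, so that V_s + phi_s approximates the long-run reward from each state.
print("V_{s0} early, middle, late:",
      [round(history[i], 3) for i in (49, 299, 999)])
print("V + phi for s1..sT-1:",
      np.round((V_partial + phi_decay)[1:T_CHAIN], 2))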
"Accelerating the publication of peer-reviewed science." Licensed under Creative Commons Attribution (CC BY 4.0) URL: https://creativecommons.org/licenses/by/4.0/ via Magical.Fish Gopher News Feeds: gopher://magical.fish/1/feeds/news/plosone/