Abstract

The basal ganglia (BG) contribute to reinforcement learning (RL) and decision-making, but unlike artificial RL agents, they rely on complex circuitry and dynamic dopamine modulation of opponent striatal pathways to do so. We develop the OpAL* model to assess the normative advantages of this circuitry. In OpAL*, learning induces opponent pathways to differentially emphasize the history of positive or negative outcomes for each action. Dynamic DA modulation then amplifies the pathway most tuned for the task environment. This efficient coding mechanism avoids a vexing explore–exploit tradeoff that plagues traditional RL models in sparse reward environments. OpAL* exhibits robust advantages over alternative models, particularly in environments with sparse reward and large action spaces. These advantages depend on opponent and nonlinear Hebbian plasticity mechanisms previously thought to be pathological. Finally, OpAL* captures risky choice patterns arising from DA and environmental manipulations across species, suggesting that they result from a normative biological mechanism.

Editor's evaluation

This paper provides a formal analysis of the normative advantage of the opponent pathways of the basal ganglia circuit for cost-benefit decision-making. Specifically, a previously introduced Hebbian nonlinearity is combined with reward-based DA modulation to optimize exploration across lean and rich environments, and across a range of pharmacological and contextual manipulations. The scope of the model, its biological plausibility, and its normative and descriptive aspects are likely to have a significant impact. https://doi.org/10.7554/eLife.85107.sa0

Introduction

Everyday choices involve integrating and comparing the subjective values of alternative actions. Moreover, the degree to which one prioritizes the benefits or costs in forming subjective preferences may vary between and even within individuals. For example, one may typically use food preference to guide one's choice of restaurant, but be more likely to minimize costs (e.g., speed, distance, price) when only low-quality options are available (e.g., only fast-food restaurants are open). In this article, we evaluate the computational advantages of such context-dependent choice strategies and how they may arise from biological properties within the basal ganglia (BG) and dopamine (DA) system.

In ecological settings, there are often multiple available actions, and rewards are sparse. In machine learning, this combination is particularly vexing for reinforcement learning (RL) agents due to a difficult exploration/exploitation tradeoff (Sutton and Barto, 2018), and approaches to confront this problem typically require prior task-specific knowledge (Riedmiller et al., 2018). We set out to study how the architecture of biological RL might circumvent this problem. We find that biological properties within this system – specifically, the presence of opponent striatal pathways, nonlinear Hebbian plasticity, and dynamic changes in dopamine as a function of reward history – confer decision-making advantages relative to canonical RL models lacking these properties.
In so doing, this analysis provides a new lens on various findings regarding how learning and decision-making are altered across species as a function of manipulations of (or individual differences within) the BG and DA systems.

To begin, we focus on bandit learning tasks, where an agent learns to identify and reliably select the option that yields the highest rate of probabilistic reward. We consider how biological properties within the BG allow an agent to effectively explore early (sample options that are currently estimated as unfavorable but are possibly more rewarding) and subsequently better exploit (reliably select the most rewarding action). As we shall see, this entails (1) learning separate ‘actors’ that magnify the relative benefits of alternative options in highly rewarding environments or the relative costs in sparsely rewarding environments and (2) dynamically shifting the contribution of these actors to govern action selection, depending on which is more specialized for the context. We then show how this dynamic biological mechanism can be recruited for risky decision-making, where increased dopamine amplifies the contribution of benefits over costs, leading to riskier choice, whereas lowered dopamine instead amplifies costs over benefits.

In neural network models of such circuitry, the cortex ‘proposes’ candidate actions available for consideration, and the BG facilitates those that are most likely to maximize reward and minimize cost (Frank, 2005; Ratcliff and Frank, 2012; Franklin and Frank, 2015; Gurney et al., 2015; Dunovan and Verstynen, 2016). These models are based on the BG architecture in which striatal medium spiny neurons (MSNs) are subdivided into two major populations that respond in opponent ways to DA (due to differential expression of D1 and D2 receptors; Gerfen, 1992; Burke et al., 2017). Phasic DA signals convey reward prediction errors (Montague et al., 1996; Schultz et al., 1997), amplifying both activity and synaptic learning in D1 neurons, thereby promoting action selection based on reward. Conversely, when DA levels drop, activity is amplified in D2 neurons, promoting learning and choice that minimizes disappointment (Frank, 2005; Iino et al., 2020). See Figure 1A for a visual summary of this opponency.

Figure 1 (with 1 supplement). Overview of OpAL* and dynamics of three-factor Hebbian term. (A) OpAL* architecture. Akin to the original OpAL model (Collins and Frank, 2014), OpAL* is a modified dual actor-critic model where the critic learns action values and generates reward prediction errors (RPEs); the actors use these RPEs to directly learn a policy (i.e., how to behave). For each action, the representation according to one actor (representing the D1 pathway) is strengthened by positive RPEs and weakened by negative RPEs (encoded by dopamine bursts and dips, respectively). In contrast, positive RPEs weaken and negative RPEs strengthen the second actor’s action representations (representing the D2 pathway). Uniquely, OpAL* modulates dopamine levels at the time of choice according to a ‘meta-critic,’ which tracks the value or ‘richness’ of the overall environment according to the agent’s reward history, agnostic to its action history. OpAL* also introduces additional features, such as annealing and normalization, that provide OpAL* with robustness and flexibility but preserve key properties of the OpAL model necessary for capturing empirical data. (B) Schematic of OpAL dynamics with the three-factor Hebbian term.
Nonlinear weight updates due to the Hebbian factor lead to increasing discrimination between high reward probability options in the G actor and between low reward probability options in the N actor. For intermediate dopamine states (G and N actors are balanced), there is equal sensitivity to differences in reward probability across the range of rich and lean environments. For high dopamine states ($\beta_g > \beta_n$), the action policy emphasizes differences in benefits (as represented in the D1/"G" weights), whereas in low dopamine states ($\beta_g < \beta_n$), the action policy emphasizes differences in costs (as represented in the D2/"N" weights). Changes in dopaminergic state (represented by the purple indicators) affect the policy of OpAL due to its nonlinear and opponent dynamics. OpAL* hypothesizes that modulating dopaminergic state by environmental richness is a normative mechanism for flexible weighting of these representations.

Empirically, the BG and DA have been strongly implicated in such motivated action selection and RL across species. For example, in perceptual decisions, striatal D1 and D2 neurons combine information about veridical perceptual data with internal preferences based on potential reward, causally influencing choice toward the more rewarding options (Doi et al., 2020; Bolkan et al., 2022). Further, striatal DA manipulations influence RL (Yttri and Dudman, 2016; Frank et al., 2004; Pessiglione et al., 2006), motivational vigor (Niv et al., 2007; Beeler et al., 2012; Hamid et al., 2016), cost–benefit decisions about physical effort (Salamone et al., 2018), and risky decision-making. Indeed, as striatal DA levels rise, humans and animals are more likely to select riskier options that offer greater potential payout than those with certain but smaller rewards (St Onge and Floresco, 2009; Zalocusky et al., 2016; Rutledge et al., 2015), an effect that has been causally linked to striatal D2 receptor-containing subpopulations (Zalocusky et al., 2016).

However, for the most part, this literature has focused on the finding that DA has opponent effects on D1 and D2 populations and behavioral patterns, rather than on what the computational advantage of this scheme might be (i.e., why). For example, the Opponent Actor Learning (OpAL) model (Collins and Frank, 2014) summarizes the core functionality of the BG neural network models in algorithmic form, capturing a wide variety of findings of DA and D1 vs. D2 manipulations across species (for review, Collins and Frank, 2014; Maia and Frank, 2017). Two distinguishing features of OpAL (and its neural network inspiration), compared to more traditional RL models, are that (1) it relies on opponent D1/D2 actors that separately learn the benefits and costs of actions, rather than a single expected reward value for each action, and (2) learning in these populations is acquired through nonlinear dynamics, mimicking three-factor Hebbian plasticity rules. This nonlinearity causes the two populations to evolve to specialize in discriminating between options of high or low reward value, respectively (Collins and Frank, 2014), as seen in Figure 1B. It is also needed to explain pathological conditions such as learned Parkinsonism, whereby low DA states induce hyperexcitability in D2 MSNs, driving aberrant plasticity and, in turn, progression of symptoms (Wiecki et al., 2009; Beeler et al., 2012). But why would the brain develop this nonlinear opponent mechanism for action selection and learning, and how could (healthy) DA levels be adapted to capitalize on it?
A clue to this question lies in the observation that standard (nonbiological) RL models typically perform worse at selecting the optimal action in ‘lean environments’ with sparse rewards than they do in ‘rich environments’ with plentiful rewards (Collins and Frank, 2014). This asymmetry results from a difference in exploration/exploitation tradeoffs across such environments. In rich environments, an agent can benefit from overall higher levels of exploitation: once the optimal action is discovered, the agent can stop sampling alternative actions, as it is not important to know their precise values. In contrast, in lean environments, choosing the optimal action typically lowers its estimated value (due to sparse rewards), to the point that it can drop below the estimates of even more suboptimal actions. This causes stochastic switching between options until the worst actions are reliably identified and avoided in the long run. Moreover, while in machine learning applications one can simply tune the hyperparameters of an RL model to optimize performance for a given environment, biological agents do not have this luxury: they cannot know in advance whether they are in a rich or lean environment and thus cannot modify hyperparameters accordingly.

In this article, we investigate the utility of nonlinear BG opponency for adaptive behavior in rich and lean environments. We propose a new model, OpAL*, which dynamically adapts its dopaminergic state online as a function of learned reward history (as observed empirically; Hamid et al., 2016; Mohebi et al., 2019). Specifically, OpAL* modulates its dopaminergic state in proportion to its estimate of ‘environmental richness,’ leading to high striatal DA motivational states in rich environments and lower DA states in lean environments with sparse rewards. To do so, it relies on a ‘meta-critic’ that evaluates the richness/sparseness of the environment as a whole. Initially, low confidence in the meta-critic leads the agent to rely equally on both actors, yielding more stochastic choice while the actors learn to specialize. Thereafter, OpAL*’s opponent and nonlinear representations serve to directly and quickly optimize the model’s policy. In contrast, standard RL models that focus on learning the expected values of actions are slow to converge on the best policy, particularly as the number of alternative actions grows.

We demonstrate that the specialization of D1 and D2 pathways in OpAL* for discriminating between low- and high-rewarding options, rather than estimating veridical reward statistics, allows OpAL* to better equate performance in rich and lean environments. This dynamic modulation amplifies whichever of the D1 or D2 actors is best suited to discriminate among the benefits or costs of choice options in the given environment, akin to an ‘efficient coding’ strategy typically studied in the domain of perception (Barlow, 2012; Laughlin, 1981; Chalk et al., 2018). We compare the performance of OpAL* to alternative BG models and to several models typically used in machine learning (Q-learning and upper confidence bound models, the latter of which includes an explicit mechanism intended to optimize exploration). We find that OpAL*, across a wide range of parameter settings, exhibits robust advantages over these alternatives across a range of environments with varying reward rates and complexity levels.
This advantage depends on opponency, nonlinearity, and adaptive DA modulation, and it is most prominent in lean environments with large action spaces, an ecologically probable setting that requires more adaptive navigation of the explore–exploit tradeoff, as outlined above. OpAL* also addresses limitations of the original OpAL model highlighted by Möller and Bogacz, 2019, while retaining the key properties needed to capture a range of empirical data and afford the normative advantages.

Finally, we apply OpAL* to capture a range of empirical data across species, including how risk preference changes as a function of D2 MSN activity and manipulations, effects that are not explainable by monolithic RL systems even when made sensitive to risk (Zalocusky et al., 2016). In humans, we show that OpAL* can reproduce patterns in which dopaminergic drug administration selectively increases risky choices for gambles with potential gains (Rutledge et al., 2015). Moreover, we show that even in the absence of biological manipulations, OpAL* accounts for recently described economic choice patterns as a function of environmental richness. In particular, we simulate data showing that when offered the very same choice between safe and risky options, humans are more likely to gamble when that offer is presented in the context of a richer reward distribution (Frydman and Jin, 2021). Similarly, we show that the normative objective for policy optimization in OpAL*, while in general facilitating adaptive behavior and transitive preferences, can lead to irrational preferences when options appear in novel contexts that differ in reward richness from that of initial learning, as observed empirically (Palminteri et al., 2015). Taken together, our simulations provide a clue to the normative function of the biology of RL, which differs from that assumed by standard models and gives rise to variations in risky decision-making.

OpAL overview

Before introducing OpAL*, we first provide an overview of the original OpAL model (Collins and Frank, 2014), an algorithmic model of the BG whose dynamics mimic the differential effects of dopamine in the D1/D2 pathways described above. OpAL is a modified ‘actor-critic’ architecture (Sutton and Barto, 2018). In the standard actor-critic, the critic learns the expected value of an action from rewards and punishments and reinforces the actor to select those actions that maximize rewards. Specifically, after selecting an action $a$, the agent experiences a reward prediction error $\delta$ signaling the difference between the reward received ($R$) and the critic’s learned expected value of the action, $V_t(a)$, at time $t$:

(1) $\delta_t = R_t - V_t(a)$

(2) $V_{t+1}(a) = V_t(a) + \alpha_c \times \delta_t$

where $\alpha_c$ is the critic learning rate. The prediction error generated by the critic is then also used to train the actors.

OpAL is distinguished from a standard actor-critic in two critical ways, motivated by the biology summarized above. First, it has two separate opponent actors: one promoting selection (‘Go’) of an action $a$ in proportion to its relative benefit over alternatives, and the other suppressing selection of that action (‘No Go’) in proportion to its relative cost (or disappointment). (See Supplemental note 1 in Appendix 2.) Second, the update rule in each of these actors contains a three-factor Hebbian rule such that weight updating is proportional not only to learning rates and RPEs (as in standard RL) but is also scaled by $G_t$ and $N_t$ themselves.
In particular, positive RPEs conveyed by phasic DA bursts strengthen the G (D1) actor and weaken the N (D2) actor, whereas negative RPEs weaken the D1 actor and strengthen the D2 actor:

(3) $G_{t+1}(a) = G_t(a) + \alpha_G\, G_t(a) \times \delta_t$

(4) $N_{t+1}(a) = N_t(a) + \alpha_N\, N_t(a) \times (-\delta_t)$

where $\alpha_G$ and $\alpha_N$ are learning rates controlling the degree to which D1 and D2 neurons adjust their synaptic weights with each RPE. We refer to the $G_t$ and $N_t$ terms that multiply the RPE in the update as the ‘Hebbian term’ because weight changes grow with activity in the corresponding G and N units. As such, the G weights grow to represent the benefits of candidate actions (those that yield positive RPEs more often, thereby making them yet more eligible for learning), whereas the N weights grow to represent the costs or likelihood of disappointment (those that yield negative RPEs more often). The resulting nonlinear dynamics capture biological plasticity rules in neural networks, where learning depends on dopamine ($\delta_t$), presynaptic activation in the cortex (the proposed action $a$ is selectively updated), and postsynaptic activation in the striatum ($G_t$ or $N_t$) (Frank, 2005; Wiecki et al., 2009; Beeler et al., 2012; Gurney et al., 2015; Frémaux and Gerstner, 2015; Reynolds and Wickens, 2002). Incorporation of this Hebbian term prevents redundancy in the D1 vs. D2 actors and confers additional flexibility, as described in the next section. It is also necessary for capturing a variety of behavioral data, including those associated with pathological aberrant learning in DA-elevated and DA-depleted states, whereby heightened striatal activity in either pathway amplifies learning that escalates over experience (Wiecki et al., 2009; Beeler et al., 2012; Collins and Frank, 2014). As we shall see in the ‘Mechanism’ section below, this same property allows the actors to better represent the probabilistic history of outcomes at the low and high ranges.

For action selection (decision-making), OpAL combines $G_t(a)$ and $N_t(a)$ into a single action value, $Act_t(a)$, where the contributions of the opponent actors are weighted by corresponding gains $\beta_g$ and $\beta_n$:

(5) $Act_t(a) = \beta_g G_t(a) - \beta_n N_t(a)$

(6) $\beta_g = \beta (1 + \rho)$

(7) $\beta_n = \beta (1 - \rho)$

Here, $\rho$ reflects the dopaminergic state controlling the relative weighting of $\beta_g$ and $\beta_n$, and $\beta$ is the overall softmax gain. Higher $\beta$ values correspond to greater exploitation, while $\beta = 0$ would generate random choice independent of learned values. When $\rho = 0$, the dopaminergic state is ‘balanced’ and the two actors G and N (and hence learned benefits and costs) are equally weighted during choice. If $\rho > 0$, benefits are weighted more than costs, and vice versa if $\rho < 0$. While the original OpAL model assumed a fixed, static $\rho$ per simulated agent to capture individual differences or pharmacological manipulations, below we augment it to include dynamic changes in dopaminergic state, so that $\rho$ can evolve over the course of learning to optimize choice. The actor then selects actions based on their relative action propensities, using a softmax decision rule, such that the agent selects those actions that yield the most frequent positive RPEs:

(8) $p(a) = \dfrac{e^{Act_t(a)}}{\sum_{i \in A} e^{Act_t(i)}}$

Nonlinear OpAL dynamics support amplification of action-value differences

After learning, G and N weights correlate positively and negatively with expected reward, with the appropriate ordinal rankings of each action preserved in the combined action value $Act$ (Collins and Frank, 2014).
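For concreteness, the following minimal Python sketch implements the update and choice rules above (Equations 1–8) with a fixed dopaminergic state. The function and variable names are ours rather than the authors' reference implementation, and OpAL*'s additions (meta-critic, annealing, normalization) are omitted. Running it on a bandit with rich and lean options illustrates the specialization shown in Figure 1B: G weights tend to discriminate better among frequently rewarded options, and N weights among sparsely rewarded ones.

```python
import numpy as np

def simulate_opal(reward_probs, n_trials=100, alpha_c=0.1, alpha_g=0.1,
                  alpha_n=0.1, beta=1.0, rho=0.0, seed=0):
    """Minimal sketch of the original OpAL model (Equations 1-8); illustrative only."""
    rng = np.random.default_rng(seed)
    n = len(reward_probs)
    V = np.zeros(n)                       # critic values
    G = np.ones(n)                        # D1 ("Go") actor weights
    N = np.ones(n)                        # D2 ("NoGo") actor weights
    beta_g = beta * (1 + rho)             # Eq 6 (static rho, as in original OpAL)
    beta_n = beta * (1 - rho)             # Eq 7
    for t in range(n_trials):
        act = beta_g * G - beta_n * N     # Eq 5
        p = np.exp(act - act.max())
        p /= p.sum()                      # Eq 8 (softmax policy)
        a = rng.choice(n, p=p)
        r = float(rng.random() < reward_probs[a])   # 1 = reward, 0 = omission
        delta = r - V[a]                  # Eq 1 (reward prediction error)
        V[a] += alpha_c * delta           # Eq 2 (critic update)
        G[a] += alpha_g * G[a] * delta    # Eq 3 (three-factor Hebbian update)
        N[a] += alpha_n * N[a] * -delta   # Eq 4
    return V, G, N

V, G, N = simulate_opal([0.8, 0.7, 0.3, 0.2])
# On average, G separates the 0.8 vs. 0.7 options more than N does,
# while N separates the 0.2 vs. 0.3 options more than G does (cf. Figure 1B).
print("G:", G.round(2), "N:", N.round(2))
```

Note that without the stabilizing modifications introduced below (see ‘Normalization and annealing’), running such a sketch for many more trials exhibits the decay of the G and N weights highlighted by Möller and Bogacz, 2019.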
However, with extensive learning (particularly after the critic converges), the Hebbian term induces instability and decay in the G and N representations, such that they eventually converge to zero (Möller and Bogacz, 2019). OpAL* addresses this issue by adjusting learning rates as a function of uncertainty, stabilizing learned actor weights while preserving their ability to flexibly adapt to change points, and by normalizing the prediction error (see ‘Normalization and annealing’ in the next section for a full discussion). These adjustments enable us to preserve the Hebbian contribution, which was previously found to be necessary for capturing a range of empirical data (Collins and Frank, 2014).

Importantly for those findings and for the findings in this article, the Hebbian term produces nonlinear dynamics in the two actors such that they are not redundant and instead specialize in discriminating between different reward probability ranges (Figure 1B). While the G actor shows greater discrimination among frequently rewarded actions, the N actor learns greater sensitivity among actions with sparse reward. Note that if the G and N actors are weighted equally in the choice function ($\rho = 0$), the resultant choice preference is invariant to translations across levels of reward, exhibiting identical discrimination between a 90% and an 80% option as between an 80% and a 70% option. This ‘balanced’ OpAL model therefore effectively reduces to a standard nonopponent actor-critic RL model and, as such, fails to capitalize on the underlying specialization of the actors during ongoing learning. We considered the possibility that such specialization could be leveraged dynamically to amplify a given actor’s contribution when it is most sensitive, akin to an ‘efficient coding’ strategy applied to decision-making (Frydman and Jin, 2021).

OpAL*

Given the differential specialization of the G vs. N actors, we considered whether the agent’s online estimate of environmental richness (reward rate) could be used to control dopaminergic states (as seen empirically; Hamid et al., 2016; Mohebi et al., 2019). Due to its opponent effects on D1 vs. D2 populations, such a mechanism would differentially and adaptively weight G vs. N actor contributions to the choice policy. To formalize this hypothesis, we constructed OpAL*, which uses an online estimate of environmental richness to dynamically amplify the contribution of the actor theoretically best specialized for the environment type.

To provide a robust estimate of reward probability in a given environment, OpAL* uses a ‘meta-critic,’ so named because it evaluates the reward value of the environment as a whole given the agent’s overall choice history (i.e., its policy to that point), rather than that of any particular state or action. The meta-critic summarizes the contributions of various inputs that may regulate the DA system, including those from cortical sources such as orbitofrontal cortex and anterior cingulate, regions that have access not only to mean reward values but also to their confidence (Kepecs et al., 2008). Notably, these regions also project to striatal cholinergic cells conveying information about environmental state (Stalnaker et al., 2016). These cholinergic cells in turn locally regulate striatal DA release (Adrover et al., 2020; Threlfell et al., 2012; Reynolds et al., 2022) in proportion to reward history (Mohebi et al., 2019), and may be sensitive to uncertainty (Franklin and Frank, 2015).
As such, the meta-critic is represented as a beta distribution that estimates $\hat{p}_t(r)$, the reward probability of the environment as a whole (i.e., over all states and actions), or ‘context value.’ This distribution can be updated by keeping a running count of the outcomes (i.e., rewards and omissions) on each trial and adding them to the hyperparameters $\eta$ and $\gamma$, respectively:

(9) $\eta^c_{t+1} = \eta^c_t + R_t$

(10) $\gamma^c_{t+1} = \gamma^c_t + (1 - R_t)$

(11) $X \sim \mathrm{Beta}(\eta^c_t, \gamma^c_t)$

(12) $\hat{p}_t(r) = E[X]$

The dopaminergic state $\rho$ is then increased when $\hat{p}_t(r) > 0.5$ (rich environment) and decreased when $\hat{p}_t(r) < 0.5$ (lean environment). To ensure that dopaminergic states accurately reflect environmental richness, we apply a conservative rule and modulate $\rho$ only when the meta-critic is sufficiently ‘confident’ that the reward rate is above or below 0.5; that is, we take into account not only the mean but also the variance of the beta distribution, parameterized by $\phi$ (Equation 13; for simplicity, we used $\phi = 1.0$ for all simulations). This process is akin to performing inference over the most likely environmental state to guide DA. (See Supplemental note 2 in Appendix 2.) Lastly, a constant $k$ controls the strength of the modulation (Equation 14).

(13) $S = \begin{cases} 1 & \text{if } E[X] - \phi\,\mathrm{std}(X) > 0.5 \\ 1 & \text{if } E[X] + \phi\,\mathrm{std}(X) < 0.5 \\ 0 & \text{otherwise} \end{cases}$

(14) $\rho_t = S \times (E[X] - 0.5) \times k, \quad k \geq 0$

To illustrate why nonlinearity is necessary for dopamine modulation to be impactful, we plotted how $Act$ values change as a function of reward probability for different DA levels (represented as different colors in Figure 2). While $Act$ values increase monotonically with reward probability, the convexity in the underlying G and N weights (Figure 1B) gives rise to stronger $Act$ discrimination between more rewarding options (e.g., 80% vs. 70%) at higher dopamine levels. Conversely, $Act$ discrimination between less rewarding options (e.g., 30% vs. 20%) is enhanced at lower dopamine levels. Thus, high DA amplifies the G actor’s contributions to choice, increasing the action gap for high-probability options, whereas low DA amplifies the N actor’s contributions, increasing the action gap for low-probability options. As the Bayesian meta-critic converges on an estimate of environmental richness, OpAL* can adapt its policy to dynamically emphasize the most discriminative actor and appropriately enhance the ‘action gap’ (the $Act$ difference between the optimal and second-best option) to optimize the policy (Figure 2, left). In contrast, a variant that lacks the nonlinearity (No Hebb) induces redundancy in the G and N weights and thus essentially reduces to a standard actor-critic agent. As such, dopamine modulation does not change its discrimination performance across environments, and the action gap for choice remains fixed (Figure 2, right; Figure 1—figure supplement 1).
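The meta-critic and dopaminergic-state rule described above (Equations 9–14) can be sketched minimally as follows. The class and parameter names are ours, the uniform Beta(1,1) prior is an assumption (the text does not specify initial hyperparameters), and the value of k is purely illustrative.

```python
import numpy as np

class MetaCritic:
    """Beta-distribution estimate of environmental richness (Equations 9-12)."""
    def __init__(self, eta=1.0, gamma=1.0):   # Beta(1,1) prior: an assumption
        self.eta = eta        # pseudo-count of rewards
        self.gamma = gamma    # pseudo-count of reward omissions

    def update(self, r):      # r is 1 (reward) or 0 (omission)
        self.eta += r         # Eq 9
        self.gamma += 1 - r   # Eq 10

    def mean(self):           # Eq 12: p_hat(r) = E[X]
        return self.eta / (self.eta + self.gamma)

    def std(self):            # standard deviation of Beta(eta, gamma)
        a, b = self.eta, self.gamma
        return np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

def dopaminergic_state(mc, k=1.0, phi=1.0):   # k is illustrative; k >= 0
    """Equations 13-14: modulate rho only when the meta-critic is confident."""
    m, s = mc.mean(), mc.std()
    S = 1.0 if (m - phi * s > 0.5) or (m + phi * s < 0.5) else 0.0   # Eq 13
    return S * (m - 0.5) * k                                         # Eq 14

# Usage: update once per trial with the obtained outcome, then recompute rho.
mc = MetaCritic()
for r in [1, 1, 0, 1, 1, 1]:          # a hypothetical reward history (rich environment)
    mc.update(r)
print(dopaminergic_state(mc))          # positive rho once E[X] - std(X) exceeds 0.5
```

In a full OpAL* agent, the resulting $\rho_t$ would replace the static $\rho$ of Equations 6–7, subject to the lower bound introduced in the ‘Choice’ subsection below.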
Figure 2. OpAL* capitalizes on the convexity of actor weights induced by nonlinearity, allowing adaptation to different environments. $Act$ values are generated by presenting each model with a bandit of fixed reward probability for 100 trials; curves are averaged over 5000 simulations. Left: nonlinearity in the OpAL* update rule induces convexity in $Act$ values as a function of reward probability (due to stronger contributions of G weights with higher rewards, and stronger contributions of N weights with sparse reward). OpAL* dynamically adjusts its dopaminergic state over the course of learning as a function of its estimate of environmental richness (indicated by the elongated purple bars), allowing it to traverse different $Act$ curves (high dopamine [DA], in green, emphasizes the G actor; low DA, in orange, emphasizes the N actor). (Note that the agent’s meta-critic first needs to become confident in its estimate of the environment’s reward richness during initial exploration before it adjusts DA to appropriately exploit the convexity.) Thereafter, OpAL* can differentially leverage the convexity in the G or N weights, outperforming a ‘balanced’ OpAL+ model (in yellow), which weighs the two actors equally (due to static DA). Vertical bars show that discrimination (i.e., the action gap) between the 80% and 70% actions is enhanced in the high DA state, whereas discrimination between the 20% and 30% actions is amplified in the low DA state. Right: due to redundancy in the No Hebb representations, policies are largely invariant to dopaminergic modulation over the course of learning.

Choice

To accommodate varying levels of $k$ and maintain biological plausibility, the contribution of each actor is lower-bounded by zero: the G and N actors can be suppressed but cannot be inverted (firing rates cannot go below zero), while still allowing graded amplification of the other subpopulation.

(15) $Act_t(a) = \beta_g G_t(a) - \beta_n N_t(a)$

(16) $\beta_g = \beta \max(0, 1 + \rho_t)$

(17) $\beta_n = \beta \max(0, 1 - \rho_t)$

Normalization and annealing

The original three-factor Hebbian rule presented in Collins and Frank, 2014 approximates the learning dynamics of the neural circuit models needed to capture the associated data and also confers flexibility, as described above. However, it is also susceptible to instabilities, as highlighted by Möller and Bogacz, 2019. Specifically, because weight updating scales with the G and N values themselves, large reward magnitudes or oscillating prediction errors (due to critic convergence) can cause the weights to decay rapidly toward zero (see Appendix 1 section ‘Addressing’; Möller and Bogacz, 2019). To address this issue, OpAL* introduces two additional modifications based on both functional and biological considerations: annealing of the actor learning rates as a function of uncertainty, and normalization of the prediction error.
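To connect the dynamic dopaminergic state to action selection, here is a minimal sketch of the OpAL* choice rule with lower-bounded gains (Equations 15–17), combined with the softmax of Equation 8. The weights and ρ values are illustrative (not taken from the paper's simulations), and in the full model $\rho_t$ would be supplied by the meta-critic sketched earlier.

```python
import numpy as np

def opal_star_act(G, N, rho_t, beta=1.0):
    """OpAL* action values with nonnegative actor gains (Equations 15-17)."""
    beta_g = beta * max(0.0, 1.0 + rho_t)   # Eq 16: amplifies G when rho_t > 0
    beta_n = beta * max(0.0, 1.0 - rho_t)   # Eq 17: amplifies N when rho_t < 0
    return beta_g * np.asarray(G) - beta_n * np.asarray(N)   # Eq 15

def softmax_policy(act):
    """Softmax over Act values (Equation 8)."""
    e = np.exp(act - np.max(act))
    return e / e.sum()

# Illustrative learned weights for three options:
G = [1.4, 1.2, 0.8]   # benefits (D1 actor)
N = [0.7, 0.9, 1.3]   # costs (D2 actor)
print(softmax_policy(opal_star_act(G, N, rho_t=+1.5)).round(2))  # high DA: benefits dominate
print(softmax_policy(opal_star_act(G, N, rho_t=-1.5)).round(2))  # low DA: costs dominate
```

When $\rho_t$ exceeds 1 in magnitude, one actor's gain is clamped at zero, so the policy is driven entirely by the other actor, while intermediate values yield graded mixtures of the two.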
