Abstract

Replay of neuronal sequences in the hippocampus during resting states and sleep plays an important role in learning and memory consolidation. Consistent with these functions, replay sequences have been shown to obey current spatial constraints. Nevertheless, replay does not necessarily reflect previous behavior and can construct never-experienced sequences. Here, we propose a stochastic replay mechanism that prioritizes experiences based on three variables: (1) experience strength, (2) experience similarity, and (3) inhibition of return. Using this prioritized replay mechanism to train reinforcement learning agents leads to far better performance than using random replay. Its performance is close to that of the state-of-the-art, but computationally intensive, algorithm by Mattar & Daw (2018). Importantly, our model reproduces diverse types of replay because of the stochasticity of the replay mechanism and experience-dependent differences between the three variables. In conclusion, a unified replay mechanism generates diverse replay statistics and is efficient in driving spatial learning.

Editor's evaluation

This paper proposes a new, biologically realistic, computational model for the phenomenon of hippocampal replay. This is an important study with relevance for a broad audience in neuroscience. The proposed model convincingly simulates various aspects of experimental data discovered in the past. https://doi.org/10.7554/eLife.82301.sa0

Introduction

Humans and other animals continuously make decisions that impact their well-being, be that shortly after emitting a choice or much later. To successfully optimize their behavior, animals must be able to correctly credit choices with resulting consequences, adapt their future behavior, and remember what they have learned. The hippocampus is known to play a critical role in the formation and retrieval of memories, as evidenced by the famous case of patient H.M. (Corkin et al., 1997) and others (Wilson et al., 1995; Rosenbaum et al., 2004). In rats and mice, damage to the hippocampus is known to impair spatial learning and memory (Morris et al., 1982; Deacon et al., 2002).

An important phenomenon linked to learning and memory in the hippocampus is that of ‘replay’ (Buhry et al., 2011). As an animal navigates in an environment, so-called place cells (O’Keefe and Dostrovsky, 1971) in the hippocampus are sequentially activated. Later, during awake resting states and during sleep, compressed reactivation of these sequences can be observed within events of high-frequency neural activity known as sharp wave/ripples (SWRs; Buzsáki, 1989). These sequences preferentially start at the current position of the animal (Davidson et al., 2009) and can occur in the order observed during behavior as well as in the reverse order (Diba and Buzsáki, 2007). Consistent with the proposed function of replay in learning, Widloski and Foster, 2022 showed that in a goal-directed navigation task replay sequences obey spatial constraints when barriers in an environment change on a daily basis. By contrast, other studies suggest that replay is not limited to reflecting the animal’s previous behavior.
For instance, replay can represent shortcuts that the animals had never taken (Gupta et al., 2010). Replay during sleep appeared to represent trajectories through reward-containing regions that the animals had seen, but never explored (Ólafsdóttir et al., 2015). Following a foraging task, replay sequences resembled random walks, that is, their statistics were described by a Brownian diffusion process, even though the preceding behavioral trajectories did not (Stella et al., 2019). How can the hippocampus support the generation of such a variety of replay statistics and how do the sequences facilitate learning?

Here, we address these questions by studying replay and its effect on learning using computational modeling. We adopt the reinforcement learning (RL) framework, which formulates the problem of crediting choices with resulting consequences in terms of an agent interacting with its environment and trying to maximize the expected cumulative reward (Sutton and Barto, 2018). A common way of solving this problem involves learning a so-called value function, which maps pairs of environmental states and actions to expected future rewards, and then choosing actions that yield the highest value. A popular algorithm that can learn such a function is Q-learning, in which the value function is referred to as the Q-function and expected future rewards are referred to as Q-values (Watkins, 1989). The Q-function is updated using the so-called temporal-difference (TD) error, which is computed from an experience’s immediate reward and the Q-function’s estimate of the future reward (see Materials and methods for more details). While RL has traditionally been used to solve technical control problems, it has recently been adopted to model animal (Bathellier et al., 2013) and human (Redish et al., 2007; Zhang et al., 2018) behavior.

RL generally requires many interactions with the environment, which results in slow and inefficient learning. Interestingly, replay of stored experiences, that is, of past interactions with the environment, greatly improves the speed of learning (Lin, 1992). The function of experience replay in RL has been linked to that of hippocampal replay in driving spatial learning (Johnson and Redish, 2005). Mattar and Daw, 2018 proposed a model of hippocampal replay as the replay of the experiences that have the highest ‘utility’ from a reinforcement learning perspective, meaning that the experience which yields the greatest behavioral improvement is reactivated. In their model, the agent’s environment is represented as a so-called grid world, which discretizes space into abstract states between which the agent can move using the four cardinal actions (Figure 1A). During experience replay, experiences associated with the different transitions in the environment can be reactivated by the agent. Their model accounts for multiple observed replay phenomena. However, it is unclear how the brains of animals could compute or even approximate the computations required by Mattar and Daw’s model. The brain would have to perform hypothetical updates to its network for each experience stored in memory, compute and store the utility of each experience, and then reactivate the experience with the highest utility. Since biological learning rules operate on synapses based on neural activity, it appears unlikely that hypothetical updates can be computed and their outcomes stored without altering the network.
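To make the learning rule referred to above concrete, the following minimal sketch (Python) shows a tabular Q-learning update applied to a single experience tuple; replaying a stored experience simply means applying the same update again. The state/action encoding and parameter values are illustrative assumptions, not taken from the paper.

import numpy as np

# Hypothetical sizes and parameters for illustration only.
n_states, n_actions = 25, 4          # e.g. a 5x5 grid world with 4 cardinal actions
alpha, gamma = 0.9, 0.99             # learning rate and discount factor (assumed values)

Q = np.zeros((n_states, n_actions))  # Q-function: expected future reward per (state, action)

def td_update(Q, s, a, r, s_next, terminal=False):
    """Update Q in place using the temporal-difference (TD) error of one experience (s, a, r, s_next)."""
    target = r if terminal else r + gamma * np.max(Q[s_next])
    td_error = target - Q[s, a]      # immediate reward plus bootstrapped estimate minus current estimate
    Q[s, a] += alpha * td_error
    return td_error

# Applying the update to one (possibly replayed) experience tuple:
td_update(Q, s=12, a=1, r=0.0, s_next=13)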
Figure 1. The grid world. (A) An example of a simple grid world environment. An agent can transition between the different states, that is, squares in the grid, by moving in the four cardinal directions depicted by arrowheads. (B) Example trajectories of an agent moving in the grid world. (C) The successor representation (SR) for one state (green frame). Note that the SR depends on the agent’s actual behavior. (D) The default representation (DR) for the same state as in C. In contrast to the SR, the DR does not depend on the agent’s actual behavior and is equivalent to the SR given random behavior.

Here, we introduce a model of hippocampal replay which is driven by experience and environmental structure, and therefore does not require computing the hypothetical update that a stored experience would lead to, if it were reactivated. Especially during early learning, the sequences generated by our proposed mechanism are often the optimal ones according to Mattar and Daw, and our replay mechanism facilitates learning in a series of spatial navigation tasks with a performance that comes close to that of Mattar and Daw’s model. Furthermore, we show that a variety of hippocampal replay statistics emerges from the variables that drive our model. Hence, our model could be seen as an approximation of Mattar and Daw’s model that avoids the computation of hypothetical updates at a small cost to learning performance.

Results

Using structural knowledge and the statistics of experience to prioritize replay

We propose a model for the prioritization of experiences, which we call Spatial structure and Frequency-weighted Memory Access, or SFMA for short. The model was conceived with simplified grid world environments in mind (Figure 1A). Each node in the grid corresponds to an environmental state. During behavior, a reinforcement learning agent transitions between nodes and stores these transitions as experience tuples e_t = (s_t, a_t, r_t, s_t+1). Here, s_t is the current state, a_t is the action executed by the agent, r_t is the reward or punishment received after transitioning, and s_t+1 is the next state. Reactivating stored experiences is viewed as analogous to the reactivation of place cells during replay events in the hippocampus. Assuming an experience e_t has just been reactivated, each stored experience e is assigned a priority rating R(e|e_t) based on its strength C(e), its similarity D(e|e_t) to e_t, and the inhibition I(e) applied to it (Figure 2A):

(1) R(e|e_t) = C(e) D(e|e_t) [1 − I(e)]

Figure 2 (with 3 supplements). Illustration of the Spatial structure and Frequency-weighted Memory Access (SFMA) replay model. (A) The interaction between the variables in our replay mechanism. Experience strength C(e), experience similarity D(e|e_t), and inhibition of return I(e) are combined to form reactivation prioritization ratings. Reactivation probabilities are then derived from these ratings. (B) Experience tuples contain two states: the current state and the next state. In the default mode, the current state of the currently reactivated experience (violet circle) is compared to the current states of all stored experiences (blue arrows) to compute the experience similarity D(e|e_t). In the reverse mode, the current state of the currently reactivated experience is compared to the next states of all stored experiences (red arrows). (C) Example of the similarity of experiences to the currently reactivated experience (green arrow) in an open field for the default and reverse modes. Experience similarity is indicated by the colorbar. In the default mode, the most similar experiences are those at the current state or nearby. In the reverse mode, the most similar experiences are those that lead to the current state.
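As a minimal illustration of how the three variables enter Equation 1, the following sketch stores experience tuples and computes priority ratings. The data structures, similarity values, and initial settings are our own assumptions for illustration; how C(e), D(e|e_t), and I(e) are actually set is described below and in Materials and methods.

import numpy as np
from dataclasses import dataclass

@dataclass
class Experience:
    s: int        # current state s_t
    a: int        # action a_t
    r: float      # reward r_t
    s_next: int   # next state s_t+1

# Per-experience quantities entering Equation 1 (illustrative containers).
memory = [Experience(0, 1, 0.0, 1), Experience(1, 1, 0.0, 2), Experience(2, 1, 1.0, 3)]
C = np.ones(len(memory))      # experience strengths C(e), modulated by frequency of experience and reward
I = np.zeros(len(memory))     # inhibition of return I(e), in [0, 1]

def priority_ratings(D_row, C, I):
    """R(e|e_t) = C(e) * D(e|e_t) * (1 - I(e)); D_row holds the similarities D(e|e_t) of all stored experiences."""
    return C * D_row * (1.0 - I)

D_row = np.array([0.2, 0.6, 1.0])   # hypothetical similarities of the stored experiences to e_t
R = priority_ratings(D_row, C, I)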
The experience strength C(e) is modulated by the frequency of experience and reward. The similarity D(e|e_t) between e and e_t reflects the spatial distance between states, taking into account structural elements of the environment, such as walls and other obstacles. This rules out the use of Euclidean distances. A possible candidate would be the successor representation (SR) (Dayan, 1993). Given the current state s_j, the SR represents, as a vector, the likelihood of visiting the other states s_i in the near future (Figure 1C). This likelihood is commonly referred to as the discounted future occupancy. Since states separated by a barrier are temporally more distant from each other than states that are not, their expected occupancy is reduced more by discounting. However, the SR suffers from two disadvantages. First, the SR depends on the agent’s current behavior, that is, the agent’s policy, which will distort the distance relationships between states unless the agent behaves randomly. Second, if the structure of the environment changes, the SR has to be relearned completely, and this may be complicated further by the first problem. These problems were recently addressed by Piray and Daw, 2021 in the form of the default representation (DR) (Figure 1D). Unlike the SR, the DR does not depend on the agent’s policy, but rather on a default policy, that is, a uniform action policy corresponding to random behavior, and the DR was shown to reflect the distance relationships between states even in the presence of barriers. Importantly, Piray and Daw, 2021 demonstrated that the DR can be efficiently updated using a low-rank update matrix if the environmental structure changes. We chose to base experience similarity on the DR due to these advantageous features.

Since experience tuples contain two states – the current state s_t and the next state s_t+1 – we consider two ways of measuring the similarity between experiences. In the default mode, we compare the current states of two experiences (Figure 2B). In the reverse mode, the current state of the most recently reactivated experience is compared to the next states of the other experiences. We found that the default and reverse modes tend to generate different kinds of sequences. Two more replay modes could be defined in our model, which we will not consider in this study. We explain this choice in the Discussion.

After an experience e has been reactivated, inhibition I(e) is applied to prevent the repeated reactivation of the same experience. Inhibition is applied to all experiences sharing the same starting state s_t, that is, their inhibition is set to I(e) = 1, and it decays in each time step by a factor 0 < λ < 1. Inhibition values are maintained for one replay epoch and are reset when a new replay epoch is initiated. The next experience to be reactivated is chosen randomly according to the reactivation probabilities P(e|e_t), which are computed by applying a customized softmax function to R(e|e_t) (see Materials and methods). If the priority ratings of all experiences fall below a threshold θ = 10^-6, replay is stopped. The definition of one replay epoch is summarized in Algorithm 1. Sequences of replay are produced by iteratively activating and inhibiting individual experiences.
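To make the similarity term concrete, the sketch below computes a DR for a small grid world as the SR under a uniform random policy (consistent with the description above) and reads it out in the default and reverse modes. The grid size, discount factor, and indexing convention are simplifying assumptions; the paper’s exact construction is given in Materials and methods.

import numpy as np

n = 3                      # 3x3 grid world (assumed size)
n_states = n * n
gamma_DR = 0.9             # DR discount factor (assumed value)

def neighbors(s):
    """States reachable from s with the four cardinal actions (staying put at borders)."""
    x, y = s % n, s // n
    cand = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [cy * n + cx if 0 <= cx < n and 0 <= cy < n else s for cx, cy in cand]

# Transition matrix under the default (uniform random) policy.
T = np.zeros((n_states, n_states))
for s in range(n_states):
    for s_next in neighbors(s):
        T[s, s_next] += 0.25

DR = np.linalg.inv(np.eye(n_states) - gamma_DR * T)   # discounted future occupancy under the default policy

def similarity(e_t_state, memory, mode="default"):
    """D(e|e_t): compare e_t's current state to each experience's current state (default) or next state (reverse)."""
    if mode == "default":
        return np.array([DR[e_t_state, e["s"]] for e in memory])
    return np.array([DR[e_t_state, e["s_next"]] for e in memory])

# Experiences abbreviated to the two states used for the similarity computation.
memory = [{"s": 0, "s_next": 1}, {"s": 1, "s_next": 2}, {"s": 4, "s_next": 5}]
D_default = similarity(e_t_state=4, memory=memory, mode="default")
D_reverse = similarity(e_t_state=4, memory=memory, mode="reverse")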
Together, experience strengths and experience similarities guide replay, while inhibition promotes the propagation of sequences (Figure 2—figure supplement 1). To initiate replay, we consider two cases: replay during awake resting states (online replay) is initiated at the agent’s current position, whereas replay during sleep (offline replay) is initiated with an experience randomly selected from memory based on the relative experience strengths. We chose different initialization schemes since awake replay has been reported to start predominantly at the animal’s current position (Davidson et al., 2009). However, there are also non-local sequences in awake replay (Karlsson and Frank, 2009; Gupta et al., 2010; Ólafsdóttir et al., 2017). Non-local replays are generated by our model in offline replay, albeit with a weaker bias for the current position (Figure 2—figure supplement 2A, B). To model awake replay, we could increase the bias in the random initialization by raising the experience strength for experiences associated with the current position (Figure 2—figure supplement 2C–F), while preserving some non-local replays. However, for simplicity we opted to simply initiate awake replay at the current location.

Algorithm 1. Spatial structure and Frequency-weighted Memory Access (SFMA)
Require: e_t (replay initiated)
1: for t = 1:N do
2:   for each experience e do
3:     Compute priority rating R(e|e_t) = C(e) D(e|e_t) [1 − I(e)].
4:   end for
5:   if max R < θ then
6:     Stop replay.
7:   end if
8:   Compute reactivation probabilities P(e|e_t).
9:   Choose next experience e_t+1 to reactivate.
10:  Reduce/decay inhibition for all stored experiences.
11:  Inhibit experience e_t+1.
12: end for
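A compact sketch of one replay epoch as listed in Algorithm 1 is given below. It uses a standard softmax for the reactivation probabilities and illustrative parameter values; the customized softmax and the exact settings used in the paper are described in Materials and methods.

import numpy as np

rng = np.random.default_rng(0)

def softmax(x, beta):
    """Standard softmax with inverse temperature beta (the paper uses a customized variant)."""
    z = beta * (x - np.max(x))
    p = np.exp(z)
    return p / p.sum()

def replay_epoch(memory, C, D, start_idx, n_steps=10, beta=5.0, lam=0.9, theta=1e-6):
    """Generate one replay sequence.
    memory: list of experience dicts with keys 's', 's_next'
    C: experience strengths; D: similarity matrix, D[e_t, e] = D(e|e_t), indexed by experience
    start_idx: initiating experience (agent's position for online replay; for offline replay it
               could be drawn according to relative strengths, e.g. rng.choice(len(C), p=C/C.sum()))
    lam: inhibition decay factor; theta: stopping threshold
    """
    states = np.array([m["s"] for m in memory])
    I = np.zeros(len(memory))                   # inhibition of return, reset at the start of each epoch
    e_t = start_idx
    I[states == memory[e_t]["s"]] = 1.0         # inhibit all experiences sharing the starting state s_t
    sequence = [e_t]
    for _ in range(n_steps):
        R = C * D[e_t] * (1.0 - I)              # priority ratings R(e|e_t), Equation 1
        if R.max() < theta:
            break                               # stop replay when all priorities fall below threshold
        P = softmax(R, beta)                    # reactivation probabilities P(e|e_t)
        e_t = rng.choice(len(memory), p=P)      # choose the next experience to reactivate
        I *= lam                                # decay inhibition for all stored experiences
        I[states == memory[e_t]["s"]] = 1.0     # inhibit the newly reactivated experience's state
        sequence.append(e_t)
    return sequence

# Tiny example with three experiences on a line (D rows are hypothetical similarities).
memory = [{"s": 0, "s_next": 1}, {"s": 1, "s_next": 2}, {"s": 2, "s_next": 3}]
C = np.ones(3)
D = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.6],
              [0.3, 0.6, 1.0]])
print(replay_epoch(memory, C, D, start_idx=0))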
SFMA facilitates spatial learning

We begin by asking what benefit a replay mechanism such as the one implemented in SFMA might have for spatial learning. To do so, we set up three goal-directed navigation tasks of increasing difficulty: a linear track, a square open field, and a labyrinth maze (Figure 3A). Simulations were run for the default and reverse replay modes and compared to an agent trained without experience replay, an agent trained with random experience replay, and the state-of-the-art Prioritized Memory Access (PMA) model by Mattar and Daw, 2018.

Figure 3 (with 4 supplements). The statistics of replay have a large impact on spatial learning. (A) The three goal-directed navigation tasks that were used to test the effect of replay on learning: linear track, open field, and maze. In each trial, the agent starts at a fixed starting location S and has to reach the fixed goal location G. (B) Performance of different agents measured as the escape latency over trials. Shown is the performance for an online agent without replay (black), an agent trained with random replay (blue), our SFMA model (green), and the Prioritized Memory Access (PMA) model by Mattar and Daw, 2018 (red). The results of our SFMA model are further subdivided by replay mode: default (solid), reverse (dash-dotted), and dynamic (dashed). Where the dashed and dash-dotted green lines are not visible, they are overlapped by the red solid line. (C) The reverse and dynamic replay modes produce more optimal replays, while the default replay mode yields pessimistic replays. Shown is the number of optimal updates in the replays generated on each trial for the different replay modes: default (solid), reverse (dash-dotted), and dynamic (dashed). Note that in later trials there is a lack of optimal updates because the learned policy is close to the optimal one and any further updates have little utility. (D) Directionality of replay produced by the default (solid) and reverse (dash-dotted) modes in the three environments. The reverse replay mode produces replays with strong reverse directionality irrespective of when replay was initiated. In contrast, the default mode produces replays with a small preference for forward directionality. After sufficient experience with the environment, the directionality of replays is predominantly forward for replays initiated at the start of a trial and predominantly reverse for replays initiated at the end of a trial.

Using the reverse mode, our model clearly outperformed the agents trained without replay and with random replay (Figure 3B). Importantly, its performance was close to, even if slightly below, that of PMA. The learning performance of SFMA critically depends on the replay mode. The default mode yielded a performance that was much lower than that of the reverse mode and only slightly better than random replay. Considering its low performance, the default mode may appear undesirable; however, it is important for two reasons. First, it generates different types of sequences than the reverse mode (Figure 3D and Figure 3—figure supplement 1), and these sequences are more consistent with some experimental observations. While the reverse mode mostly generates reverse sequences in all situations, the default mode generates forward replay sequences at the beginning of trials and reverse sequences at the end of trials. This pattern of changing replay directions has been observed in experiments in familiar environments (Diba and Buzsáki, 2007). Also, as we will show below, the default mode generates shortcut replays like those found by Gupta et al., 2010, whereas the reverse mode does not. Second, the recent discovery of so-called pessimistic replay, that is, replay of experiences which are not optimal from a reinforcement learning perspective, shows that suboptimal replay sequences occur in human brains (Eldar et al., 2020) for a good reason (Antonov et al., 2022). Such suboptimal replays are better supported by the default mode (Figure 3C).

We therefore suggest that the reverse and default modes play distinct roles during learning. Early in the learning session or following reward changes, when the environment is novel and TD errors are high, the reverse mode is preferentially used to acquire successful behavior quickly (Figure 3—figure supplement 3). Indeed, in our simulations the reverse mode produced more updates that were optimal, that is, that improved the agent’s policy the most, than did the default mode (Figure 3C). The preponderance of reverse replay sequences during early learning is consistent with experimental observations in novel environments (Foster and Wilson, 2006) or after reward changes (Ambrose et al., 2016). Later in the learning session, when the environment has become familiar and TD errors are low, the default mode is preferentially used to make the learned behavior more robust. The default mode then accounts for the interspersed reverse and forward replay sequences observed in familiar environments. We provide a more detailed rationale for the default mode in the Discussion. We call this strategy of switching between the reverse and default modes the dynamic mode of SFMA. Put simply, in the dynamic mode the probability of generating replay with the reverse mode increases with the TD errors accumulated since the last trial (for more details see Materials and methods). The dynamic mode yields a learning performance (Figure 3B) and a number of optimal updates (Figure 3C) similar to those of the reverse mode.
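The dynamic mode is specified here only at this level of detail (the exact mapping is given in Materials and methods). Purely for illustration, the sketch below uses an assumed saturating mapping from the accumulated TD error to the probability of using the reverse mode; the functional form and the scale parameter are our assumptions, and only the qualitative property that this probability increases with accumulated TD error is taken from the text.

import numpy as np

rng = np.random.default_rng(1)

def choose_replay_mode(accumulated_td_error, scale=1.0):
    """Return 'reverse' or 'default'. The saturating form and 'scale' are assumptions; the text
    only states that the probability of the reverse mode grows with the accumulated TD error."""
    p_reverse = 1.0 - np.exp(-accumulated_td_error / scale)
    return "reverse" if rng.random() < p_reverse else "default"

# Early learning or after reward changes: large accumulated TD error -> mostly reverse mode.
# Familiar environment: small accumulated TD error -> mostly default mode.
print(choose_replay_mode(5.0), choose_replay_mode(0.05))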
In the following, we focus on the statistics of the replays that SFMA generates in a range of different experimental settings.

Near-homogeneous exploration of an open environment explains random walk replays

We first investigated the replay statistics that our model produces in a simple environment without navigational goals, similar to the experiment of Stella et al., 2019. To this end, we created a virtual square grid world environment of size 100 × 100 without rewards. For simplicity, we first set all experience strengths to the same value, C(e) = 1, to reflect homogeneous exploration of the environment. All replays were generated using the default mode, since the environment was familiar to the animals in the experiments, but using the reverse mode did not affect the results (Figure 4—figure supplement 1). With this setup, the trajectories represented by the replayed experiences of the agent are visually similar to random walks (Figure 4A). The displacement distribution of replayed locations indicates that replay slowly diffuses from its starting location (Figure 4B and Figure 4—figure supplement 2). To analyze the trajectories more systematically, we used the Brownian diffusion analysis also used by Stella et al., 2019. In this analysis, a random walk is described by a power-law relationship between the average distance Δx between two replayed positions and the time interval Δt, that is, Δx = G Δt^α with α = 0.5. Indeed, the simulated replays exhibit a linear relationship in the log-log plot, indicating a power law between the two variables (Figure 4C), and the slope is close to the theoretical value for a random walk, α = 0.5. This result is robust across a large range of model parameters, the most relevant being the DR’s discount factor γ_DR and the inhibition decay λ, and the range of values in our simulations, α ∈ [0.467, 0.574] (Figure 4D), is a good match to the values reported by Stella et al., 2019 (α ∈ [0.45, 0.53]). The values for the diffusion coefficient (Figure 4E), which relate to the reactivation speed, are similarly robust and are only affected when the decay factor is close to zero. Hence, our model robustly reproduces the experimental findings.

Figure 4 (with 5 supplements). Replays resemble random walks across different parameter values for the default representation (DR) discount factor and inhibition decay. (A) Example replay sequences produced by our model. Reactivated locations are colored according to recency. (B) Displacement distributions for four time steps (generated with β_M = 5). (C) A linear relationship in the log-log plot between the average distance of replayed experiences and the time-step interval indicates a power law. Lines correspond to different values of the DR’s discount factor γ_DR as indicated by the legend. (D) The anomaly parameters (exponent α of the power law) for different values of the DR discount factor and inhibition decay. Faster decay of inhibition, which allows replay to return to the same location more quickly, yields anomaly parameters that more closely resemble a Brownian diffusion process, that is, closer to 0.5. (E) The diffusion coefficients for different values of the DR discount factor and inhibition decay. Slower decay of inhibition yields higher diffusion coefficients.
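The diffusion analysis described above can be sketched as follows: for each time-step interval Δt, average the distance Δx between replayed positions and estimate the anomaly exponent α from the slope of a linear fit in log-log space (Δx = G Δt^α). The random-walk trajectories generated here are synthetic stand-ins for simulated replay sequences, used only to make the analysis runnable.

import numpy as np

rng = np.random.default_rng(2)

# Synthetic 2D trajectories standing in for replayed location sequences.
n_sequences, n_steps = 200, 50
steps = rng.choice([-1, 0, 1], size=(n_sequences, n_steps, 2))
trajectories = np.cumsum(steps, axis=1)                 # replayed positions over time

# Average displacement for each time-step interval dt.
dts = np.arange(1, 20)
mean_dx = []
for dt in dts:
    diffs = trajectories[:, dt:, :] - trajectories[:, :-dt, :]
    mean_dx.append(np.mean(np.linalg.norm(diffs, axis=-1)))

# Power law dx = G * dt**alpha  <=>  log dx = log G + alpha * log dt.
alpha, logG = np.polyfit(np.log(dts), np.log(mean_dx), 1)
print(f"anomaly parameter alpha ~ {alpha:.2f}, diffusion coefficient G ~ {np.exp(logG):.2f}")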
Since the results above were obtained assuming homogeneous exploration of the environment, we repeated our simulations with heterogeneous experience strengths (Figure 4—figure supplement 3B). While the relationship in the log-log plot seems to deviate slightly from linear for very large time-step intervals, the statistics still largely resemble a random walk (Figure 4—figure supplement 3C). The range of values, α ∈ [0.410, 0.539], still covers the experimental observations, albeit shifted toward smaller values. Stella et al. further reported that the starting locations of replay were randomly distributed across the environment and that replay exhibited no preferred direction. Our model reproduces similar results in the case of homogeneous experience strengths (Figure 4—figure supplement 4A, B) and heterogeneous experience strengths (Figure 4—figure supplement 4C, D). The results of our simulations suggest that replay resembling random walks can be accounted for given near-homogeneous exploration of an open-field environment. If exploration is non-homogeneous, the statistics of a random walk hold only for short to medium time-step intervals.

Stochasticity results in shortcut replays following stereotypical behavior

Gupta et al., 2010 provided further evidence that replay does not simply reactivate previously experienced sequences, by showing that replay sequences sometimes represent trajectories that animals were prevented from taking. These so-called shortcut sequences were synthesized from previously experienced trajectories. We constructed a simplified virtual version of the Gupta et al. experiment (Figure 5A) to test whether, and under which conditions, our proposed mechanism can produce shortcut replays. In the experiment, animals exhibited very stereotypical behavior, that is, they ran laps in one direction and were prevented from running back. Therefore, in our model the agent was forced to run one of three predefined patterns: right laps, alternating laps, and left laps (Figure 5B). This allowed us to focus on the effect of specific behavioral statistics on replay. The virtual agent was made to run 20 trials in one session, and replays were simulated after each trial.

Figure 5 (with 3 supplements). Replay of shortcuts results from the stochastic selection of experiences and the difference in relative experience strengths. (A) Simplified virtual version of the maze used by Gupta et al., 2010. The agent was provided with reward at specific locations on each lap (marked with an R). Trials started at the bottom of the center corridor (marked with an S). At the decision point (marked with a D), the agent had to choose to turn left or right in our simulations. (B) Running patterns for the agent used to model the different experimental conditions: left laps, right laps, and alternating laps. Replays were recorded at the reward locations. (C) Examples of shortcut replays produced by our model. Reactivated locations are colored according to recency. (D) The number of shortcut replays pooled over trials for different running conditions and values of the inverse temperature β_M. Conditions: alternating-alternating (AA), right-left (RL), right-alternating (RA), and alternating-left (AL). (E) Learning performance for different replay modes compared to random replay. The agent’s choice depended on the Q-values at the decision point. The agent was rewarded for turning right during the first 100 trials, after which the reward shifted to the left (the red line marks the shift).
In the first 10 trials, the agent ran in one pattern, and in the next 10 trials, the agent used the same or a different pattern. To model Gupta et al.’s study, we ran simulations with the combinations right-left, right-alternating, and alternating-left. In addition, the combination alternating-alternating served as a baseline condition in which the left and right laps had roughly the same amount of experience throughout the simulation. In this case, only the default mode could reproduce the experimental findings, while the reverse mode could not. The main cause lay in the large dissimilarity between the experiences at the decision point (D in Figure 5A) and the experiences to the left and right of it, which in turn drove prioritization values to zero. We found that our model produces shortcut-like replays for each run pattern combination (Figure 5C and D). Shortcuts occurred in higher numbers for the right-left and right-alternating combinations, and mainly in the last 10 trials, in trials when the agent was running a left lap, that is, every trial for right-left and every other trial for right-alternating (Figure 5—figure supplement 1). Across the whole running session, fewer shortcuts occurred for the alternating-alternating and alternating-left combinations. This difference between combinations results from the balance of the experience strengths associated with either of the laps and with the center piece. For the right-alternating and right-left combinations, the experience strengths associated with the right lap and the center piece were similar, and therefore replay initiated at the left lap’s reward location was not biased to reactivate experiences along the center piece. The number of shortcut replays then decreased as the experience strengths associated with the center piece increased relative to those associated with the laps.
