Abstract
In reinforcement learning (RL), dealing with non-stationarity is a challenging issue. However, some domains such as traffic optimization are inherently non-stationary. Causes for and effects of this are manifold. In particular, when dealing with traffic signal control, addressing non-stationarity is key, since traffic conditions change over time and as a function of traffic control decisions taken in other parts of a network. In this paper, we analyze the effects that different sources of non-stationarity have on a network of traffic signals, in which each signal is modeled as a learning agent. More precisely, we study both the effects of changing the context in which an agent learns (e.g., a change in the flow rates it experiences) and the effects of reducing the agent's observability of the true environment state. Partial observability may cause distinct states (in which distinct actions are optimal) to be seen as the same by the traffic signal agents. This, in turn, may lead to sub-optimal performance. We show that the lack of suitable sensors to provide a representative observation of the real state seems to affect performance more drastically than changes to the underlying traffic patterns.
Highlights
Controlling traffic signals is one way of dealing with the increasing volume of vehicles that use the existing urban network infrastructure
Our main goal with the following experiments is to quantify the impact of different causes of non-stationarity on the learning process of a reinforcement learning (RL) agent in traffic signal control
We first conduct an experiment where traffic signals use a fixed control policy, a common strategy when the infrastructure lacks sensors and/or actuators. The results of this experiment are discussed in “Traffic Signal Control under Fixed Policies” and are used to emphasize the problem of lacking a policy that can adapt to different contexts; this experiment serves as a baseline for later comparisons (a minimal sketch of such a fixed-time controller follows this list)
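As a rough illustration of this fixed-policy baseline, the sketch below cycles a signal through a predetermined phase plan on a timer, ignoring observed traffic entirely. The phase names, durations, and horizon are hypothetical choices for illustration, not values taken from the paper.

```python
import itertools

# Hypothetical fixed-time plan: (phase name, green duration in seconds).
# A fixed policy does not depend on the observed traffic state at all.
FIXED_PLAN = [("NS_green", 30), ("EW_green", 30)]

def fixed_policy(simulation_horizon=120):
    """Yield (time, phase) pairs for a signal following a fixed cycle."""
    t = 0
    for phase, duration in itertools.cycle(FIXED_PLAN):
        if t >= simulation_horizon:
            break
        yield t, phase
        t += duration

if __name__ == "__main__":
    for t, phase in fixed_policy():
        print(f"t={t:3d}s -> {phase}")
```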
Summary
Controlling traffic signals is one way of dealing with the increasing volume of vehicles that use the existing urban network infrastructure. RL allows an agent to compute a policy mapping states to actions without requiring an explicit model of the environment. In reinforcement learning (Sutton & Barto, 1998), an agent learns how to behave by interacting with an environment, from which it receives a reward signal after each action. The agent uses this feedback to iteratively learn an optimal control policy π, a function that specifies the most appropriate action to take in each state. We can model RL problems as Markov decision processes (MDPs), described by a set of states S, a set of actions A, a reward function R(s, a, s') → ℝ, and a transition function giving the probability of reaching state s' after taking action a in state s. In an infinite-horizon MDP, the cumulative reward received in the future is typically discounted by a factor γ ∈ [0, 1), so that the expected return remains bounded.
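To make the learning loop described above concrete, the following is a minimal tabular Q-learning sketch for a single traffic signal agent. It is an illustration under assumed details rather than the paper's implementation: the state encoding (discretized queue lengths per approach), the two phase actions, the reward (e.g., negative total queue), the hyperparameters, and the `env` interface are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical setup: the agent observes binned queue lengths on the
# north-south and east-west approaches. Under partial observability,
# distinct true traffic conditions may map to the same observed state.
ACTIONS = [0, 1]                        # 0 = green for NS, 1 = green for EW (assumed)
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1  # assumed learning rate, discount, exploration

Q = defaultdict(lambda: [0.0 for _ in ACTIONS])

def choose_action(state):
    """Epsilon-greedy action selection over the tabular Q-values."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    values = Q[state]
    return max(ACTIONS, key=lambda a: values[a])

def q_update(state, action, reward, next_state):
    """One-step Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + GAMMA * max(Q[next_state])
    Q[state][action] += ALPHA * (td_target - Q[state][action])

def train(env, episodes=100):
    """Interaction loop; `env` is a placeholder traffic simulator exposing
    reset() -> state and step(action) -> (next_state, reward, done)."""
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = choose_action(state)
            next_state, reward, done = env.step(action)  # reward: e.g., negative queue length
            q_update(state, action, reward, next_state)
            state = next_state
```

In this sketch, non-stationarity would show up as changes in the flow rates behind `env.step`, while reduced observability would correspond to a coarser state encoding that merges distinct traffic conditions into one table entry.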