Abstract

This work explores reinforcement learning (RL) for on-board planning and scheduling of an agile Earth-observing satellite (AEOS). In this formulation of the AEOS scheduling problem, a spacecraft in low-Earth orbit attempts to maximize the weighted sum of targets collected and downlinked. Reinforcement learning refers both to a class of problems and to the associated solution methods, in which an agent learns to map situations to actions that maximize a reward function through repeated interaction with an environment. Reinforcement learning problems are formulated as Markov decision processes (MDPs), which are formalizations of sequential decision-making problems. In this work, the agile EOS scheduling problem is formulated as an MDP in which the number of upcoming imaging targets included in the action space is an adjustable parameter, accounting for clusters of imaging targets with varying priorities. Unlike prior Earth-observing satellite scheduling MDP formulations, this work explores how the size of the action space can be reduced to produce generalized policies that may be executed on board the spacecraft in seconds without sacrificing performance. Monte Carlo tree search (MCTS) and supervised learning are used to train a set of agents with varying numbers of targets in the action space. Monte Carlo tree search is an online search algorithm that was originally developed to solve two-player games but has since been applied to reinforcement learning problems. It solves these problems by simulating interactions with the environment and building an estimate of the state-action value function, which is used to select the next action. In this work, Monte Carlo tree search is used to generate training data, and supervised learning is applied to the state-action value estimates generated by MCTS to solve for a generalized policy, which is used on board the spacecraft to map states to actions. Two backup strategies are explored for MCTS: an incremental averaging operator and a maximization operator. For both backup operators, performance asymptotically increases as the number of targets in the action space approaches the maximum number of available targets. A benchmark is computed with Monte Carlo tree search to determine an upper bound on performance. Furthermore, MCTS is compared to solutions generated by a genetic algorithm. For all numbers of imaging targets in the action space, MCTS demonstrates a 2-5% increase in average reward at 10-20% of the single-core wall-clock time of the genetic algorithm. A search over neural network hyperparameters is presented, and the trained neural networks are shown to approximate the MCTS policy with three orders of magnitude less execution time. Finally, the trained agents and the genetic algorithm are deployed on varying target densities for comparison and to demonstrate robustness to mission profiles outside the training distribution.
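As a rough illustration of the two backup strategies mentioned above, the sketch below shows how a node's state-action value estimate might be updated under an incremental averaging operator versus a maximization operator. This is a minimal sketch, assuming each operator acts on the simulated return of a rollout; the node structure, function names, and update details are illustrative assumptions rather than the paper's implementation.

```python
import math

# Hypothetical MCTS node: per-action visit counts and value estimates.
class Node:
    def __init__(self):
        self.N = {}  # visit count per action
        self.Q = {}  # state-action value estimate per action

def backup_average(node, action, ret):
    # Incremental averaging: Q(s, a) tracks the running mean of the simulated
    # returns observed after taking `action` from this node.
    n = node.N.get(action, 0) + 1
    q = node.Q.get(action, 0.0)
    node.N[action] = n
    node.Q[action] = q + (ret - q) / n

def backup_max(node, action, ret):
    # Maximization: Q(s, a) tracks the best simulated return observed so far,
    # biasing the estimate toward the most promising trajectories.
    node.N[action] = node.N.get(action, 0) + 1
    node.Q[action] = max(node.Q.get(action, -math.inf), ret)
```

Under the averaging operator the estimate converges to the mean rollout return, while the maximization operator favors the best trajectory found so far.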
