Abstract

The growing population of space debris has a critical impact on the space environment, and active multi-debris removal (ADR) mission planning with a maximal-reward objective is therefore receiving increasing attention. Since the goal of Reinforcement Learning (RL) matches the maximal-reward optimization model of ADR, planning can be made more efficient with an appropriate RL scheme and algorithm. In this paper, first, an RL formulation is presented for the ADR mission planning problem, in which all the basic components of the maximal-reward optimization model are recast in the RL scheme. Second, a modified Upper Confidence bounds applied to Trees (UCT) search algorithm is developed for the ADR planning task; it leverages neural-network-assisted selection and expansion procedures to facilitate exploration, and incorporates roll-out simulation in the backup procedure to achieve robust value estimation. The algorithm fits the RL scheme of ADR mission planning and better balances exploration and exploitation. Experimental comparison on three subsets of the Iridium 33 debris cloud data shows that the modified UCT outperforms previously reported results and closely related UCT variants.
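As a rough, non-authoritative sketch of the kind of search step described above, the snippet below combines a prior-weighted (PUCT-style) selection rule with a random roll-out whose return is backed up along the visited path. All names (`UCTNode`, `select_child`, the `env` interface, `c_puct`) are illustrative assumptions, not the paper's actual implementation.

```python
import math
import random

class UCTNode:
    """One search-tree node: visit count, value sum and a neural-network prior."""
    def __init__(self, state, prior=1.0, parent=None, action=None):
        self.state = state
        self.prior = prior          # policy-network probability (assumed interface)
        self.parent = parent
        self.action = action
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.4):
    """PUCT-style selection: exploitation term (Q) plus prior-weighted exploration."""
    def score(child):
        u = c_puct * child.prior * math.sqrt(node.visits) / (1 + child.visits)
        return child.q() + u
    return max(node.children, key=score)

def rollout(env, state, max_depth=20):
    """Random roll-out used to estimate the value of a newly expanded node.
    `env.is_terminal`, `env.legal_actions` and `env.step` are an assumed interface."""
    total, depth = 0.0, 0
    while depth < max_depth and not env.is_terminal(state):
        action = random.choice(env.legal_actions(state))
        state, reward = env.step(state, action)
        total += reward
        depth += 1
    return total

def backup(node, value):
    """Propagate the roll-out return along the path back to the root."""
    while node is not None:
        node.visits += 1
        node.value_sum += value
        node = node.parent
```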

Highlights

  • Space debris in low Earth orbit (LEO) poses serious threats [1], [2] to future on-orbit missions [3]

  • A reinforcement learning framework is proposed for solving the Active (multiple) Debris Removal (ADR) mission planning problem

  • The components of the maximal-reward optimization model are cast into the Reinforcement Learning (RL) scheme (see the sketch after this list)
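One minimal way to cast the ADR planning model as an RL environment is sketched below: the state records which debris remain, the current target, the mission time and the remaining impulse budget; an action selects the next debris; the reward is collected on removal. The attribute layout and the placeholder functions `transfer_cost` and `removal_reward` are assumptions for illustration, not the paper's exact definitions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ADRState:
    """Planner state: debris still to remove, current target, elapsed time, fuel left."""
    remaining: frozenset      # indices of debris not yet removed
    current: int              # debris the OTV is currently at
    time: float               # mission elapsed time [days]
    fuel: float               # remaining delta-v budget [m/s]

def step(state, action, transfer_cost, removal_reward):
    """One deterministic transition: move to debris `action`, pay the time-dependent
    impulse cost, collect the removal reward.  `transfer_cost` and `removal_reward`
    stand in for the time-dependent cost and reward functions of the model."""
    dv, dt = transfer_cost(state.current, action, state.time)
    next_state = ADRState(
        remaining=state.remaining - {action},
        current=action,
        time=state.time + dt,
        fuel=state.fuel - dv,
    )
    # assumed convention: no reward if the transfer exceeds the fuel budget
    reward = removal_reward(action) if next_state.fuel >= 0.0 else 0.0
    return next_state, reward
```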


Summary

INTRODUCTION

Space debris in low Earth orbit (LEO) poses serious threats [1], [2] to future on-orbit missions [3]. In high-level ADR mission planning, the objective is influenced only by the actions taken at each state, which is the core property of a Markov Decision Process (MDP) [30]–[32]; an RL model is therefore applicable to this problem. The dynamic model is assumed to be subject to the J2 perturbation [8], [18], and the drift orbit transfer strategy [10]–[12], [15] is used to evaluate the impulse cost. Under these assumptions, the high-level plan can be solved as a reference for subsequent rendezvous planning and on-orbit operations, and all components of the planning problem are treated as deterministic functions, namely the time-dependent cost function, the state transition function, and the reward function. Here $a_p$ ($a_q$) and $I_p$ ($I_q$) denote the semi-major axis and inclination of debris #p (#q), and $\Delta\Omega_{pq}$ is the RAAN change the OTV must complete within the time interval $t_p$ to $t_q$.
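The drift orbit strategy exploits the fact that, under the J2 perturbation, the RAAN of an orbit precesses at a rate depending on its semi-major axis and inclination, so the transfer vehicle can wait in an intermediate orbit until the natural differential drift closes the RAAN gap. The sketch below uses the standard J2 secular-rate formula with assumed parameter names; it is an illustration, not the paper's cost function.

```python
import math

MU  = 3.986004418e14   # Earth's gravitational parameter [m^3/s^2]
J2  = 1.08262668e-3    # Earth's J2 zonal harmonic
R_E = 6378137.0        # Earth's equatorial radius [m]

def raan_drift_rate(a, i, e=0.0):
    """Secular RAAN drift rate dOmega/dt [rad/s] caused by J2 for an orbit with
    semi-major axis a [m], inclination i [rad] and eccentricity e."""
    n = math.sqrt(MU / a**3)          # mean motion
    p = a * (1.0 - e**2)              # semi-latus rectum
    return -1.5 * J2 * (R_E / p)**2 * n * math.cos(i)

def drift_time(a_p, i_p, a_q, i_q, delta_raan):
    """Rough time [s] for the relative J2 drift between two circular orbits to
    absorb a RAAN difference delta_raan [rad]; assumes constant drift rates."""
    relative_rate = raan_drift_rate(a_p, i_p) - raan_drift_rate(a_q, i_q)
    return abs(delta_raan / relative_rate)
```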

REINFORCEMENT LEARNING FORMULATION
BASIC DEFINITION
MODIFIED UCT ALGORITHM
Algorithm, line 12: pick an action from the set of legal actions according to the policy (see the sketch below)
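A minimal sketch of such a masked selection step is given below, assuming the policy output is a probability per candidate action; the function name and interface are illustrative, not taken from the paper's algorithm.

```python
import random

def pick_legal_action(policy_probs, legal_actions):
    """Sample an action from the policy, restricted (masked) to the legal actions.
    `policy_probs` maps every action to its (unnormalised) policy probability."""
    weights = [policy_probs[a] for a in legal_actions]
    total = sum(weights)
    if total <= 0.0:                      # degenerate policy: fall back to uniform
        return random.choice(legal_actions)
    return random.choices(legal_actions, weights=weights, k=1)[0]
```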
EXPERIMENT
CONCLUSION