Abstract

Virtual reality satellites give people an immersive experience of exploring space. Intelligent attitude control using reinforcement learning to achieve multi-axis synchronous control is one of the key tasks of virtual reality satellites. In real-world systems, reinforcement learning methods face safety issues during exploration, unknown actuator delays, and noise in the raw sensor data. To improve sample efficiency and avoid safety issues during exploration, this paper proposes a new offline reinforcement learning method that makes full use of the available samples. The method learns a policy set with imitation learning and a policy selector with a generative adversarial network (GAN). Its performance was verified on a real-world system (a reaction-wheel-based inverted pendulum): the agent trained with our method reached and maintained a stable goal state for 10,000 steps, whereas the behavior cloning baseline remained stable for only 500 steps.

Highlights

  • Virtual reality satellites enable people to explore space using any mobile, desktop, or virtual reality device

  • For a state s, the policy set π1, π2, ⋯, πK produces the output action vector A = (a1, a2, ⋯, aK), where ai (i = 1, 2, ⋯, K) is the action proposal given by the i-th policy; the policy selector takes (s, a1), (s, a2), ⋯, (s, aK) as inputs and outputs the scoring vector V = (D(s, a1), D(s, a2), ⋯, D(s, aK)), which judges whether each state–action pair is an expert behavior (see the sketch after this list)

  • Analysis of the results showed that the proposed method achieved better performance because the expert trajectories were divided so that each policy in the policy set was trained on a different set of initial states
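
A minimal Python/NumPy sketch of the selection step described in the second highlight follows. The policies and the discriminator here are random placeholders standing in for the imitation-learned policy set and the GAN discriminator, and all dimensions are assumptions for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, K = 6, 3, 4  # assumed dimensions, not from the paper

# Stand-ins for the K imitation-learned policies (random linear maps here).
policy_weights = [rng.normal(size=(ACTION_DIM, STATE_DIM)) for _ in range(K)]

def policy(i, s):
    """Action proposal a_i of the i-th policy for state s."""
    return np.tanh(policy_weights[i] @ s)

def discriminator(s, a):
    """Placeholder for the GAN discriminator D(s, a); higher = more expert-like."""
    z = np.concatenate([s, a]).sum()  # trivial scoring rule for illustration
    return 1.0 / (1.0 + np.exp(-z))

def select_action(s):
    """Score each proposal (s, a_i) with D and execute the best-scoring one."""
    proposals = [policy(i, s) for i in range(K)]       # A = (a_1, ..., a_K)
    scores = [discriminator(s, a) for a in proposals]  # V = (D(s, a_1), ..., D(s, a_K))
    return proposals[int(np.argmax(scores))]

s = rng.normal(size=STATE_DIM)
print(select_action(s))  # action executed for this state
```

Read together with the third highlight, the intent appears to be that, since each policy was trained on a different subset of initial states, the discriminator effectively routes each state to the policy whose training distribution covers it.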

Summary

Introduction

Virtual reality satellites enable people to explore space using any mobile, desktop, or virtual reality device. Model-free reinforcement learning methods often require a large number of samples to improve the policy during training iterations, and their low sampling efficiency leads to high collection times and costs. In terms of how to use a learned model effectively, the first type of method does not account for cumulative model error: it selects only the current action, for example via random shooting, model predictive control (MPC), or the cross-entropy method (CEM), choosing the current optimal action by simulating multiple paths in the learned environment (see the sketch below). This type of method often requires powerful computing resources. The innovations of this paper are as follows. (1) This paper proposes a new imitation learning method to solve the continuous control problem. (2) Through training, a low-level control policy was obtained for a real system with a reaction wheel as the actuator. (3) Using only the onboard raw sensor information, end-to-end control of the multi-motor pulse width modulation (PWM) signals was achieved, completing the control task of balancing on a corner.
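
As a concrete illustration of this first class of model-based action selection, the following Python sketch implements random shooting against a stand-in learned model. The dynamics, reward, horizon, and candidate count are all assumptions for illustration, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
STATE_DIM, ACTION_DIM = 4, 2     # assumed dimensions
HORIZON, N_CANDIDATES = 10, 256  # rollout length / sampled action sequences

def learned_model(s, a):
    """Placeholder learned dynamics f(s, a) -> s' (illustration only)."""
    return 0.9 * s + 0.1 * np.pad(a, (0, STATE_DIM - ACTION_DIM))

def reward(s, a):
    """Placeholder reward: stay near the origin with small actions."""
    return -(s @ s) - 0.01 * (a @ a)

def random_shooting(s0):
    """Sample candidate action sequences, roll each out in the learned model,
    and return only the FIRST action of the best sequence (replanned each step)."""
    best_return, best_first_action = -np.inf, None
    for _ in range(N_CANDIDATES):
        actions = rng.uniform(-1.0, 1.0, size=(HORIZON, ACTION_DIM))
        s, total = s0, 0.0
        for a in actions:
            total += reward(s, a)
            s = learned_model(s, a)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action

print(random_shooting(rng.normal(size=STATE_DIM)))
```

The many simulated rollouts per control step are why this class of methods demands substantial computing resources, which is the drawback the paragraph above points to.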

Methodology
Experiments and Results
A. Policy Sets
Conclusion
