Abstract

Satisfying safety constraints is the top priority in safe reinforcement learning (RL). However, without proper exploration, training can produce an overly conservative policy, such as one that freezes in place. To this end, we utilize maximum entropy RL methods for exploration. In particular, an RL method with Tsallis entropy maximization, called Tsallis actor-critic (TAC), is used to synthesize policies that explore with more promising actions. In this paper, we propose a Tsallis entropy-regularized safe RL method for safer exploration, called SafeTAC. For more expressiveness, we extend TAC to use a Gaussian mixture model policy, which improves safety performance. To stabilize the training process, Retrace estimators for the safety critics are formulated, and a safe policy update rule based on a trust region method is proposed.
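
As background for the Tsallis entropy regularization mentioned above, the sketch below shows the generic form of a Tsallis-entropy-regularized RL objective under the common q-logarithm convention; the symbols γ (discount factor), α (entropy coefficient), and q (entropic index) are standard notation and the exact formulation used in SafeTAC may differ.

\[
J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t}\big(r(s_t, a_t) + \alpha\, S_q\big(\pi(\cdot \mid s_t)\big)\big)\Big],
\qquad
S_q\big(\pi(\cdot \mid s)\big) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[-\ln_q \pi(a \mid s)\big],
\qquad
\ln_q x = \frac{x^{q-1} - 1}{q - 1} \ \ (q \neq 1).
\]

As q → 1, ln_q recovers the natural logarithm and the objective reduces to the usual Shannon-entropy-regularized (maximum entropy) RL objective.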
