Abstract

The interpretability of policies remains an important challenge in Deep Reinforcement Learning (DRL). This paper explores interpretable DRL via representing policy by Differentiable Inductive Logic Programming (DILP) and provides a theoretical and empirical study of DILP-based policy learning from an optimization perspective. We first identified a fundamental fact that DILP-based policy learning should be solved as a constrained policy optimization problem. We then proposed to use Mirror Descent for policy optimization (MDPO) to deal with the constraints of DILP-based policies. We derived the closed-form regret bound of MDPO with function approximation, which is helpful to the design of DRL frameworks. Moreover, we studied the convexity of DILP-based policy to further verify the benefits gained from MDPO. Empirically, we experimented MDPO, its on-policy variant, and 3 mainstream policy learning methods, and the results verified our theoretical analysis.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call