Abstract

Policy gradient, which uses the Monte Carlo method to obtain an unbiased estimate of the parameter gradients, has been widely used in reinforcement learning. One key issue in policy gradient is reducing the variance of this estimate. From a statistical viewpoint, policy gradient with a baseline, a successful variance reduction method for policy gradient, directly applies the control variates method, a traditional variance reduction technique for Monte Carlo estimation, to policy gradient. One problem with the control variates method is that the quality of the estimate depends heavily on the choice of control variates. To address this issue, and inspired by the antithetic variates method for variance reduction, we propose to combine the antithetic variates method with traditional policy gradient for the multi-armed bandit problem, yielding a new policy gradient algorithm called Antithetic-Arm Bandit (AAB). In AAB, the gradient is estimated through coordinate ascent, where at each iteration the gradient of the target arm is estimated by: 1) constructing a sequence of arms that is approximately monotonic in terms of estimated gradients, 2) sampling a pair of antithetic arms over the sequence, and 3) re-estimating the target gradient based on the sampled pair. Theoretical analysis proves that AAB achieves an unbiased and variance-reduced estimate. Experimental results on a multi-armed bandit task show that AAB achieves state-of-the-art performance.
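To make the three steps above concrete, the following is a minimal sketch of antithetic-pair sampling for a softmax-policy bandit, written in the spirit of the abstract rather than as the authors' reference implementation. The softmax policy form, the uniform antithetic pair (u, 1 - u), and the use of running reward estimates to order the arms are illustrative assumptions not specified in the abstract.

```python
import numpy as np

# Illustrative sketch only: antithetic-pair sampling for a softmax-policy
# multi-armed bandit.  Policy form, ordering heuristic, and the (u, 1 - u)
# pairing are assumptions, not details taken from the paper.

rng = np.random.default_rng(0)
K = 10
true_means = rng.normal(0.0, 1.0, K)   # hypothetical reward means
theta = np.zeros(K)                    # softmax policy parameters
q_hat = np.zeros(K)                    # running reward estimates
counts = np.zeros(K)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def antithetic_pair(probs, order):
    """Draw two arms from antithetic uniforms over an ordered arm sequence."""
    cdf = np.cumsum(probs[order])
    cdf /= cdf[-1]
    u = rng.uniform()
    a1 = order[np.searchsorted(cdf, u)]
    a2 = order[np.searchsorted(cdf, 1.0 - u)]  # antithetic counterpart
    return a1, a2

for _ in range(5000):
    probs = softmax(theta)
    # 1) arrange arms so per-arm gradient estimates are roughly monotonic;
    #    running value estimates serve as a crude proxy here.
    order = np.argsort(q_hat)
    # 2) sample a pair of antithetic arms over that sequence.
    arms = antithetic_pair(probs, order)
    # 3) average the two score-function gradient estimates; the negative
    #    correlation within the pair is what reduces the variance.
    grad = np.zeros(K)
    for a in arms:
        r = rng.normal(true_means[a], 1.0)      # sampled reward
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]
        score = -probs                          # d log pi(a) / d theta
        score = score.copy()
        score[a] += 1.0
        grad += 0.5 * r * score
    theta += 0.05 * grad

print("best arm:", true_means.argmax(), "policy favors:", theta.argmax())
```

In this sketch the two arms of each pair are driven by complementary uniforms, so their score-function gradient terms tend to be negatively correlated; averaging them keeps the estimate unbiased while lowering its variance, which is the intuition behind AAB.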
