Abstract

In this work, we investigate the use of deep approximate policy iteration (DAPI) to estimate the optimal action-value function Q* in reinforcement learning, with rectified linear unit (ReLU) ResNets as the underlying function class. Each DAPI iteration applies the minimax average Bellman error minimization principle, using a ReLU ResNet to estimate the fixed point of the Bellman equation associated with the estimated greedy policy. Through error propagation, we derive nonasymptotic error bounds between Q* and the estimated Q function induced by the greedy policy output by DAPI. To control the Bellman residual error, we address both the statistical and approximation errors arising from the α-mixing dependent data generated by Markov decision processes, using empirical process techniques and deep approximation theory, respectively. Furthermore, we present a novel generalization bound for ReLU ResNets under dependent data, as well as an approximation bound for ReLU ResNets within the Hölder class. Notably, this approximation bound improves the dependence on the ambient dimension from exponential to polynomial. The derived nonasymptotic error bounds depend explicitly on the sample size, the ambient dimension (polynomially), and the width and depth of the neural networks. Consequently, these bounds provide theoretical guidance for setting the hyperparameters so as to achieve the desired convergence rate when training DAPI.
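For context, a schematic of the two objects referenced above may be helpful: the Bellman optimality equation whose fixed point is Q*, and a minimax average Bellman error objective for evaluating a candidate Q under a fixed greedy policy. The function class \(\mathcal{F}\), test-function class \(\mathcal{G}\), and absence of regularization below are illustrative placeholders, not the paper's exact formulation.

\[
Q^*(s,a) \;=\; \mathbb{E}\big[\, r(s,a) + \gamma \max_{a' \in \mathcal{A}} Q^*(s',a') \,\big|\, s,a \,\big],
\qquad s' \sim P(\cdot \mid s,a),
\]
\[
\widehat{Q}^{\pi} \;\in\; \operatorname*{arg\,min}_{Q \in \mathcal{F}} \; \max_{g \in \mathcal{G}} \;
\frac{1}{n} \sum_{i=1}^{n} \Big( Q(s_i,a_i) - r_i - \gamma\, Q\big(s_i', \pi(s_i')\big) \Big)\, g(s_i,a_i),
\]

where \(\{(s_i,a_i,r_i,s_i')\}_{i=1}^{n}\) are (possibly dependent) transitions from the Markov decision process and \(\pi\) is the current estimated greedy policy; the next greedy policy is then taken with respect to \(\widehat{Q}^{\pi}\).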
