Abstract

Recent years have witnessed a renewed interest in reinforcement learning (RL) due to the rapid growth of the Internet-of-Things (IoT) and its associated demands for intelligent information processing and decision-making. As slow learning speed is one of the major stumbling blocks of classic RL algorithms, substantial effort has been devoted to developing faster RL algorithms. Among them, post-decision state (PDS) learning is a prominent one, which can often improve the learning speed by orders of magnitude by exploiting the structural properties of the underlying Markov decision processes (MDPs). However, conventional PDS learning requires prior information about the PDS transition probability, which may not always be available in practice. To remove this limitation, a novel blind PDS (b-PDS) learning algorithm is proposed in this work by leveraging the generic two-timescale stochastic approximation framework. By introducing an additional procedure for estimating the PDS transition probability, b-PDS learning can achieve an improvement in learning speed similar to that of conventional PDS learning while eliminating the need for prior information. In addition, by analyzing the globally asymptotically stable equilibrium of the corresponding ordinary differential equation (o.d.e.), the convergence and optimality of b-PDS learning are established. Moreover, extensive simulation results are provided to validate the effectiveness of the proposed algorithm. Over the considered random MDPs, it has been observed that, to reach 90% of the best possible time-average reward, the proposed b-PDS learning can reduce the learning time by 70% compared to Q-learning and by 30% compared to Dyna.
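The core idea of b-PDS learning, namely coupling an online estimate of the PDS transition probability with a stochastic-approximation update of the PDS value function on two different timescales, can be illustrated with a simple tabular sketch. The sketch below is illustrative only: the environment interface (`env.reset()`, `env.step(s, a)` returning the realized PDS and next state), the step-size schedules, and the placement of the discount factor are assumptions made for this example rather than the paper's exact formulation, and the reward table `R[s, a]` is assumed known, as is common in PDS settings.

```python
import numpy as np


def b_pds_learning(env, R, n_states, n_actions, n_pds,
                   gamma=0.95, eps=0.1, episodes=500):
    """Illustrative tabular sketch of b-PDS learning (assumed notation)."""
    V = np.zeros(n_pds)                                         # PDS value function (slow timescale)
    P_hat = np.full((n_states, n_actions, n_pds), 1.0 / n_pds)  # estimated (s, a) -> PDS probabilities
    n_sa = np.zeros((n_states, n_actions))                      # visit counters for state-action pairs
    n_pd = np.zeros(n_pds)                                      # visit counters for PDSs

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action using the current model estimate and PDS values
            q = R[s] + gamma * P_hat[s] @ V                     # shape: (n_actions,)
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(q))

            # One interaction step: realized PDS s_tilde and next (pre-decision) state
            s_tilde, s_next, done = env.step(s, a)

            # Fast timescale: track the unknown PDS transition probability from data
            # (larger, slower-decaying step size so the estimate equilibrates quickly).
            n_sa[s, a] += 1
            beta = 1.0 / (1.0 + n_sa[s, a]) ** 0.6
            indicator = np.zeros(n_pds)
            indicator[s_tilde] = 1.0
            P_hat[s, a] += beta * (indicator - P_hat[s, a])

            # Slow timescale: stochastic-approximation update of the PDS value,
            # bootstrapping through the estimated model at the next state
            # (smaller, faster-decaying step size).
            n_pd[s_tilde] += 1
            alpha = 1.0 / (1.0 + n_pd[s_tilde])
            target = np.max(R[s_next] + gamma * P_hat[s_next] @ V)
            V[s_tilde] += alpha * (target - V[s_tilde])

            s = s_next
    return V, P_hat
```

In this sketch the probability estimate plays the role of the fast component of the two-timescale scheme, so the value update effectively sees a quasi-static, increasingly accurate transition model, which is the intuition behind removing the need for prior information.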

