Abstract

This paper defines an Out Of Play model based on the Markov Decision Process (MDP). The model can find and recommend the best path for an outing, and a value iteration algorithm for MDPs is used to implement it. The paper first presents a CPU implementation of the model; then, in order to improve the performance of the value iteration algorithm, a parallel value iteration algorithm on the GPU is designed and presented. For computations over large amounts of data, the experimental results show that the parallel value iteration algorithm on the GPU performs far better than the serial value iteration algorithm on the CPU.

Introduction

With the development of graphics hardware, the Graphics Processing Unit (GPU) is attracting more and more attention in general-purpose computing. The GPU is no longer used only for graphics; it is also considered a powerful means of obtaining inexpensive, high-performance parallelism [1, 2]. A general-purpose GPU is a highly parallel, multithreaded, many-core processor with very high computational power and memory bandwidth [3]. Because its architecture is optimized for massively parallel computing, a GPU is in general more advantageous for large-scale parallel data processing than a general-purpose CPU [4], and the high-performance computing community has found that leveraging a GPU can yield performance increases of several orders of magnitude [5, 6]. High-performance computing with GPUs is called GPU computing [7]. Using the GPU for parallel computing is therefore becoming a new focus for speeding up computation.

A Markov decision process is a stochastic dynamic system based on the theory of Markov processes and decision making. In a Markov decision process, the ultimate goal is to find an action for every state so that the performance of the system is optimal. In this paper, in order to reach higher performance, a parallel implementation on the GPU is given, and OpenCL is selected as the programming framework.

Markov Decision Process

Markov Model. A Markov decision process [8, 9] can be defined as a four-tuple (S, A, T, R). In the tuple, S is a finite set of states and A is a finite set of actions; T is a probability distribution, where T(s, a, s') is the probability of a transition from state s to state s' when taking action a; R is the reward function, where R(s, a, s') is the reward obtained when taking action a to move from state s to state s'. In an MDP there is also a parameter γ, a discount factor used to reduce the influence of future rewards.

Value Iteration Algorithm. Many algorithms have been proposed to solve a Markov model; in this paper, the value iteration algorithm for MDPs is selected for study. The main idea of the value iteration algorithm [8] is iteration: the goal is to find the optimal policy through iteration of the value function. First, an initial value function is given for every state; then, in every iteration, the value function of every state s is updated by V(s) ← max_a Σ_s' T(s, a, s')[R(s, a, s') + γV(s')] until a stopping condition is satisfied. Fig. 1 [8] shows the pseudocode of the value iteration algorithm.

Fig. 1: Pseudocode of the value iteration algorithm
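Since the pseudocode figure is not reproduced here, the following is a minimal serial C sketch of the update it describes. The function name, the flat-array layout of T and R, and the constants NS, NA, GAMMA, and EPSILON are illustrative assumptions, not taken from the paper.

    #include <math.h>
    #include <string.h>

    #define NS 64         /* number of states (illustrative) */
    #define NA 4          /* number of actions (illustrative) */
    #define GAMMA 0.9     /* discount factor gamma (illustrative) */
    #define EPSILON 1e-6  /* convergence tolerance (illustrative) */

    /* T and R are flat arrays of size NS*NA*NS, indexed as
     * (s * NA + a) * NS + s2, holding T(s, a, s') and R(s, a, s'). */
    void value_iteration(const double *T, const double *R, double *V)
    {
        double Vnext[NS];
        double delta;
        do {
            delta = 0.0;
            for (int s = 0; s < NS; ++s) {
                double best = -INFINITY;        /* max over actions */
                for (int a = 0; a < NA; ++a) {
                    double q = 0.0;             /* expected return of (s, a) */
                    for (int s2 = 0; s2 < NS; ++s2) {
                        int i = (s * NA + a) * NS + s2;
                        q += T[i] * (R[i] + GAMMA * V[s2]);
                    }
                    if (q > best) best = q;
                }
                Vnext[s] = best;
                double diff = fabs(best - V[s]);
                if (diff > delta) delta = diff;
            }
            memcpy(V, Vnext, sizeof Vnext);
        } while (delta > EPSILON);  /* stop when the value function converges */
    }

Each sweep reads only the previous iterate V and writes Vnext, so the per-state updates within a sweep are independent of one another; this is the form of the algorithm that parallelizes directly on the GPU.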
Out of Play Model. According to the Markov model and the value iteration algorithm, this paper gives an MDP model: the Out Of Play model. The model can be described as follows: we go out to play, but we cannot decide where to go, or we do not know which mode of transportation (bus or walking) to choose to reach a destination. This scenario can be described as an MDP: choosing where to go and how to get there is random and uncertain, and from the uncertain destinations and modes of transportation an MDP model can be created. In the model, the different destinations form the finite set of states, and the different modes of transportation form the finite set of actions. Every state has an initial reward that represents its fun level. From every state, different actions can be taken to reach other states, following a probability distribution. In the model, a satisfactory optimal path must be found. Fig. 2 shows the whole system model.
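The parallel value iteration described in the abstract maps naturally onto this model: each GPU work-item can compute the Bellman update for one state. Since the paper selects OpenCL, a minimal sketch of one sweep as an OpenCL C kernel might look as follows; the kernel name, argument list, and buffer layout are assumptions, not the paper's actual implementation.

    /* One work-item per state: each computes the Bellman update for its
     * state s, reading the previous iterate Vold and writing Vnew.
     * Buffer layout and names are illustrative assumptions. */
    __kernel void value_iter_step(const int ns,            /* number of states  */
                                  const int na,            /* number of actions */
                                  const float gamma,       /* discount factor   */
                                  __global const float *T, /* T(s,a,s'), flat   */
                                  __global const float *R, /* R(s,a,s'), flat   */
                                  __global const float *Vold,
                                  __global float *Vnew)
    {
        int s = get_global_id(0);
        if (s >= ns) return;

        float best = -INFINITY;
        for (int a = 0; a < na; ++a) {
            float q = 0.0f;
            for (int s2 = 0; s2 < ns; ++s2) {
                int i = (s * na + a) * ns + s2;
                q += T[i] * (R[i] + gamma * Vold[s2]);
            }
            best = fmax(best, q);
        }
        Vnew[s] = best;
    }

On the host side, such a kernel would be enqueued repeatedly with the Vold and Vnew buffers swapped between launches, and convergence would be checked by reducing the maximum of |Vnew[s] - Vold[s]| after each sweep, mirroring the serial stopping condition.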
