Abstract

This article formulates real-time gantry scheduling in a gantry work cell, where material transfer is driven by gantries, as a Markov decision process (MDP). Classical learning and planning methods for solving the resulting optimization problem are discussed. A new method, called "Q-ADP," is proposed to integrate reinforcement learning (RL) with approximate dynamic programming (ADP). Q-ADP uses a model-free Q-learning algorithm to learn state values through interaction with the environment, while planning steps interleaved with learning use ADP to repeatedly update these values along simulated sample paths. A model of one-step transition probabilities, built from the machines' reliability model, supports the ADP planning steps. To demonstrate the effectiveness of the method, a numerical study compares its production performance with that of a standard Q-learning algorithm. The simulation results show that Q-ADP outperforms standard Q-learning given the same length of training. They also show that, by repeatedly updating state values along sample paths, Q-ADP requires less data for the gantry policy to converge, which makes the method promising when real data are limited.

Note to Practitioners — The goal of this work is to find a near-optimal gantry assignment policy for real-time control of material-handling gantry/robot movements in gantry work cells. Properly assigning gantries based on the real-time state of the production system can prevent machine stoppages due to material shortage and consequently improve production performance. Gantry scheduling is a sequential decision-making problem and can be represented as a Markov decision process (MDP). To solve the MDP, an algorithm integrating model-free Q-learning and model-based approximate dynamic programming (ADP) is proposed. By learning directly from interaction with the environment, the method avoids the bias that an inaccurate system model could introduce. Meanwhile, a planning process interleaved with learning speeds up convergence of the policy, which is particularly beneficial when real data are insufficient.
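
The abstract describes Q-ADP only at a high level. As a rough illustration of how model-free updates and model-based planning steps can be interleaved, the following Python sketch follows the general Dyna-Q pattern; the environment interface (`env`), the reliability-based transition model (`model`), and all helper names are hypothetical placeholders for illustration, not the paper's implementation.

```python
import random
from collections import defaultdict

def q_adp_sketch(env, model, actions, episodes=500, planning_steps=10,
                 alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular sketch: Q-learning on real transitions, plus extra value
    updates sampled from a one-step transition model (Dyna-Q-style)."""
    Q = defaultdict(float)  # Q[(state, action)] -> value estimate

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy gantry assignment
            a = random.choice(actions) if random.random() < epsilon else greedy(s)

            # (1) model-free step: learn from the real interaction
            s_next, r, done = env.step(a)
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])

            # (2) planning steps: replay simulated one-step transitions drawn
            #     from the reliability-based transition model (placeholder API)
            for _ in range(planning_steps):
                sp, ap = model.sample_state_action()
                sp_next, rp = model.sample_transition(sp, ap)
                tgt = rp + gamma * max(Q[(sp_next, b)] for b in actions)
                Q[(sp, ap)] += alpha * (tgt - Q[(sp, ap)])

            s = s_next
    return Q
```

In this sketch, `planning_steps` controls how many simulated updates are performed per real transition; setting it above zero is what lets the value estimates improve faster than with real interaction alone, which is the intuition behind the abstract's claim that Q-ADP needs less data for the policy to converge.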
