Policy Iteration Based on a Learned Transition Model

Vivek Ramavajjala, Charles Elkan

doi:10.1007/978-3-642-33486-3_14

Abstract

This paper investigates a reinforcement learning method that combines learning a model of the environment with least-squares policy iteration (LSPI). The LSPI algorithm learns a linear approximation of the optimal state-action value function; the idea studied here is to let this value function depend on a learned estimate of the expected next state instead of directly on the current state and action. This approach makes it easier to define useful basis functions, and hence to learn a useful linear approximation of the value function. Experiments show that the new algorithm, called NSPI for next-state policy iteration, performs well on two standard benchmarks, the well-known mountain car and inverted pendulum swing-up tasks. More importantly, the NSPI algorithm performs well, and better than a specialized recent method, on a resource management task known as the day-ahead wind commitment problem. This latter task has action and state spaces that are high-dimensional and continuous.

Keywords: Action Space, Wind Farm, Transition Model, Markov Decision Process, Inverted Pendulum
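The abstract does not give the algorithm in detail, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a linear least-squares transition model, a small polynomial basis over the predicted next state, and a finite set of candidate actions for the greedy step. All function names, the basis, and the hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np

def fit_transition_model(S, A, S_next):
    """Least-squares fit of s' ~ [s, a, 1] @ W (an assumed linear model class)."""
    X = np.hstack([S, A, np.ones((len(S), 1))])
    W, *_ = np.linalg.lstsq(X, S_next, rcond=None)
    return W

def predict_next(W, S, A):
    """Predicted (expected) next state for each state-action pair."""
    X = np.hstack([S, A, np.ones((len(S), 1))])
    return X @ W

def features(S_pred):
    """Hypothetical basis on the predicted next state: bias, linear, squared terms."""
    return np.hstack([np.ones((len(S_pred), 1)), S_pred, S_pred ** 2])

def greedy_actions(w, W, S, candidates):
    """For each state, pick the candidate action with the highest estimated Q-value."""
    q = np.stack(
        [features(predict_next(W, S, np.tile(a, (len(S), 1)))) @ w for a in candidates],
        axis=1,
    )
    return candidates[np.argmax(q, axis=1)]

def lstdq(w, W, S, A, R, S_next, candidates, gamma=0.95, ridge=1e-6):
    """One LSTDQ solve, with features taken on predicted next states."""
    phi = features(predict_next(W, S, A))
    A_next = greedy_actions(w, W, S_next, candidates)
    phi_next = features(predict_next(W, S_next, A_next))
    # Small ridge term (an assumption) keeps the linear system well conditioned.
    A_mat = phi.T @ (phi - gamma * phi_next) + ridge * np.eye(phi.shape[1])
    b = phi.T @ R
    return np.linalg.solve(A_mat, b)

def next_state_policy_iteration(S, A, R, S_next, candidates, n_iters=20, tol=1e-6):
    """Fit the transition model once, then alternate evaluation and greedy improvement."""
    W = fit_transition_model(S, A, S_next)
    w = np.zeros(features(S[:1]).shape[1])
    for _ in range(n_iters):
        w_new = lstdq(w, W, S, A, R, S_next, candidates)
        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w = w_new
    return W, w
```

Here `S`, `A`, and `S_next` are arrays of shape `(n, state_dim)`, `(n, action_dim)`, and `(n, state_dim)`, `R` has shape `(n,)`, and `candidates` has shape `(k, action_dim)`. Because the value function sees only the predicted next state, the basis functions are defined over states alone rather than over state-action pairs, which is the simplification the abstract points to.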
