Abstract

Monte Carlo Tree Search (MCTS) has demonstrated excellent performance in solving many planning problems. However, in many practical applications, especially in adversarial environments, the state space and branching factor are huge and the planning horizon is long. In flat, non-hierarchical MCTS, it is computationally expensive to reach a sufficient number of rewarded states that lie far from the root, so flat MCTS is inefficient for planning problems with long horizons, huge state spaces, and large branching factors. In this work, we propose a novel hierarchical MCTS-based online planning method, HMCTS-OP, to tackle this issue. HMCTS-OP integrates MAXQ-based task hierarchies and hierarchical MCTS algorithms into an online planning framework. The MAXQ-based task hierarchies reduce the search space and guide the search process, which significantly reduces the computational complexity. This reduction in turn enables MCTS to search deeper and find better actions within a limited time budget. We evaluate HMCTS-OP on online planning in an asymmetric adversarial environment. The experimental results show that HMCTS-OP outperforms other online planning methods in this domain.
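The abstract contrasts flat MCTS, which branches over every primitive action at every node, with hierarchy-guided search. The sketch below is not the paper's HMCTS-OP implementation; it is a minimal illustration, under assumed names (Task, Node, uct_select, expand), of how a MAXQ-style task hierarchy can shrink the branching factor by expanding only the subtasks and primitive actions permitted under the current task.

```python
# Minimal sketch: task-restricted node expansion in an MCTS-style search.
# All class and function names here are illustrative assumptions.
from dataclasses import dataclass, field
import math


@dataclass
class Task:
    name: str
    children: list          # subtasks or primitive actions available under this task


@dataclass
class Node:
    task: Task               # the subtask this node is executing
    visits: int = 0
    value: float = 0.0
    children: dict = field(default_factory=dict)


def expand(node: Node) -> None:
    """Create child nodes only for actions permitted by the current task.

    Flat MCTS would create a child for every primitive action; restricting
    expansion to node.task.children is what reduces the branching factor.
    """
    for sub in node.task.children:
        if sub.name not in node.children:
            node.children[sub.name] = Node(task=sub)


def uct_select(node: Node, c: float = 1.4) -> Node:
    """Pick the child with the highest UCB1 score (unvisited children first)."""
    def score(child: Node) -> float:
        if child.visits == 0:
            return float("inf")
        exploit = child.value / child.visits
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore

    return max(node.children.values(), key=score)
```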

Highlights

  • It is challenging to solve large-scale planning problems

  • We model the online planning problem in the asymmetric adversarial environment as a Markov Decision Process (MDP) and extend the MDP to a semi-Markov decision process (SMDP) by introducing task hierarchies

  • The HMCTS-OP integrates the MAXQ-based task hierarchies and hierarchical Monte Carlo Tree Search (MCTS) algorithms into the online planning framework

Introduction

It is challenging to solve large-scale planning problems, which suffer from the “curse of dimensionality”. Online planning algorithms such as Monte Carlo Tree Search (MCTS) mitigate this problem by avoiding the computation of a complete policy over the whole state space. The Markov Decision Process (MDP) is a fundamental formalism for learning and planning problems. It is defined by a four-tuple (S, A, R, P) with states S, actions A, a transition function P, and a reward function R. P(s′ | s, a) is the probability of transitioning to state s′ after performing action a in state s. R(s′ | s, a) ∈ ℝ is the reward for the transition from s to s′ under action a. The probability of selecting action a in state s is written as π(a | s).
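As a concrete reading of this definition, the sketch below encodes a tiny MDP with plain Python dictionaries; the three-state chain, its transition probabilities, and its rewards are illustrative assumptions, not taken from the paper.

```python
import random

states = ["s0", "s1", "s2"]            # S
actions = ["left", "right"]            # A

# P[(s, a)] -> list of (next_state, probability), i.e. P(s' | s, a)
P = {
    ("s0", "right"): [("s1", 0.9), ("s0", 0.1)],
    ("s1", "right"): [("s2", 0.9), ("s1", 0.1)],
    ("s2", "right"): [("s2", 1.0)],
    ("s0", "left"):  [("s0", 1.0)],
    ("s1", "left"):  [("s0", 1.0)],
    ("s2", "left"):  [("s1", 1.0)],
}

# R[(s, a, s')] -> real-valued reward, i.e. R(s' | s, a); unlisted transitions give 0
R = {("s1", "right", "s2"): 1.0}


def step(s: str, a: str):
    """Sample s' ~ P(. | s, a) and return (s', reward)."""
    next_states, probs = zip(*P[(s, a)])
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R.get((s, a, s_next), 0.0)
```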
