Abstract

Decision making is widespread in many domains, including aircraft collision avoidance, autonomous driving, and space exploration. Except in highly engineered environments, uncertainties are unavoidable and cannot be ignored. Uncertainty challenges the decision maker to identify rational actions, and solving such problems in large and complex scenarios is even harder. This thesis provides new perspectives to alleviate the difficulty of solving large-scale decision-making problems under uncertainty.

The first decision-making model considered is the stochastic Multi-Armed Bandit (MAB), an important model for studying the exploration-exploitation tradeoff. In this problem, a gambler repeatedly chooses among a number of slot machines (arms) to maximize the total payout. The outcome of playing an arm is stochastic, and the total number of plays is fixed. To solve large-scale MABs, we introduce a method called the Cross-Entropy-based Multi-Armed Bandit (CEMAB), which adopts the Cross-Entropy method as a noisy optimizer (see the illustrative sketch after this abstract). Various MAB test cases are used to compare the performance of CEMAB against different competitors. The results show that CEMAB is promising when the arm space is large.

Making principled decisions in the presence of uncertainty is often facilitated by the powerful framework of Partially Observable Markov Decision Processes (POMDPs). However, precisely because of its generality, solving this problem exactly is computationally intractable. Recently, approximate POMDP solvers have been shown to compute good decision strategies, but handling POMDPs with large action spaces remains difficult. We propose a sampling method called the Quantile-Based Action Selector (QBASE) that can scale up to very large problems. We employ several scalable robotics scenarios with up to 100,000 actions to evaluate the performance of the proposed technique. In numerical experiments, QBASE performs significantly better than POMCP, a state-of-the-art solver, when the action space is large (>100 actions).

Finding the best performance of POMDP solvers involves parameter optimization. It is usually done by searching for well-performing settings off-line, which unavoidably adds an extra burden on users. We extend QBASE to identify good parameters automatically during planning; the resulting method is called Adaptive Parameter Sampling QBASE (APS-QBASE). Two tasks with up to one million possible actions are used in numerical experiments. We find that APS-QBASE achieves higher policy quality than several on-line POMDP approaches with different action selectors, including QBASE and an enhancement of POMCP for handling large action spaces. A sensitivity study of APS-QBASE suggests that the proposed method can significantly reduce the difficulty of setting good parameters.
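To make the role of the Cross-Entropy method concrete, the sketch below shows a generic cross-entropy loop used as a noisy optimizer over a discrete arm space. It is a minimal illustration under assumed names and parameters (`pull`, `batch`, `elite_frac`, `smoothing` are invented here), not the CEMAB algorithm from the thesis itself.

```python
import numpy as np

def ce_bandit(pull, n_arms, horizon, batch=20, elite_frac=0.2, smoothing=0.7):
    """Generic cross-entropy loop over a discrete arm space (illustrative only).

    `pull(arm)` returns a stochastic reward for the chosen arm; the parameter
    names and values are assumptions for this sketch, not the thesis's CEMAB.
    """
    probs = np.full(n_arms, 1.0 / n_arms)        # sampling distribution over arms
    n_elite = max(1, int(batch * elite_frac))
    total_reward, pulls = 0.0, 0

    while pulls < horizon:
        k = min(batch, horizon - pulls)
        arms = np.random.choice(n_arms, size=k, p=probs)   # sample a batch of arms
        rewards = np.array([pull(a) for a in arms])        # noisy evaluations
        total_reward += rewards.sum()
        pulls += k

        elite = arms[np.argsort(rewards)[-n_elite:]]       # best-performing samples
        new_probs = np.bincount(elite, minlength=n_arms) / len(elite)
        probs = smoothing * new_probs + (1 - smoothing) * probs  # smoothed update
        probs /= probs.sum()

    return total_reward, probs
```

The key design choice illustrated here is that the sampling distribution concentrates on empirically high-reward arms over time, so the method never needs to enumerate the full arm space, which is what makes cross-entropy-style sampling attractive when the number of arms is very large.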
