Abstract

When a communication robot has many behavior options, satisfying the user requires identifying and producing the behavior best suited to that user as quickly as possible. Such a problem can be formulated as a multi-armed bandit problem: each of several arms yields a reward with a certain probability when pulled, and the challenge is to select the arm that maximizes the cumulative reward. Treating the robot's behavior options as the arms and user satisfaction as the reward, it is desirable to find the maximum-reward arm quickly even when the number of arms is large. This study proposes a new algorithm that uses a self-organizing map to solve the multi-armed bandit problem. Numerical experiments on the stochastic bandit problem demonstrate that, as the number of arms grows, the proposed method selects high-reward arms faster than representative existing algorithms such as UCB1, UCB1-tuned, and Thompson sampling.
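
The abstract does not describe the self-organizing-map algorithm itself, so the sketch below only illustrates the stochastic bandit setting being evaluated, together with two of the cited baselines, UCB1 and Thompson sampling, on Bernoulli-reward arms. The function names, the 50-arm instance, and the horizon are illustrative assumptions, not values from the paper.

    import math
    import random

    def ucb1(means, horizon, rng):
        """UCB1: pull the arm maximizing empirical mean + sqrt(2*ln t / n_i)."""
        k = len(means)
        counts = [0] * k      # times each arm was pulled
        sums = [0.0] * k      # cumulative reward per arm
        for a in range(k):    # initialization: pull each arm once
            counts[a] += 1
            sums[a] += float(rng.random() < means[a])
        for t in range(k + 1, horizon + 1):
            a = max(range(k),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
            counts[a] += 1
            sums[a] += float(rng.random() < means[a])
        return sum(sums)

    def thompson(means, horizon, rng):
        """Thompson sampling with Beta(1, 1) priors over each arm's mean."""
        k = len(means)
        wins = [1.0] * k      # Beta alpha parameters (successes + 1)
        losses = [1.0] * k    # Beta beta parameters (failures + 1)
        total = 0.0
        for _ in range(horizon):
            # Sample a plausible mean from each posterior; pull the argmax.
            a = max(range(k), key=lambda i: rng.betavariate(wins[i], losses[i]))
            r = float(rng.random() < means[a])
            wins[a] += r
            losses[a] += 1.0 - r
            total += r
        return total

    if __name__ == "__main__":
        rng = random.Random(0)
        means = [rng.random() for _ in range(50)]  # a "many arms" instance (assumed size)
        horizon = 5000
        print("oracle:   ", max(means) * horizon)
        print("UCB1:     ", ucb1(means, horizon, random.Random(1)))
        print("Thompson: ", thompson(means, horizon, random.Random(1)))

Both baselines must gather at least some evidence about every arm before they can concentrate on the best one, so their cost of exploration grows with the number of arms; this is presumably the difficulty in the many-arm regime that the paper's self-organizing-map approach targets.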
