Abstract

When a communication robot has many behavior options, satisfying the user requires identifying and producing the behavior best suited to that user as quickly as possible. Such a problem can be formulated as a multi-armed bandit problem: each of several arms yields a reward with a certain probability when pulled, and the challenge is to select the arm that maximizes the cumulative reward. Treating the robot's behavior options as the arms and user satisfaction as the reward, it is desirable to find the maximum-reward arm quickly even when the number of arms is large. This study proposes a new algorithm that uses a self-organizing map to solve the multi-armed bandit problem. Numerical experiments on the stochastic bandit problem demonstrate that, as the number of arms grows, the proposed method selects high-reward arms faster than representative existing algorithms such as UCB1, UCB1-tuned, and Thompson sampling.
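
The abstract does not describe the self-organizing-map algorithm itself, so the sketch below only illustrates the stochastic bandit setting being evaluated, together with two of the cited baselines, UCB1 and Thompson sampling, on Bernoulli-reward arms. The function names, the 50-arm instance, and the horizon are illustrative assumptions, not values from the paper.

    import math
    import random

    def ucb1(means, horizon, rng):
        """UCB1: pull the arm maximizing empirical mean + sqrt(2*ln t / n_i)."""
        k = len(means)
        counts = [0] * k      # times each arm was pulled
        sums = [0.0] * k      # cumulative reward per arm
        for a in range(k):    # initialization: pull each arm once
            counts[a] += 1
            sums[a] += float(rng.random() < means[a])
        for t in range(k + 1, horizon + 1):
            a = max(range(k),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
            counts[a] += 1
            sums[a] += float(rng.random() < means[a])
        return sum(sums)

    def thompson(means, horizon, rng):
        """Thompson sampling with Beta(1, 1) priors over each arm's mean."""
        k = len(means)
        wins = [1.0] * k      # Beta alpha parameters (successes + 1)
        losses = [1.0] * k    # Beta beta parameters (failures + 1)
        total = 0.0
        for _ in range(horizon):
            # Sample a plausible mean from each posterior; pull the argmax.
            a = max(range(k), key=lambda i: rng.betavariate(wins[i], losses[i]))
            r = float(rng.random() < means[a])
            wins[a] += r
            losses[a] += 1.0 - r
            total += r
        return total

    if __name__ == "__main__":
        rng = random.Random(0)
        means = [rng.random() for _ in range(50)]  # a "many arms" instance (assumed size)
        horizon = 5000
        print("oracle:   ", max(means) * horizon)
        print("UCB1:     ", ucb1(means, horizon, random.Random(1)))
        print("Thompson: ", thompson(means, horizon, random.Random(1)))

Both baselines must gather at least some evidence about every arm before they can concentrate on the best one, so their cost of exploration grows with the number of arms; this is presumably the difficulty in the many-arm regime that the paper's self-organizing-map approach targets.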
