Abstract

The Multi-Armed Bandit problem is becoming increasingly popular because it models real-world sequential decision making across application domains, including clinical trials, recommender systems, and online advertising. It is a classical instance of the exploration-exploitation dilemma in reinforcement learning: an optimal strategy must be chosen based on the rewards observed for each arm. However, existing Multi-Armed Bandit algorithms suffer from several shortcomings, such as blind exploration, weak generalization ability, and exploration that never terminates. To address these shortcomings, this paper proposes a Multi-Armed Bandit algorithm based on sensitivity to variance change. The algorithm takes the change in reward variance as its cue: it adjusts the exploration probability according to the average variance change over all actions and, when exploring, selects the action with the largest variance change. At the same time, to reduce wasted action selections and maximize cumulative reward, a parameter N_con is introduced to record the number of consecutive selections of the same action; exploration stops once N_con reaches a threshold. Experiments show that the proposed algorithm ultimately obtains higher reward and lower regret.
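The abstract describes the mechanism only at a high level. Below is a minimal Python sketch of how such a variance-change-sensitive policy might look. All names (VarianceChangeBandit, n_con_max, eps0) and the exact exploration-probability formula are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class VarianceChangeBandit:
    """Illustrative sketch of a variance-change-sensitive bandit.

    Exploration probability is scaled by the average change in reward
    variance across arms (assumed formula); when exploring, the arm whose
    variance changed most is pulled; exploration halts once the same arm
    has been chosen n_con_max times in a row (the paper's N_con).
    """

    def __init__(self, n_arms, n_con_max=50, eps0=0.1):
        self.n_arms = n_arms
        self.n_con_max = n_con_max          # threshold on N_con
        self.eps0 = eps0                    # base exploration probability
        self.rewards = [[] for _ in range(n_arms)]
        self.prev_var = np.zeros(n_arms)
        self.var_change = np.zeros(n_arms)  # |new variance - old variance| per arm
        self.last_arm = None
        self.n_con = 0                      # consecutive picks of the same arm
        self.exploring = True

    def select(self, rng):
        # Warm start: try every arm once so variances are defined.
        for a, r in enumerate(self.rewards):
            if not r:
                return a
        # Explore with probability scaled by the mean variance change
        # (one plausible reading of the abstract, not the paper's formula).
        if self.exploring and rng.random() < self.eps0 * (1 + self.var_change.mean()):
            return int(np.argmax(self.var_change))   # arm with largest variance change
        means = [np.mean(r) for r in self.rewards]
        return int(np.argmax(means))                 # exploit best empirical mean

    def update(self, arm, reward):
        self.rewards[arm].append(reward)
        new_var = np.var(self.rewards[arm]) if len(self.rewards[arm]) > 1 else 0.0
        self.var_change[arm] = abs(new_var - self.prev_var[arm])
        self.prev_var[arm] = new_var
        # Track consecutive selections; stop exploring once one arm dominates.
        self.n_con = self.n_con + 1 if arm == self.last_arm else 1
        self.last_arm = arm
        if self.n_con >= self.n_con_max:
            self.exploring = False

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bandit = VarianceChangeBandit(n_arms=5)
    true_means = rng.uniform(0.0, 1.0, 5)
    total = 0.0
    for t in range(1000):
        a = bandit.select(rng)
        r = rng.normal(true_means[a], 0.1)
        bandit.update(a, r)
        total += r
    print(f"cumulative reward: {total:.1f}")
```

Once N_con crosses the threshold, the policy becomes purely greedy, which is how the sketch realizes the paper's goal of avoiding exploration that never terminates.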
