Abstract

By exploiting ultrafast and irregular time series generated by lasers with delayed feedback, we have previously demonstrated a scalable algorithm that solves multi-armed bandit (MAB) problems via time-division multiplexing of laser chaos time series. Although that algorithm detects the arm with the highest reward expectation, it cannot correctly recognize the order of the arms in terms of their reward expectations. Here, we present an algorithm in which the degree of exploration is adaptively controlled based on confidence intervals that represent the estimation accuracy of the reward expectations. We demonstrate numerically that our approach significantly improves arm order recognition accuracy and reduces the dependence on the reward environment, while the total reward is nearly preserved compared with conventional MAB methods. This study applies to sectors where order information is critical, such as efficient allocation of resources in information and communications technology.

Highlights

  • By exploiting ultrafast and irregular time series generated by lasers with delayed feedback, we have previously demonstrated a scalable algorithm to solve multi-armed bandit (MAB) problems utilizing the time-division multiplexing of laser chaos time series

  • The present study applies laser chaos to the multi-armed bandit (MAB) problem [6]

  • We have defined the reward, regret, and correct order rate (COR) as metrics to quantitatively evaluate the performance of the method
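
The highlights name reward, regret, and correct order rate (COR) as the evaluation metrics. As a rough sketch of how such metrics could be computed (the exact definitions used in the paper may differ; here regret is taken as the expected reward lost versus always playing the best arm, and COR as the fraction of arm pairs whose estimated ordering matches the true ordering):

```python
def regret(selections, true_p):
    # Cumulative (pseudo-)regret: expected reward lost by the chosen
    # arms relative to always playing the best arm.
    best = max(true_p)
    return sum(best - true_p[arm] for arm in selections)

def correct_order_rate(estimated_p, true_p):
    # Fraction of arm pairs (i, j) whose estimated ordering agrees
    # with the true ordering of reward expectations.
    k = len(true_p)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    hits = sum((estimated_p[i] - estimated_p[j]) * (true_p[i] - true_p[j]) > 0
               for (i, j) in pairs)
    return hits / len(pairs)
```

For example, with true reward probabilities [0.9, 0.5], the selection sequence [0, 1, 0] incurs regret 0.4, since only the middle pull loses 0.9 − 0.5 in expectation.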



Introduction

By exploiting ultrafast and irregular time series generated by lasers with delayed feedback, we have previously demonstrated a scalable algorithm that solves multi-armed bandit (MAB) problems via time-division multiplexing of laser chaos time series. The present study applies laser chaos to the multi-armed bandit problem [6]. Reinforcement learning (RL), a branch of machine learning alongside supervised and unsupervised learning, studies optimal decision-making rules. It differs from other machine learning tasks (e.g., image recognition) in that the notion of reward comes into play. The MAB is a sequential decision problem of maximizing total rewards when there are K (> 1) arms, or selections, whose reward probabilities are unknown. An algorithm for the MAB using laser chaos time series was proposed in 2018 [6]. This algorithm sets two goals: to maximize the total reward and to identify the best arm. When there are multiple channel users, not all users can use the best channel simultaneously.

Scientific Reports | (2021) 11:4459
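The abstract's idea of controlling exploration through confidence intervals can be illustrated with the standard UCB1 algorithm (a textbook method, not the authors' laser-chaos algorithm): each arm's score is its empirical mean plus a confidence radius that shrinks as that arm accumulates samples, so poorly estimated arms keep being explored. A minimal sketch for Bernoulli-reward arms:

```python
import math
import random

def ucb1(true_p, horizon, seed=0):
    # UCB1 on a K-armed Bernoulli bandit: after playing each arm once,
    # select the arm maximizing  mean_i + sqrt(2 ln t / n_i),
    # where the square-root term is the confidence radius.
    rng = random.Random(seed)
    k = len(true_p)
    counts = [0] * k        # pulls per arm
    means = [0.0] * k       # empirical mean reward per arm
    total = 0.0             # accumulated reward
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1     # initialization: play every arm once
        else:
            arm = max(range(k),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < true_p[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
        total += reward
    return total, counts, means
```

Over a long horizon the best arm dominates the pull counts, while every arm is sampled often enough that its mean estimate (and hence the arms' ordering) can be assessed, which is the trade-off the paper's adaptive exploration targets.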

