Abstract
We study the problem of stochastic multiple-arm identification, where an agent sequentially explores a size- subset of arms (also known as a super arm) from given arms and tries to identify the best super arm. Most work so far has considered the semi-bandit setting, where the agent can observe the reward of each pulled arm or assumed each arm can be queried at each round. However, in real-world applications, it is costly or sometimes impossible to observe a reward of individual arms. In this study, we tackle the full-bandit setting, where only a noisy observation of the total sum of a super arm is given at each pull. Although our problem can be regarded as an instance of the best arm identification in linear bandits, a naive approach based on linear bandits is computationally infeasible since the number of super arms is exponential. To cope with this problem, we first design a polynomial-time approximation algorithm for a 0-1 quadratic programming problem arising in confidence ellipsoid maximization. Based on our approximation algorithm, we propose a bandit algorithm whose computation time is (log ), thereby achieving an exponential speedup over linear bandit algorithms. We provide a sample complexity upper bound that is still worst-case optimal. Finally, we conduct experiments on large-scale data sets with more than 10 super arms, demonstrating the superiority of our algorithms in terms of both the computation time and the sample complexity.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.