Abstract
We consider stochastic bandit problems with a continuous set of arms and where the expected reward is a continuous and unimodal function of the arm. For these problems, we propose the Stochastic Polychotomy (SP) algorithms, and derive finite-time upper bounds on its regret and optimization error. We show that, for a class of reward functions, the SP algorithm achieves a regret and an optimization error with optimal scalings, i.e., $O(\sqrtT )$ and $O(1/\sqrtT )$ (up to a logarithmic factor), respectively. SP constitutes the first order-optimal algorithm for non-smooth expected reward functions, as well as for smooth functions with unknown smoothness. The algorithm is based on sequential statistical tests used to successively trim an interval that contains the best arm with high probability. These tests exhibit a minimal sample complexity which confers to SP its adaptivity and optimality. Numerical experiments actually reveal that the algorithm even outperforms state-of-the-art algorithms that exploit the knowledge of the smoothness of the reward function. The performance of SP is further illustrated on the problem of setting optimal reserve prices in repeated second-price auctions; there, the algorithm is evaluated on real-world data.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: Proceedings of the ACM on Measurement and Analysis of Computing Systems
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.