Abstract

Contextual multi-armed bandit algorithms are widely used to solve online decision-making problems. However, traditional methods assume linear rewards and low-dimensional contextual information, leading to high regret and low online efficiency in real-world applications. In this paper, we propose a novel framework called interconnected neural-linear UCB (InlUCB) that interleaves two learning processes: an offline representation learning part, which converts the original contextual information to low-dimensional latent features via a non-linear transformation, and an online exploration part, which updates a linear layer using the upper confidence bound (UCB). Together, these two processes yield an effective and efficient strategy for online decision-making problems with non-linear rewards and high-dimensional contexts. We derive a general expression for the finite-time cumulative regret bound of InlUCB, and a tighter regret bound under certain assumptions on the neural network. We test InlUCB against state-of-the-art bandit methods on synthetic and real-world datasets with non-linear rewards and high-dimensional contexts. Results demonstrate that InlUCB significantly improves cumulative regret and online efficiency.

Keywords: Contextual bandits · Upper confidence bound · Neural networks · Regret bound
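The interleaved design described above can be made concrete with a minimal sketch. The code below is an illustrative reconstruction based only on the abstract, not the paper's actual algorithm: the `Encoder`, `train_encoder`, and `LinUCB` names, the network sizes, the retraining schedule (every 100 rounds), and the toy tanh reward are all assumptions introduced for illustration.

```python
# Illustrative sketch of the InlUCB idea: an offline neural encoder maps
# high-dimensional contexts to low-dimensional latent features, and an
# online LinUCB layer explores on top of those features. All names,
# dimensions, and schedules are assumptions, not the paper's method.
import numpy as np
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Non-linear map from raw contexts (dim d) to latent features (dim k)."""
    def __init__(self, d, k, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                  nn.Linear(hidden, k))
        self.head = nn.Linear(k, 1)  # reward head, used only for offline training

    def forward(self, x):
        z = self.body(x)
        return z, self.head(z)

def train_encoder(enc, X, r, epochs=200, lr=1e-3):
    """Offline phase: refit the encoder on (context, reward) pairs logged so far."""
    opt = torch.optim.Adam(enc.parameters(), lr=lr)
    X_t = torch.as_tensor(X, dtype=torch.float32)
    r_t = torch.as_tensor(r, dtype=torch.float32)
    for _ in range(epochs):
        opt.zero_grad()
        _, pred = enc(X_t)
        nn.functional.mse_loss(pred.squeeze(-1), r_t).backward()
        opt.step()

class LinUCB:
    """Online phase: standard LinUCB acting on the k-dim latent features."""
    def __init__(self, k, alpha=1.0, lam=1.0):
        self.alpha = alpha
        self.A = lam * np.eye(k)  # ridge-regularised Gram matrix
        self.b = np.zeros(k)

    def select(self, Z):
        """Z: (n_arms, k) latent features; return the UCB-maximising arm."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        bonus = np.sqrt(np.einsum('ij,jk,ik->i', Z, A_inv, Z))
        return int(np.argmax(Z @ theta + self.alpha * bonus))

    def update(self, z, reward):
        self.A += np.outer(z, z)
        self.b += reward * z

# Interleaved run on a toy problem with a non-linear reward.
d, k, n_arms, T = 100, 8, 10, 500
rng = np.random.default_rng(0)
w_true = rng.standard_normal(d)
enc, bandit = Encoder(d, k), LinUCB(k)
X_log, r_log = [], []
for t in range(T):
    contexts = rng.normal(size=(n_arms, d))  # one raw context per arm
    with torch.no_grad():
        Z, _ = enc(torch.as_tensor(contexts, dtype=torch.float32))
    a = bandit.select(Z.numpy())
    reward = np.tanh(contexts[a] @ w_true / np.sqrt(d)) + 0.1 * rng.standard_normal()
    bandit.update(Z.numpy()[a], reward)
    X_log.append(contexts[a]); r_log.append(reward)
    if (t + 1) % 100 == 0:  # periodically re-enter the offline phase
        train_encoder(enc, np.array(X_log), np.array(r_log))
```

One caveat on this sketch: after each offline retraining the latent feature space changes, yet the LinUCB statistics above are carried over unchanged for brevity. A faithful implementation would need to reconcile the updated representation with the online layer's statistics; the abstract does not specify how InlUCB handles this.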
