Abstract

We consider a principal, or controller, that picks actions from a fixed action set to control an evolving system with converging dynamics. Converging dynamics means that, if the principal holds the same action, the system asymptotically converges to a unique stable state determined by that action. In our model, the dynamics of the system are unknown to the principal, and the principal receives only bandit feedback (possibly noisy) on the impact of its actions. The principal aims to learn which stable state yields the highest reward while satisfying specific constraints, and to steer the system into this state as quickly as possible. We measure the principal's performance in terms of regret and constraint violation. For the case where the action set is finite, we propose an algorithm, Optimistic-Pessimistic Convergence and Confidence Bounds (OP-C2B), that ensures sublinear regret and sublinear constraint violation simultaneously. In particular, OP-C2B achieves logarithmic regret and constraint violation when the system's convergence rate is linear or superlinear. Furthermore, we generalize OP-C2B to the case of an infinite action set and show that it maintains sublinear regret and constraint violation.
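
To make the notion of converging dynamics concrete, the following is a minimal toy sketch and not the OP-C2B algorithm. It assumes a scalar state that moves a fixed fraction of the way toward an action-dependent stable state at every step (a linear convergence rate), and a noisy reward observed by the controller. The names `step`, `noisy_reward`, `targets`, and `rate`, as well as the reward shape, are hypothetical choices made only for illustration.

```python
import random


def step(state, action, targets, rate=0.5):
    """Move the state a fraction `rate` of the way toward the stable
    state associated with `action` (assumed linear convergence)."""
    target = targets[action]
    return state + rate * (target - state)


def noisy_reward(state, noise_std=0.1, rng=random):
    """Bandit feedback: a noisy scalar reward that depends only on the
    current state. The reward shape here is an arbitrary toy choice."""
    return -abs(state - 1.0) + rng.gauss(0.0, noise_std)


# Hypothetical finite action set; each action has its own stable state.
targets = {"a": 0.0, "b": 1.0, "c": 2.0}

state = 5.0
trajectory = []
rewards = []
for _ in range(30):
    # Holding the same action drives the state toward targets["b"].
    state = step(state, "b", targets)
    trajectory.append(state)
    rewards.append(noisy_reward(state))
```

In this sketch the distance to the stable state halves at each step, so rewards observed early after switching to an action reflect the transient state rather than the stable one. This is the effect a learner in this setting has to account for.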
