We consider a principal, or controller, who can pick actions from a fixed action set to control an evolving system with converging dynamics. By converging dynamics we mean that, if the principal holds the same action, the system asymptotically converges to a unique stable state determined by that action. In our model, the dynamics of the system are unknown to the principal, who receives only (possibly noisy) bandit feedback on the impact of their actions. The principal aims to learn which stable state yields the highest reward while adhering to specific constraints, and to steer the system into this state as quickly as possible. We measure the principal's performance in terms of regret and constraint violation. When the action set is finite, we propose an algorithm, Optimistic-Pessimistic Convergence and Confidence Bounds (OP-C2B), that ensures sublinear regret and constraint violation simultaneously. In particular, OP-C2B achieves logarithmic regret and constraint violation when the system's convergence rate is linear or superlinear. Furthermore, we generalize OP-C2B to infinite action sets and show that it maintains sublinear regret and constraint violation.
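The abstract does not spell out the selection rule, but the optimistic-pessimistic confidence-bound idea it names has a standard shape: be optimistic (upper confidence bound) about rewards and pessimistic (upper confidence bound) about constraint costs, then pick the best action from the pessimistically feasible set. The sketch below illustrates only that generic principle for a finite action set; all names, the Hoeffding-style radius, and the fallback rule are assumptions, and the paper's actual OP-C2B additionally tracks convergence toward per-action stable states, which is omitted here.

```python
import numpy as np

def select_action(counts, reward_sums, cost_sums, t, cost_budget):
    """Hypothetical optimistic-pessimistic rule (not the paper's OP-C2B):
    choose the highest reward UCB among actions whose pessimistic (upper)
    cost estimate stays within the constraint budget."""
    # Play each action once before trusting any confidence bound.
    unplayed = np.where(counts == 0)[0]
    if unplayed.size:
        return int(unplayed[0])

    n = counts.astype(float)
    bonus = np.sqrt(2.0 * np.log(max(t, 2)) / n)   # Hoeffding-style radius (assumed)
    reward_ucb = reward_sums / n + bonus            # optimism for reward
    cost_ucb = cost_sums / n + bonus                # pessimism for constraint cost

    feasible = np.where(cost_ucb <= cost_budget)[0]
    # If the pessimistic feasible set is empty, fall back to all actions
    # so the learner keeps gathering information.
    pool = feasible if feasible.size else np.arange(len(counts))
    return int(pool[np.argmax(reward_ucb[pool])])
```

After each round, the caller would update `counts`, `reward_sums`, and `cost_sums` for the played action with the observed bandit feedback; the pessimistic feasibility test is what keeps cumulative constraint violation controlled while the reward UCB drives regret down.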