We propose a minimax concave penalized multiarmed bandit algorithm under the generalized linear model (G-MCP-Bandit) for decision-makers facing high-dimensional data in an online learning and decision-making environment. We demonstrate that in the data-rich regime, the G-MCP-Bandit algorithm attains cumulative regret that is optimal in the sample size dimension and tight in both the covariate dimension and the significant-covariate dimension. In the data-poor regime, the G-MCP-Bandit algorithm maintains a tight regret upper bound. In addition, we develop a local linear approximation method, the two-step weighted Lasso procedure, to identify the minimax concave penalty (MCP) estimator for the G-MCP-Bandit algorithm when samples are not independent and identically distributed. Under this procedure, the MCP estimator matches the oracle estimator with high probability and converges to the true parameters at the optimal convergence rate. Finally, through experiments on both synthetic and real data sets, we show that the G-MCP-Bandit algorithm outperforms other benchmark algorithms in terms of cumulative regret and that the benefits of the G-MCP-Bandit algorithm increase with the data’s sparsity level and the size of the decision set.

This paper was accepted by J. George Shanthikumar, big data analytics.

Funding: This work was supported by the National Natural Science Foundation of China (NSFC) [Grants W2441021, 72371172, 72342023, and 71929101] and the National Science Foundation (NSF) [Grant DMS 1820702].

Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2022.01557.
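To make the two-step weighted Lasso procedure concrete, here is a minimal sketch for the i.i.d. linear-model special case: step 1 fits a plain Lasso, and step 2 refits a weighted Lasso whose per-coordinate weights are the MCP derivative p'_λ(t) = (λ − t/a)_+ evaluated at the step 1 estimates (the local linear approximation). The coordinate-descent solver, the function names, and the default a = 3 are illustrative assumptions rather than the authors' implementation; in the bandit setting of the paper, the squared loss would be replaced by the GLM negative log-likelihood and samples are not i.i.d.

```python
import numpy as np

def mcp_derivative(t, lam, a=3.0):
    """Derivative of the MCP penalty: p'_lam(t) = (lam - t/a)_+ for t >= 0."""
    return np.maximum(lam - t / a, 0.0)

def weighted_lasso(X, y, weights, n_iter=200, tol=1e-6):
    """Coordinate descent for (1/2n)||y - X b||^2 + sum_j w_j |b_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n          # per-coordinate curvature x_j'x_j / n
    resid = y.copy()                           # running residual y - X beta
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # correlation of x_j with the partial residual (beta_j held out)
            z = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # soft-threshold at the coordinate's own weight w_j
            new_bj = np.sign(z) * max(abs(z) - weights[j], 0.0) / col_sq[j]
            if new_bj != beta[j]:
                resid -= X[:, j] * (new_bj - beta[j])
                max_change = max(max_change, abs(new_bj - beta[j]))
                beta[j] = new_bj
        if max_change < tol:
            break
    return beta

def two_step_weighted_lasso(X, y, lam, a=3.0):
    """Step 1: plain Lasso (uniform weight lam). Step 2: weighted Lasso with
    weights p'_lam(|beta_step1|), the local linear approximation of MCP."""
    p = X.shape[1]
    beta_init = weighted_lasso(X, y, np.full(p, lam))
    weights = mcp_derivative(np.abs(beta_init), lam, a)
    return weighted_lasso(X, y, weights)

if __name__ == "__main__":
    # Small synthetic check: sparse truth, more covariates than signals.
    rng = np.random.default_rng(0)
    n, p, s = 200, 50, 4
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:s] = [2.0, -1.5, 1.0, 0.5]
    y = X @ beta_true + 0.5 * rng.standard_normal(n)
    beta_hat = two_step_weighted_lasso(X, y, lam=0.1)
    print("recovered support:", np.flatnonzero(np.abs(beta_hat) > 1e-8))
```

Note the design intent of step 2: coordinates whose step 1 estimates already exceed aλ receive zero weight and are left unpenalized, which is how the MCP estimator can shed the Lasso's shrinkage bias and match the oracle estimator.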