This study presents a novel multi-step lookahead Bayesian optimization method that strives for optimal active learning by balancing exploration and exploitation over multiple future sampling-and-evaluation trials. The approach adopts a Gaussian process (GP) model to represent the underlying function, and the model is updated after each sampling and evaluation. Proximal Policy Optimization (PPO), a reinforcement learning method, is then used to locate the next optimal sampling point while accounting for multiple such future trials, using the current GP model as a fictitious environment. The approach is applied to batch-to-batch (B2B) optimization, where an optimal batch recipe is searched for without any process knowledge. The B2B optimization is formulated as a partially observable Markov decision process (POMDP) problem, and GP model learning and policy learning through PPO are performed iteratively to suggest the next batch recipe. The effectiveness of the approach in the B2B optimization problem is demonstrated through two case studies.
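As a rough illustration of the iterative loop the abstract describes, the sketch below alternates GP surrogate fitting with a multi-step lookahead over GP-simulated future batches. A simple Monte Carlo rollout with greedy follow-up recipes stands in for the PPO policy, and the test function, kernel, horizon, and all names are hypothetical placeholders rather than the paper's implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Hypothetical unknown batch process: maps a 1-D recipe to a yield.
def run_batch(recipe):
    return float(-(recipe - 0.3) ** 2 + 0.05 * np.sin(20 * recipe))

rng = np.random.default_rng(0)
bounds = (0.0, 1.0)

# A few initial batches with random recipes (no prior process knowledge assumed).
X = rng.uniform(*bounds, size=(3, 1))
y = np.array([run_batch(x[0]) for x in X])

def make_gp():
    return GaussianProcessRegressor(kernel=ConstantKernel() * RBF(0.2), alpha=1e-4)

def lookahead_score(recipe, horizon=2, n_rollouts=8):
    """Score a candidate recipe by Monte Carlo rollouts of future batches
    simulated on the GP surrogate (the 'fictitious environment')."""
    scores = []
    for _ in range(n_rollouts):
        Xs, ys = X.copy(), y.copy()
        x_next = np.array([[recipe]])
        total = 0.0
        for _ in range(horizon):
            g = make_gp().fit(Xs, ys)
            y_sim = g.sample_y(x_next, random_state=int(rng.integers(10**6)))[0, 0]
            total += y_sim
            Xs = np.vstack([Xs, x_next])
            ys = np.append(ys, y_sim)
            # Greedy follow-up recipe on the updated surrogate (stand-in for the learned policy).
            cand = np.linspace(*bounds, 30).reshape(-1, 1)
            x_next = cand[[np.argmax(g.predict(cand))]]
        scores.append(total)
    return float(np.mean(scores))

# Batch-to-batch loop: refit the GP, pick the recipe with the best lookahead score, run it.
for batch in range(5):
    gp = make_gp().fit(X, y)
    candidates = np.linspace(*bounds, 15)
    best = max(candidates, key=lookahead_score)
    y_new = run_batch(best)
    X = np.vstack([X, [[best]]])
    y = np.append(y, y_new)
    print(f"batch {batch}: recipe={best:.3f}, yield={y_new:.4f}")
```

The key design choice sketched here is that each candidate recipe is judged not only by its immediate simulated outcome but by the cumulative return of several simulated future batches, which is the role the paper assigns to the PPO policy trained on the GP environment.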