Machine learning is widely recognized as a promising data-driven modeling technique for the model-based control and optimization of building energy systems. However, data-driven models often generalize poorly, because the training data available from building operations usually cover only a limited range of working conditions. Active learning can proactively test unseen, informative working conditions and add the resulting data samples to the training set, thereby improving the generalization performance of data-driven models. A novel distance- and information-density-based sampling strategy is developed that accounts for the real-time status of building operation and the outdoor environment. Based on the Mahalanobis distance, this strategy determines the sampling value of an unlabeled sample (an unseen working condition) by assessing its similarity to both the training samples and the other unlabeled samples. Because collecting sufficiently representative samples can be difficult, costly, and time-consuming, a distance-based sampling cost metric is proposed to compare the efficiency of different sampling methods, taking into account the detrimental effects of the active sampling process on the normal operation of building energy systems. This paper presents a comprehensive, in-depth comparison of five active learning methods, including one that incorporates the distance-based sampling strategy, through experiments on data collected from the cooling towers of a real high-rise building. The results show that active learning can effectively identify informative data samples and improve the generalization performance of data-driven models. The research outcomes are valuable for enhancing AI-enabled data-driven modeling of building energy systems while substantially reducing data sampling costs.
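The abstract does not give the scoring formula, so the following is only a minimal sketch of how a distance- and density-based sampling value might be computed with the Mahalanobis distance. The function names (`mahalanobis_matrix`, `sampling_scores`), the pooled covariance estimate, and the product form `novelty * density` are all assumptions made for illustration, not the authors' method. The sketch treats a candidate as valuable when it lies far from every training sample (novel) and close to many other unlabeled candidates (representative).

```python
import numpy as np


def mahalanobis_matrix(a, b, cov_inv):
    """Pairwise Mahalanobis distances between the rows of a and the rows of b."""
    diff = a[:, None, :] - b[None, :, :]  # shape (len(a), len(b), n_features)
    sq = np.einsum("ijk,kl,ijl->ij", diff, cov_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from round-off


def sampling_scores(train, pool):
    """Hypothetical sampling value for each unlabeled candidate in pool.

    Assumed form (not taken from the paper): novelty * density, where
    novelty is the distance to the nearest training sample and density
    decreases with the mean distance to the other candidates.
    Requires at least two candidates in pool.
    """
    # Assumption: one covariance matrix estimated from all available samples.
    cov = np.atleast_2d(np.cov(np.vstack([train, pool]), rowvar=False))
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance

    novelty = mahalanobis_matrix(pool, train, cov_inv).min(axis=1)

    pool_dist = mahalanobis_matrix(pool, pool, cov_inv)
    # Mean distance to the other candidates (the zero self-distance is excluded).
    mean_to_others = pool_dist.sum(axis=1) / (len(pool) - 1)
    density = 1.0 / (1.0 + mean_to_others)

    return novelty * density


# Usage with synthetic data: rows are working conditions, columns are features.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(50, 2))  # conditions already observed
pool = rng.normal(0.0, 3.0, size=(20, 2))   # candidate unseen conditions
scores = sampling_scores(train, pool)
next_index = int(np.argmax(scores))         # candidate to test next
```

A candidate that coincides with a training sample receives a score of zero under this assumed form, which matches the intuition that re-testing a known working condition adds no information. A sampling cost term, such as the distance from the current operating state to the candidate, could be combined with this score, but its definition would likewise be an assumption here.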