Ensemble Learning Methods for Dirty Data: A Keynote at CIKM 2022

Ling Liu

doi:10.1145/3636341.3636346

Abstract

A neural network ensemble is a collaborative learning paradigm that uses multiple neural networks to solve a complex learning problem. Constructing predictive models with high generalization performance is an important yet highly challenging goal for robust AI systems in the presence of dirty data. Given a target learning task, popular approaches have focused on designing and finding the top-performing model. However, it is generally difficult to identify the best model when the available data is finite, possibly dirty, or insufficient for the problem. The problem of dirty data in machine learning (ML) can be characterized by out-of-distribution data and by digital or physical deception of data. Such dirty data may cause unintended or harmful behavior in well-trained ML models. In this paper, a curated version of my keynote at ACM CIKM 2022, I first give a brief overview of ensemble learning methodology. I then review different types of dirty data that could deceive well-trained ML models. Finally, I describe a focal diversity optimized ensemble learning framework, developed at Georgia Tech, for measuring, enforcing, and combining multiple neural networks, delivering high generalization performance of the ensemble learner while maximizing ensemble utility and resilience to dirty data.

Date: 17 October 2022.

Full Text