Synthetic minority oversampling for function approximation problems

Lourdes Pelayo, Scott Dick

doi:10.1002/int.22120

Abstract

Imbalanced data sets are a common occurrence in important machine learning problems. Research on improving learning under imbalanced conditions has largely focused on classification problems (i.e., problems with a categorical dependent variable). However, imbalanced data also occur in function approximation, and far less attention has been paid to this case. We present a novel stratification approach for imbalanced function approximation problems. Our solution extends the SMOTE oversampling preprocessing technique to continuous-valued dependent variables by identifying regions of the feature space with a low density of examples and high variance in the dependent variable. Synthetic examples are then generated between nearest neighbors in these regions. In an empirical validation, our approach reduces the normalized mean-squared prediction error in 18 out of 21 benchmark data sets, and compares favorably with state-of-the-art approaches.
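The general idea in the abstract can be illustrated with a short sketch. This is not the authors' algorithm: the abstract does not say how density or variance is measured, how regions are selected, or how the dependent variable of a synthetic example is set. The sketch below assumes mean distance to the k nearest neighbors as a sparsity measure, the variance of the targets of those neighbors as a local variance measure, a simple rank-sum score to pick seed points, and linear interpolation of the target. The function name `smote_regression_sketch` and the parameters `k`, `top_frac`, and `n_new` are hypothetical.

```python
import numpy as np

def smote_regression_sketch(X, y, k=5, top_frac=0.2, n_new=100, seed=0):
    """SMOTE-style oversampling for a continuous target (illustrative only).

    Assumes len(X) > k. All scoring choices here are assumptions, not
    details taken from the paper.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(X)

    # Pairwise Euclidean distances; a point is never its own neighbor.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]  # indices of the k nearest neighbors

    # Assumed sparsity measure: mean distance to the k nearest neighbors.
    sparsity = np.take_along_axis(d, nn, axis=1).mean(axis=1)
    # Assumed local variance measure: variance of the neighbors' targets.
    local_var = y[nn].var(axis=1)

    # Assumed selection rule: sum of ranks, keep the highest-scoring points.
    score = sparsity.argsort().argsort() + local_var.argsort().argsort()
    n_seed = max(1, int(top_frac * n))
    seeds = np.argsort(score)[-n_seed:]

    # SMOTE-style interpolation between a seed and one of its neighbors.
    base = rng.choice(seeds, size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random(n_new)
    X_new = X[base] + gap[:, None] * (X[neigh] - X[base])
    # Assumption: the target is interpolated with the same coefficient.
    y_new = y[base] + gap * (y[neigh] - y[base])
    return X_new, y_new
```

The returned `X_new` and `y_new` would be appended to the training set before fitting a regressor. Because each synthetic example lies on the segment between two existing examples, its features and target stay within the range of the original data.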

Full Text