Abstract

In many real-world settings, imbalanced data impedes the performance of learning algorithms such as neural networks, particularly for rare cases. This is especially problematic for tasks focusing on these rare occurrences. For example, when estimating precipitation, extreme rainfall events are scarce but important considering their potential consequences. While there are numerous well-studied solutions for classification settings, most of them cannot be applied to regression easily. Of the few solutions for regression tasks, barely any have explored cost-sensitive learning, which is known to have advantages compared to sampling-based methods in classification tasks. In this work, we propose a sample weighting approach for imbalanced regression datasets called DenseWeight and, based on this weighting scheme, a cost-sensitive learning approach for neural network regression with imbalanced data called DenseLoss. DenseWeight weights data points according to the rarity of their target values through kernel density estimation (KDE). DenseLoss adjusts each data point's influence on the loss according to DenseWeight, giving rare data points more influence on model training than common data points. We show on multiple differently distributed datasets that DenseLoss significantly improves model performance for rare data points through its density-based weighting scheme. Additionally, we compare DenseLoss to the state-of-the-art method SMOGN, finding that our method mostly yields better performance. Our approach provides more control over model training, as a single hyperparameter lets us actively decide on the trade-off between focusing on common or rare cases, allowing the training of better models for rare data points.
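The core idea described above, weighting samples inversely to the estimated density of their target values and using those weights in the loss, can be sketched as follows. This is a minimal illustration, not the authors' exact formulation: the function names, the min-max normalization of the density, the `alpha` strength parameter, and the mean-1 rescaling of the weights are assumptions made for this sketch.

```python
import numpy as np
from scipy.stats import gaussian_kde

def dense_weight(y, alpha=1.0, eps=1e-6):
    """Density-based sample weights: rare target values get larger weights.

    alpha controls how strongly density influences the weights
    (alpha = 0 reproduces uniform weighting); eps keeps weights positive.
    """
    dens = gaussian_kde(y)(y)                       # KDE of the target values
    dens = (dens - dens.min()) / (dens.max() - dens.min())  # scale to [0, 1]
    w = np.maximum(1.0 - alpha * dens, eps)         # rarer -> larger weight
    return w / w.mean()                             # mean weight 1 keeps loss scale comparable

def dense_loss(y_true, y_pred, w):
    """Weighted MSE: rare data points contribute more to the training signal."""
    return np.mean(w * (y_true - y_pred) ** 2)
```

In a neural-network setting the same per-sample weights would simply multiply the per-sample loss terms before averaging, so any standard optimizer can be used unchanged.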

Highlights

  • Many machine learning algorithms, like neural networks, typically expect roughly uniform target distributions (Cui et al 2019; Krawczyk 2016; Sun et al 2009)

  • Our contributions are as follows: (i) We propose DenseWeight, a sample weighting approach for regression with imbalanced data. (ii) We propose DenseLoss, a cost-sensitive learning approach based on DenseWeight for neural network regression models with imbalanced data. (iii) We analyze DenseLoss's influence on performance for common and rare data points using synthetic data. (iv) We compare DenseLoss to the state-of-the-art imbalanced regression method SMOGN, finding that our method typically provides better performance. (v) We apply DenseLoss to the heavily imbalanced

  • For the rarest bins, the results show that DenseLoss performs best on 8 datasets, while SMOGN performs best on only 3 datasets and applying no method is best for only 2 datasets.

Introduction

Many machine learning algorithms, like neural networks, typically expect roughly uniform target distributions (Cui et al 2019; Krawczyk 2016; Sun et al 2009). For regression, this means there should be a similar density of samples across the complete target value range. However, many datasets exhibit skewed target distributions, with target values in certain ranges occurring less frequently than others. Models can thus become biased, performing better for common cases than for rare ones (Cui et al 2019; Krawczyk 2016). This is problematic for tasks where these rare occurrences are of special interest. Examples include precipitation estimation, where extreme rainfall is rare but can have dramatic consequences, or fraud detection, where the rare fraudulent events are exactly what must be detected.
