Gaussian distribution resampling via Chebyshev distance for food computing

Tianle Li, Enguang Zuo, Cheng Chen, Chen Chen, Jie Zhong, Junyi Yan, Xiaoyi Lv

doi:10.1016/j.asoc.2023.111103

Abstract

Data imbalance is common in real-world food-domain datasets. Traditional classification algorithms are prone to overfitting on imbalanced datasets, and their decision surface is biased toward majority-class samples, which makes minority-class samples difficult to identify. Existing resampling techniques address the problem by balancing the dataset, but they may produce class overlap because anchor samples are not selected appropriately and the generated data do not conform to the original distribution. This paper proposes GDRS, an adaptive resampling technique that combines Gaussian distribution oversampling with random undersampling, to address these problems. The technique uses the Chebyshev distance together with weight information of the minority-class samples to select suitable anchor samples. New samples conforming to the original distribution are then generated from a Gaussian distribution around each anchor sample, and random undersampling is applied to reduce the risk of overfitting. The technique is applied to five UCI datasets and compared with seven imbalanced learning methods. The experimental results show that GDRS achieves the best performance among the compared methods. We also validate the effectiveness of the method on real-world dairy datasets with different imbalance ratios, which suggests it is promising for application in the food field.
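The three steps named in the abstract (weighted anchor selection under the Chebyshev distance, Gaussian generation around the anchors, random undersampling of the majority class) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the weighting scheme, the Gaussian variance, or the target class sizes, so the function name `gdrs_sketch`, the neighbour-based weight, the parameters `k` and `sigma_scale`, and the "meet in the middle" target size are all assumptions made for illustration.

```python
import numpy as np


def chebyshev(a, b):
    # Pairwise Chebyshev (L-infinity) distance between rows of a and rows of b:
    # the largest absolute difference over all coordinates.
    return np.abs(a[:, None, :] - b[None, :, :]).max(axis=2)


def gdrs_sketch(X, y, minority=1, k=5, sigma_scale=0.5, seed=0):
    """Illustrative resampler; every design choice below is an assumption."""
    rng = np.random.default_rng(seed)
    min_mask = y == minority
    X_min, X_maj, y_maj = X[min_mask], X[~min_mask], y[~min_mask]
    n_min, n_maj = len(X_min), len(X_maj)
    # Assumption: both classes are resized to half of the original total.
    target = (n_min + n_maj) // 2

    # Assumed weight: fraction of minority samples among the k nearest
    # neighbours (Chebyshev), so anchors in overlap regions are picked less.
    dist = chebyshev(X_min, X)
    dist[np.arange(n_min), np.flatnonzero(min_mask)] = np.inf  # skip self
    nn = np.argsort(dist, axis=1)[:, :k]
    weights = min_mask[nn].mean(axis=1) + 1e-6
    probs = weights / weights.sum()

    # Gaussian oversampling: draw anchors by weight, then perturb each one
    # with zero-mean noise scaled by the per-feature minority std (assumed).
    n_new = max(target - n_min, 0)
    anchors = X_min[rng.choice(n_min, size=n_new, p=probs)]
    sigma = sigma_scale * X_min.std(axis=0)
    X_new = anchors + rng.normal(size=anchors.shape) * sigma

    # Random undersampling of the majority class, without replacement.
    keep = rng.choice(n_maj, size=min(target, n_maj), replace=False)

    X_out = np.vstack([X_min, X_new, X_maj[keep]])
    y_out = np.concatenate([np.full(n_min + n_new, minority), y_maj[keep]])
    return X_out, y_out
```

Calling `gdrs_sketch(X, y)` on a binary dataset with 90 majority and 10 minority samples would, under these assumptions, return 50 samples of each class. The weighting favours anchors surrounded by other minority samples, which is one plausible way to limit class overlap; the paper's actual weight definition may differ.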

Full Text