Grouped-sampling technique to deal with unbalance in Raman spectral data modeling

Haitao Song, Hongyong Leng, Zhuoya Hou, Rui Gao, Cheng Chen, Chunzhi Meng, Jinshan Sun, Chenxi Li, Binlin Ma

doi:10.1016/j.pdpdt.2022.103059

Abstract

Due to limitations in disease prevalence and hospital specificity, spectral data are often collected with unbalanced sample sizes. To address this problem, this study proposes a new sampling method, grouped-sampling, and shows that it is effective for unbalanced data. It avoids the over-fitting associated with over-sampling and the under-utilization of data associated with under-sampling. We applied grouped-sampling to two unbalanced datasets with sample proportions of 199:40 and 75:225, and verified it with two classic models: PCA-SVM (Principal Component Analysis-Support Vector Machine) and the deep learning algorithm GoogLeNet. The accuracies on the two datasets were 85.11% and 96.15% with PCA-SVM, and 85.10% and 84.61% with GoogLeNet. The F1-score was also evaluated to measure how balanced the classification is under each sampling method, and the results show that the F1-score of grouped-sampling is always the highest compared with over-sampling and under-sampling. In summary, compared with traditional sampling methods, grouped-sampling gives better predictions for classes with smaller sample sizes, which improves the balance of the classification results and the potential for practical application. The grouped-sampling method developed here, which is distinct from both under-sampling and over-sampling, greatly improves the accuracy and balance of predictions for unbalanced samples.
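
The abstract does not spell out the grouping procedure, so the following is only a minimal sketch of one plausible reading: the majority class is split into disjoint groups of roughly the minority-class size, and each group is paired with all minority samples to form a balanced training subset. The function name `grouped_subsets`, the rounding rule for the number of groups, and the idea of training one model per subset are assumptions for illustration, not the authors' confirmed method.

```python
import numpy as np

def grouped_subsets(y, seed=0):
    """Return index arrays for balanced training subsets (hypothetical helper).

    Assumption: binary labels; the majority class is partitioned into
    k ~ n_major / n_minor disjoint groups, each combined with every
    minority-class sample.
    """
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")

    minor_pos = int(np.argmin(counts))
    minority, majority = classes[minor_pos], classes[1 - minor_pos]
    minor_idx = np.flatnonzero(y == minority)
    major_idx = np.flatnonzero(y == majority)

    # Number of groups, so each group is about as large as the minority class.
    k = max(1, int(round(len(major_idx) / len(minor_idx))))

    # Shuffle once, then split without replacement: no majority sample is
    # duplicated (unlike over-sampling) or discarded (unlike under-sampling).
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(major_idx)
    groups = np.array_split(shuffled, k)

    return [np.concatenate([g, minor_idx]) for g in groups]

# Example with the 199:40 proportion quoted in the abstract.
y = np.array([0] * 199 + [1] * 40)
subsets = grouped_subsets(y)
print([len(s) for s in subsets])
```

Under this reading, a classifier could be trained on each subset and the predictions combined (for example by majority vote), so that every collected spectrum contributes to training while each individual model still sees roughly balanced classes.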

Full Text