Abstract

Imbalanced datasets are a common challenge in classification tasks, especially in the manufacturing industry. Skewed class distributions degrade the performance of traditional machine learning algorithms. In addition, most collected datasets contain noise, which makes analysis even harder; this noise may take the form of missing data or irrelevant variables. Handling such noisy datasets remains an important step in data analysis. For these two reasons, we propose a Gradient Deep Learning Boosting (GDLB) model for classification on imbalanced datasets containing noise. To handle the noise, we use an imputation transformer for the missing data and a random forest for feature selection. We evaluate the proposed method on two benchmark datasets, SECOM and DAIWM, both of which are imbalanced and noisy. On the SECOM dataset, the proposed method achieves an accuracy, recall, Matthews correlation coefficient, and area under the curve of 0.87, 0.70, 0.32, and 0.79, respectively; on the DAIWM dataset, it achieves 0.91, 0.83, 0.56, and 0.87, respectively. We find that combining the proposed Gradient Deep Learning Boosting with noise handling is a promising approach for imbalanced datasets.
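The pipeline the abstract describes (imputation, random-forest feature selection, then a boosted classifier) can be sketched with standard library components. This is a minimal illustration, not the authors' implementation: scikit-learn's `SimpleImputer` stands in for the imputation transformer, `SelectFromModel` with a `RandomForestClassifier` for the feature selection step, and a plain `GradientBoostingClassifier` as a placeholder for the GDLB model, whose architecture is not detailed in the abstract. The synthetic imbalanced data with injected missing values is likewise a stand-in for SECOM/DAIWM.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.metrics import matthews_corrcoef, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic imbalanced dataset with irrelevant features (stand-in for SECOM/DAIWM).
X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # inject ~5% missing values as "noise"

pipe = Pipeline([
    # Step 1: impute missing values (abstract's "Imputation transformer").
    ("impute", SimpleImputer(strategy="mean")),
    # Step 2: random-forest-based feature selection to drop irrelevant variables.
    ("select", SelectFromModel(RandomForestClassifier(n_estimators=100,
                                                      random_state=0))),
    # Step 3: gradient boosting classifier as a placeholder for GDLB.
    ("clf", GradientBoostingClassifier(random_state=0)),
])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
pipe.fit(X_tr, y_tr)
pred = pipe.predict(X_te)
proba = pipe.predict_proba(X_te)[:, 1]
print("MCC:", round(matthews_corrcoef(y_te, pred), 3))
print("AUC:", round(roc_auc_score(y_te, proba), 3))
```

MCC and AUC are reported alongside accuracy and recall because, on imbalanced data, accuracy alone can look high for a classifier that ignores the minority class.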
