Abstract

Advances in digital technology have caused the size of collected data to increase at an accelerating rate. This growth brings new problems, such as unbalanced data. In an unbalanced dataset, the classes are not equally distributed. Classification of such data suffers performance losses because classification algorithms assume that datasets are balanced. The classifier favors the majority class, and the minority class is often misclassified. Most collected datasets, especially medical datasets, have an unbalanced distribution. Various studies have been performed in recent years to reduce this imbalance. In general terms, these studies apply undersampling, oversampling, or both to balance the datasets. In this study, an oversampling method is proposed that uses distance-based and mean-based resampling to produce synthetic samples. For the resampling process, the Euclidean distances between pairs of samples in the minority class are calculated. The calculated distances are evaluated in the sense of DBSCAN to obtain a sufficient number of pairs. New synthetic samples are then formed between the listed pairs by using the Weighted Arithmetic Mean. As a result, the dataset was approximately balanced at 500 majority samples and 535 minority samples (from 268 original minority samples). The Random Forest (RF) and Support Vector Machine (SVM) algorithms were used to classify the raw and balanced datasets, and the results were compared with each other and with other well-known methods such as Random Over Sampling (ROS), Random Under Sampling (RUS), and Synthetic Minority Oversampling Technique (SMOTE). The results showed that the proposed method has the best performance among all the listed methods. The accuracy of RF is 0.751 for the raw data and 0.798 for the resampled data. Likewise, the accuracy of SVM is 0.762 for the raw data and 0.781 for the resampled data.
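
The resampling idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the DBSCAN-style criterion selects pairs or how the weights of the Weighted Arithmetic Mean are chosen. The sketch assumes that a pair qualifies when its Euclidean distance is at most a radius `eps`, and that the weight `w` is drawn uniformly from [0, 1]. The function names, `eps`, `n_new`, and `seed` are all hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def neighbor_pairs(minority, eps):
    """Return index pairs (i, j), i < j, whose distance is at most eps.

    Assumption: this eps-neighborhood test stands in for the
    DBSCAN-style pair selection, which the abstract does not detail.
    """
    pairs = []
    for i in range(len(minority)):
        for j in range(i + 1, len(minority)):
            if euclidean(minority[i], minority[j]) <= eps:
                pairs.append((i, j))
    return pairs


def weighted_mean(a, b, w):
    """Weighted arithmetic mean of two points with weights w and 1 - w."""
    return [w * x + (1.0 - w) * y for x, y in zip(a, b)]


def oversample(minority, eps, n_new, seed=0):
    """Create n_new synthetic points between eps-neighbor pairs."""
    rng = random.Random(seed)
    pairs = neighbor_pairs(minority, eps)
    if not pairs:
        raise ValueError("no pairs within eps; try a larger eps")
    synthetic = []
    for _ in range(n_new):
        i, j = rng.choice(pairs)
        w = rng.uniform(0.0, 1.0)  # assumed weighting scheme
        synthetic.append(weighted_mean(minority[i], minority[j], w))
    return synthetic


if __name__ == "__main__":
    # Toy minority class: three nearby points and one distant outlier.
    toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
    for point in oversample(toy_minority, eps=1.5, n_new=5):
        print(point)
```

Because each synthetic point lies on the segment between two nearby minority samples, the distant outlier in the toy data never contributes to a new sample under this choice of `eps`.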

