ABSTRACT
This paper examines the impact of imbalanced datasets on machine learning (ML) models for malware classification, and whether a disproportionate distribution of malware families affects a model's ability to learn from minority classes. Classification results from ML models trained on an imbalanced dataset were compared against those from models trained on a dataset balanced using the Synthetic Minority Oversampling Technique (SMOTE) and Tomek Links. Four ML models were used: random forest (RF), support vector machine (SVM), decision tree (DT), and k-nearest neighbours (KNN). The models were evaluated on accuracy, precision, and recall. Using the balanced dataset improved precision and recall for minority malware families, with SMOTE and Tomek Links yielding measurable performance gains across most classifiers. For example, RF accuracy for detecting Trojans rose from 95.4% to 97.6%, demonstrating the benefit of removing noisy samples to refine decision boundaries. Although SVM accuracy declined from 93.55% to 84.63%, the improved precision and recall of the other classifiers indicate that the balancing techniques enhanced the models' ability to classify minority samples, reducing the misleadingly high accuracies caused by class imbalance.
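To make the two resampling steps concrete, the following is a minimal NumPy sketch of SMOTE-style interpolation and Tomek Link removal on toy two-class data. The helper names, parameters, and synthetic data are illustrative assumptions, not the paper's implementation (which would typically use a library such as imbalanced-learn).

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    each chosen sample and one of its k nearest minority-class neighbours."""
    rng = rng if rng is not None else np.random.default_rng(0)
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per sample
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))           # random minority sample
        j = nn[i, rng.integers(min(k, len(X_min) - 1))]
        synth.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(synth)

def tomek_links(X, y):
    """Return indices of majority-class samples that form Tomek links,
    i.e. mutual nearest neighbours belonging to opposite classes."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)
    maj = np.bincount(y).argmax()
    return sorted(i for i in range(len(X))
                  if nn[nn[i]] == i and y[i] != y[nn[i]] and y[i] == maj)

# Toy imbalanced dataset: 50 majority vs 5 minority samples (illustrative).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (5, 2))])
y = np.array([0] * 50 + [1] * 5)

# Oversample the minority class up to the majority count, then clean
# the boundary by dropping majority members of Tomek links.
new = smote(X[y == 1], 45, k=3, rng=rng)
Xb = np.vstack([X, new])
yb = np.concatenate([y, np.ones(45, dtype=int)])
keep = np.setdiff1d(np.arange(len(Xb)), tomek_links(Xb, yb))
Xb, yb = Xb[keep], yb[keep]
```

After resampling, both classes are near parity, and the Tomek-Link pass removes only borderline majority samples, which is the boundary-refinement effect the abstract credits for the RF improvement.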