Hybrid resampling and weighted majority voting for multi-class anomaly detection on imbalanced malware and network traffic data

Liang Xue, Tianqing Zhu

doi:10.1016/j.engappai.2023.107568

Abstract

In a large skewed dataset, the data imbalance is severe and the classifier's accuracy is biased towards the majority class. Insufficient data makes it challenging for the classifier to learn the features of the minority classes. Moreover, some existing binary classification techniques cannot be applied directly to multi-class classification. Plain oversampling may generate redundant and irrelevant data for the minority classes, while undersampling may discard too much information from the majority class. This paper proposes a combination of constrained undersampling, oversampling, noise cleaning, and a weighted majority voting classifier (WMVC) to detect obfuscated malware in a multi-class setting via memory and network traffic anomalies. As a constraint on undersampling and oversampling, the proposed framework divides the total number of samples by the number of classes to obtain the average class size. Under this constraint, one or more majority classes are down-sampled using Random Undersampling, while all the minority classes are up-sampled using the Adaptive Synthetic Sampling Approach (ADASYN), followed by Tomek Links to remove noisy data. A weighted majority voting classifier that aggregates tree-based ensemble algorithms is then designed and compared with XGBoost and Convolutional Neural Network (CNN) classifiers and six state-of-the-art ensemble algorithms. The comparison results show that classifiers perform better on balanced data than on imbalanced data, and that the WMVC outperforms XGBoost, CNN, and the six other ensemble algorithms. Our approach can alleviate the classifier's bias towards the majority class while improving its performance on the difficult minority class.
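The average-class-size constraint and the weighted vote described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The function names, the class labels, the integer division used for the average, and the voter weights are all assumptions made for illustration; the abstract does not say how the weights are derived. ADASYN and Tomek Link cleaning are not implemented here: the plan only reports how many synthetic samples each minority class would need.

```python
import random
from collections import Counter, defaultdict


def resampling_plan(labels):
    """Compute the average class size and a per-class resampling plan.

    Assumption: the average is rounded down with integer division.
    """
    counts = Counter(labels)
    target = len(labels) // len(counts)  # total samples / number of classes
    plan = {}
    for cls, n in counts.items():
        if n > target:
            plan[cls] = ("undersample", n - target)  # samples to drop
        elif n < target:
            plan[cls] = ("oversample", target - n)   # synthetic samples needed
        else:
            plan[cls] = ("keep", 0)
    return target, plan


def random_undersample(samples, labels, target, seed=0):
    """Randomly down-sample every class larger than `target` to `target`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    keep = []
    for idx in by_class.values():
        if len(idx) > target:
            idx = rng.sample(idx, target)  # sample without replacement
        keep.extend(idx)
    keep.sort()  # preserve the original ordering of the retained samples
    return [samples[i] for i in keep], [labels[i] for i in keep]


def weighted_majority_vote(predictions, weights):
    """Return the label whose voters have the largest total weight."""
    scores = defaultdict(float)
    for pred, w in zip(predictions, weights):
        scores[pred] += w
    return max(scores, key=scores.get)


# Hypothetical toy data: 6 benign, 2 spyware, 1 trojan samples.
labels = ["benign"] * 6 + ["spyware"] * 2 + ["trojan"]
samples = list(range(len(labels)))

target, plan = resampling_plan(labels)
X_down, y_down = random_undersample(samples, labels, target)

# Three hypothetical base classifiers; the heavily weighted one overrides
# the two lightly weighted ones that agree with each other.
vote = weighted_majority_vote(["trojan", "benign", "trojan"], [0.2, 0.9, 0.3])
```

In this toy run the average class size is 3, so the benign class is cut to 3 samples while the spyware and trojan classes would need 1 and 2 synthetic samples respectively. In practice the oversampling and cleaning steps would typically be delegated to an existing library rather than written by hand.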

Full Text