Ensemble Methods for Malware Diagnosis Based on One-class SVMs

Xing An

doi:10.31390/gradschool_theses.2294

Abstract

Malware diagnosis is one of today’s most popular topics in machine learning. Studies of this kind typically apply all the classical classification algorithms to the problem and report the highest accuracy as the prediction result. Instead, we focus on the Support Vector Machine (SVM) classifier. Based on our observations of learning principles, statistical characteristics, and the behavior of SVMs, we employ several preprocessing and ensemble methods, including rescaling, bagging, and clustering, that may enhance the performance of the classical algorithm. We implement rescaling by iteratively magnifying the attributes used by the support vectors of the SVM and eliminating the unused attributes from the training examples until a maximum accuracy is reached. Our study of bagging and clustering focuses on the situation where only examples of malware are available and a one-class SVM is used. For both methods, a group of models is built, each from part of the training data, instead of a single model built from the whole training set. We also compare two possible approaches for coordinating the sub-models obtained in training, namely voting and "one positive to be positive". Experimental results show that, when used together with an appropriate coordination method, ensemble methods can effectively reduce both the cases where malware is labeled as clean and the cases where clean software is classified as malware, known in our context as false-negative and false-positive errors, respectively.
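
The two coordination approaches named in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis. It assumes each sub-model outputs +1 for "malware" and -1 for "clean" (the usual inlier/outlier convention for a one-class SVM trained only on malware), that voting means a strict majority, and that "one positive to be positive" means a sample is labeled malware as soon as any sub-model says so. The function names `majority_vote` and `one_positive` are hypothetical.

```python
from typing import Sequence


def majority_vote(predictions: Sequence[int]) -> int:
    """Return +1 (malware) if strictly more than half of the sub-models
    predict +1, otherwise -1 (clean). Ties count as clean (an assumption)."""
    positives = sum(1 for p in predictions if p == 1)
    return 1 if positives * 2 > len(predictions) else -1


def one_positive(predictions: Sequence[int]) -> int:
    """Return +1 (malware) if at least one sub-model predicts +1,
    otherwise -1 (clean)."""
    return 1 if any(p == 1 for p in predictions) else -1


# Hypothetical outputs of three sub-models for a single sample.
sub_model_outputs = [1, -1, -1]
print(majority_vote(sub_model_outputs))  # -1: only one of three votes malware
print(one_positive(sub_model_outputs))   # 1: a single positive is enough
```

Under these assumptions, the "one positive" rule can only flag as many or more samples as malware than majority voting does, so it would tend to trade fewer false negatives for more false positives. Which rule suits which ensemble method is an empirical question that the thesis addresses through its experiments.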

Full Text