Abstract

In recent years, researchers have had great success in automatically classifying and detecting malware with machine learning methods. However, most machine learning based approaches rely too heavily on their training samples, so a new malware family that does not belong to the training set cannot be identified. To address this issue, we propose the soft relevance value (s-value), a new way of evaluating feature soft relevance that uses the mixed distance criterion to assess classification results. Specifically, we leverage the mixed distance criterion from pattern recognition to identify testing samples as belonging to a new family that is not labeled in the training set. Finally, we evaluate how the s-value can be used to distinguish and classify a new malware family, using the malware datasets from the Research Prediction Competition of the Microsoft Malware Classification Challenge and Windows (Kaggle). The experimental results show that the training time is approximately 12 hours, while the prediction time is only ∼0.5 seconds. Compared with the Kaggle winner, our training and prediction times are only 16.7% and 3.8% of the winner's, respectively. The accuracy of malware classification reaches 99.8%. These results indicate that the proposed s-value achieves a balance among accuracy, training time, and prediction time, and outperforms state-of-the-art machine learning based malware detection approaches. In addition, our method is able to identify new malware families that are not included in the training set.
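
The general idea of using a distance criterion to flag samples from an unseen family can be illustrated with a minimal sketch. The abstract defines neither the s-value nor the mixed distance criterion, so this example is not the proposed method: it substitutes plain Euclidean distance to per-family centroids and a hypothetical rejection threshold. The function names, family labels, feature values, and threshold are all assumptions made for illustration.

```python
import math


def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_centroids(samples):
    """samples: dict mapping a family label to its list of feature vectors."""
    return {label: centroid(rows) for label, rows in samples.items()}


def classify_or_reject(x, centroids, threshold):
    """Return the nearest family label, or None when even the nearest
    centroid is farther than `threshold` (treated as a possible new family).

    Euclidean distance is a stand-in here; the paper's mixed distance
    criterion is not specified in the abstract.
    """
    label, dist = min(
        ((lab, math.dist(x, c)) for lab, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return label if dist <= threshold else None


# Toy 2-D features; labels and values are made up for illustration only.
train = {
    "family_a": [[0.0, 0.0], [0.2, 0.1], [0.1, 0.2]],
    "family_b": [[5.0, 5.0], [5.1, 4.9], [4.9, 5.2]],
}
centroids = fit_centroids(train)

known = classify_or_reject([0.1, 0.0], centroids, threshold=1.0)
novel = classify_or_reject([10.0, -3.0], centroids, threshold=1.0)
```

In this toy setting, a sample close to a known centroid receives that family's label, while a sample far from every centroid is returned as None, meaning it is not forced into a known family. Choosing the threshold is the hard part in practice, and the abstract does not say how the s-value handles it.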

