Machine learning based mobile malware detection using highly imbalanced network traffic

Zhenxiang Chen, Qiben Yan, Hongbo Han, Shanshan Wang, Lizhi Peng, Lin Wang, Bo Yang

doi:10.1016/j.ins.2017.04.044

Open Access

Journal: Information Sciences	Publication Date: Apr 29, 2017
Citations: 129	License type: publisher-specific-oa

Affiliation: University of Jinan, University of Nebraska–Lincoln

Abstract

In recent years, the number and variety of malicious mobile apps have increased drastically, especially on the Android platform, which poses formidable challenges for malicious app detection. Researchers have sought to discover the traces of malicious apps through network traffic analysis. In this study, we combine network traffic analysis with machine learning methods to identify malicious network behavior and, ultimately, to detect malicious apps. However, most network traffic generated by malicious apps is benign, and only a small portion is malicious. This leads to an imbalanced data problem in which the traffic model skews towards modeling the benign traffic. To address this problem, we introduce imbalanced classification methods, including the synthetic minority oversampling technique (SMOTE) combined with a support vector machine (SVM), cost-sensitive SVM (SVMCS), and cost-sensitive C4.5 (C4.5CS). However, when the imbalance rate reaches a certain threshold, the performance of common imbalanced classification algorithms degrades significantly. To avoid this degradation, we propose using the imbalanced data gravitation-based classification (IDGC) algorithm to classify imbalanced data. Moreover, we develop a simplex imbalanced data gravitation classification (S-IDGC) model to further reduce the time cost of IDGC without sacrificing classification performance. In addition, we propose a machine-learning-based comparative benchmark prototype system that gives users substantial autonomy, such as a choice of classifiers and traffic features. Using this prototype system, users can compare the detection performance of different classification algorithms on the same data set, as well as the performance of a specific classification algorithm on multiple data sets.
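
To make the oversampling idea mentioned in the abstract concrete, the following is a minimal sketch of SMOTE-style interpolation in plain Python. It illustrates the general technique only and is not the authors' implementation: the function name `smote`, its parameters, and the two-dimensional toy feature vectors are all assumptions made for illustration, since the abstract does not specify which traffic features are used.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority-class
    samples and their k nearest minority-class neighbours (SMOTE-style)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest neighbours among the other minority samples.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment
        # between the base sample and the chosen neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy feature vectors for the rare (malicious) class.
malicious = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
new_points = smote(malicious, n_synthetic=6)
```

The synthetic points would then be added to the minority class before training a classifier such as an SVM. Because each new point lies on a segment between two existing minority samples, the oversampled class stays within the region the original samples already cover.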

Full Text