Abstract

Malware detection and classification play a critical role in computer and network security. Although many machine learning models have been used to detect malicious binaries, the performance of ensemble methods has not been investigated extensively. Moreover, the massive volume of malware has turned detection into a big data problem, forcing security researchers and practitioners to deploy big data technologies to manage, store, analyze, and visualize malware data. In this paper, the authors design two methods based on ensemble learning and big data to improve the performance of malware detection at large scale. The first method is based on the weighted voting strategy of ensemble learning, and the second selects an optimal set of base classifiers for stacking. The proposed methods are implemented using Apache Spark, a popular big data processing framework, and their performance is tested and evaluated on a dataset of 198,350 Windows files comprising 100,200 malicious and 98,150 benign samples. The experimental results validate the effectiveness of the proposed methods, which improve generalization performance in detecting new malware.
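
As an illustration of the weighted voting idea mentioned above, the following is a minimal sketch in plain Python rather than the authors' Spark implementation. The abstract does not state how the weights are derived, so the sketch assumes each base classifier is weighted by a validation accuracy; the function names, the 0.5 threshold, and the sample numbers are hypothetical.

```python
from typing import Sequence


def weighted_score(probas: Sequence[float], weights: Sequence[float]) -> float:
    """Return the weight-normalized average of per-classifier probabilities.

    probas[i]  : probability of "malicious" from base classifier i
    weights[i] : non-negative weight for classifier i (ASSUMPTION: e.g. its
                 validation accuracy; the paper's actual scheme may differ)
    """
    if not probas or len(probas) != len(weights):
        raise ValueError("probas and weights must be non-empty and equal in length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probas, weights)) / total


def weighted_vote(probas: Sequence[float], weights: Sequence[float],
                  threshold: float = 0.5) -> int:
    """Return 1 (malicious) if the weighted score reaches the threshold, else 0."""
    return 1 if weighted_score(probas, weights) >= threshold else 0


# Hypothetical example: one strong classifier disagrees with two weaker ones.
probas = [0.9, 0.2, 0.3]      # per-classifier P(malicious)
weights = [0.95, 0.60, 0.60]  # assumed validation accuracies
label_weighted = weighted_vote(probas, weights)
label_unweighted = weighted_vote(probas, [1.0, 1.0, 1.0])
```

With these made-up numbers the weighted vote labels the file malicious, while an unweighted average would label it benign, which shows how weighting lets a more reliable base classifier outweigh less reliable ones.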
