Abstract

Traditional machine learning (ML) based malware detectors depend on hand-crafted features that fail on recent malware. Deep learning (DL) based solutions address this issue but require long training times. The real challenge is designing an ML-based malware detector that achieves a higher F1-score. In this paper, we present a novel framework that classifies malware using two kinds of features: opcodes and application programming interface (API) calls. First, API calls and opcodes are extracted with Interactive Disassembler Pro (IDA Pro) from the assembly language source code (ALSC) files of the malicious samples. Then, the continuous n-gram technique is applied to the extracted API calls and opcodes to create the dataset's features. The value of each feature in a row is its frequency in the extracted behavior of the corresponding sample. We scale the values in the dataset using the term frequency-inverse document frequency (TF-IDF) methodology. The best combination of n-gram and feature selection techniques is identified for the API-based and opcode-based datasets. The final label of a malicious sample is decided by the highest detection probability produced by the API-based and opcode-based detectors. For analysis, an off-the-shelf dataset named Microsoft Malware is used. We achieve an F1-score of 96% for the API-based detector and an F1-score of 98% for the opcode-based detector. Our framework achieves an overall F1-score of 99.3%, better than recent state-of-the-art techniques. Apart from attaining a higher F1-score, there is a reduction in training time due to using ML techniques instead of DL techniques.
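
The pipeline described above (continuous n-grams, frequency features, TF-IDF scaling, and a highest-probability decision between the two detectors) can be illustrated with a minimal sketch. This is not the authors' implementation. The helper names `ngrams`, `tfidf`, and `fuse`, the toy opcode sequences, and the labels `family_a` and `family_b` are all hypothetical. The TF-IDF weighting is the textbook form tf × log(N/df), which may differ from the variant used in the paper, and the fusion rule reflects one reading of "highest probability": the label comes from whichever detector is more confident in its top class.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n):
    """Build TF-IDF weighted n-gram features, one dict per sample.

    Assumption: textbook weighting tf * log(N / df), where tf is the
    n-gram count divided by the total n-grams in the sample.
    """
    counts = [Counter(ngrams(doc, n)) for doc in docs]
    # Document frequency: number of samples that contain each n-gram.
    df = Counter(gram for c in counts for gram in c)
    num_docs = len(docs)
    rows = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid dividing by zero on short samples
        rows.append({
            gram: (k / total) * math.log(num_docs / df[gram])
            for gram, k in c.items()
        })
    return rows


def fuse(api_probs, opcode_probs):
    """Return the label from whichever detector has the more confident top class."""
    api_label = max(api_probs, key=api_probs.get)
    opcode_label = max(opcode_probs, key=opcode_probs.get)
    if api_probs[api_label] >= opcode_probs[opcode_label]:
        return api_label
    return opcode_label


# Hypothetical opcode sequences standing in for three disassembled samples.
opcode_docs = [
    ["push", "mov", "call", "pop"],
    ["push", "mov", "xor", "ret"],
    ["mov", "xor", "xor", "ret"],
]
rows = tfidf(opcode_docs, n=2)

# Hypothetical class probabilities from the two detectors for one sample.
label = fuse(
    {"family_a": 0.6, "family_b": 0.4},
    {"family_a": 0.1, "family_b": 0.9},
)
```

In this sketch an n-gram that occurs in every sample receives a weight of zero, so only n-grams that discriminate between samples contribute to the feature vector. In practice the per-sample dictionaries would be converted to a fixed-width matrix before feature selection and classifier training.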

