Abstract

Malware detection is a complex problem. The commonly used signature-based technique cannot detect unknown or zero-day malware. Traditional machine learning-based methods can identify unknown malicious programs but require substantial domain expertise for feature engineering. This study presents a new malware detection method that stacks static opcode and bytecode features. First, the bytecode and opcode sequences of executable files are extracted using a disassembler. Next, TF-IDF, a technique from natural language processing (NLP), is used to vectorize the extracted sequences. The resulting opcode and bytecode feature vectors are concatenated into a single feature map, which serves as input to train three machine learning classifiers: SVM, k-NN, and Random Forest (RF). The RF classifier achieved the best results, with accuracy, precision, recall, and F-score of 98.47%, 98.60%, 98.46%, and 98.47%, respectively, even when trained on a small number of samples from the benchmark Microsoft BIG dataset. The best model is more accurate than the same classifiers trained on bytecode or opcode features alone. The empirical results indicate that the proposed method is useful to the security industry: it depends less on manual feature engineering and can reduce the burden on virus databases by handling large-scale malware.
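
The pipeline described above (vectorize each view with TF-IDF, concatenate the two vectors, then train a classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the token sequences and the labels `family_a` and `family_b` are made up, the smoothed IDF formula is one common variant chosen as an assumption, and a tiny cosine-similarity k-NN written in plain Python stands in for the SVM, k-NN, and RF classifiers the study evaluates. All function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_fit(docs):
    """Learn a vocabulary and smoothed IDF weights from tokenised documents."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))  # document frequency
    vocab = sorted(df)
    # Assumed smoothed variant: idf = ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    return vocab, idf

def tfidf_transform(doc, vocab, idf):
    """Map one token sequence to an L2-normalised TF-IDF vector."""
    counts = Counter(doc)  # tokens outside the vocabulary are ignored
    vec = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def stack_features(opcodes, bytecodes, op_model, byte_model):
    """Concatenate the opcode and bytecode TF-IDF vectors into one feature map."""
    return (tfidf_transform(opcodes, *op_model)
            + tfidf_transform(bytecodes, *byte_model))

def knn_predict(x, train_X, train_y, k=3):
    """Majority vote among the k training vectors most similar to x."""
    sims = sorted(
        ((sum(a * b for a, b in zip(x, t)), y) for t, y in zip(train_X, train_y)),
        reverse=True,
    )
    votes = Counter(label for _, label in sims[:k])
    return votes.most_common(1)[0][0]

# Toy samples: (opcode sequence, bytecode sequence, label). Purely illustrative.
samples = [
    (["push", "mov", "call", "ret"], ["55", "8B", "E8", "C3"], "family_a"),
    (["push", "mov", "mov", "ret"], ["55", "8B", "8B", "C3"], "family_a"),
    (["xor", "jmp", "xor", "jmp"], ["31", "EB", "31", "EB"], "family_b"),
    (["xor", "xor", "jmp", "call"], ["31", "31", "EB", "E8"], "family_b"),
]

# Fit one TF-IDF model per view, then stack the two views for every sample.
op_model = tfidf_fit([ops for ops, _, _ in samples])
byte_model = tfidf_fit([bts for _, bts, _ in samples])
train_X = [stack_features(ops, bts, op_model, byte_model) for ops, bts, _ in samples]
train_y = [label for _, _, label in samples]

# Classify an unseen sample using the stacked representation.
query = stack_features(["xor", "jmp", "jmp", "xor"], ["31", "EB", "EB", "31"],
                       op_model, byte_model)
prediction = knn_predict(query, train_X, train_y)
print(prediction)
```

Fitting a separate TF-IDF model per view keeps the opcode and bytecode vocabularies independent, so the stacked vector is simply the two weighted vectors placed side by side. In practice a library vectorizer and classifier would replace these hand-written helpers.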
