Machine learning-based malware detection using stacking of opcodes and bytecode sequences

Manoj Sai, Kajal Panda, Sanjeev Kumar, Aakansha Tyagi

doi:10.1109/pdgc56933.2022.10053307

Abstract

Malware detection is a complex problem. The commonly used signature-based technique cannot detect unknown or zero-day malware. Traditional machine learning-based methods can identify unknown malicious programs but require substantial domain expertise for feature engineering. This study presents a new malware detection method that stacks static opcode and bytecode features. First, the bytecode and opcode sequences of executable files are extracted using a disassembler. Next, TF-IDF, a technique from natural language processing (NLP), is used to vectorize the extracted sequences. The resulting opcode and bytecode feature vectors are concatenated into a single feature map, which serves as input for training machine learning classifiers: SVM, k-NN, and Random Forest (RF). The RF classifier achieved the best results, with accuracy, precision, recall, and F-score of 98.47%, 98.60%, 98.46%, and 98.47%, respectively, even with a small number of samples from the benchmark Microsoft BIG data set. The best model is more accurate than models that use bytecodes or opcodes alone as individual feature vectors. The empirical results suggest that the proposed method is useful to the security industry: it depends less on manual feature engineering and can reduce the burden on virus databases by handling large-scale malware.
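The pipeline described in the abstract (TF-IDF vectorization of each sequence type, concatenation into one feature map, then classification) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy opcode and byte tokens, the labels, the smoothed TF-IDF variant, and the helper names (`tfidf_fit`, `tfidf_transform`, `stack_features`, `knn_predict`) are all hypothetical. A plain 1-nearest-neighbour vote stands in for the paper's classifiers, and the disassembly step is skipped by starting from already-tokenized sequences.

```python
import math
from collections import Counter


def tfidf_fit(docs):
    """Learn a sorted vocabulary and smoothed IDF weights from tokenized docs."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))  # document frequency
    vocab = sorted(df)
    # Smoothed IDF (an assumed variant; the paper does not specify one).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    return vocab, idf


def tfidf_transform(doc, vocab, idf):
    """Map one tokenized document to an L2-normalized TF-IDF vector."""
    counts = Counter(doc)  # tokens outside the vocabulary are ignored
    vec = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def stack_features(opcode_vec, byte_vec):
    """Concatenate the two per-sample vectors into a single feature map."""
    return opcode_vec + byte_vec


def knn_predict(train_X, train_y, x, k=1):
    """Majority vote among the k nearest training samples (Euclidean)."""
    nearest = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical toy data: 0 = benign-like, 1 = malicious-like.
opcode_docs = [
    ["push", "mov", "call", "ret"],
    ["mov", "add", "mov", "ret"],
    ["xor", "jmp", "xor", "call"],
    ["xor", "xor", "jmp", "jmp"],
]
byte_docs = [
    ["55", "8B", "E8", "C3"],
    ["8B", "03", "8B", "C3"],
    ["33", "EB", "33", "E8"],
    ["33", "33", "EB", "EB"],
]
labels = [0, 0, 1, 1]

# Fit one TF-IDF model per sequence type, then stack the vectors per sample.
op_vocab, op_idf = tfidf_fit(opcode_docs)
by_vocab, by_idf = tfidf_fit(byte_docs)
X = [
    stack_features(
        tfidf_transform(o, op_vocab, op_idf),
        tfidf_transform(b, by_vocab, by_idf),
    )
    for o, b in zip(opcode_docs, byte_docs)
]

# Classify an unseen sample using the same fitted vocabularies.
query = stack_features(
    tfidf_transform(["xor", "jmp", "call", "xor"], op_vocab, op_idf),
    tfidf_transform(["33", "EB", "E8", "33"], by_vocab, by_idf),
)
prediction = knn_predict(X, labels, query, k=1)
```

Fitting a separate vocabulary for each sequence type keeps opcode and byte tokens from colliding, and the stacked vector simply has the length of both vocabularies combined. In practice a library implementation of TF-IDF and of the classifiers would replace these hand-written helpers.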

Full Text