Malicious code classification based on opcode sequences and textCNN network

Qianhui Wang, Quan Qian

doi:10.1016/j.jisa.2022.103151

Abstract

Malicious code classification is essential to network security. Malicious code is the most common means of network attack and threatens the security of user information and property. An effective classification method can improve the efficiency of malicious code detection and the ability to discover new malicious code families. This study proposes a new method to analyze, classify, and detect malicious code. The semantic features of opcode sequences are extracted by introducing the concept of word vectors. The extracted sequence is then treated as a text sentence and fed to a text convolutional neural network (textCNN) to identify malicious code families. Experimental results show that the model achieves more than 98% accuracy (with macro-average precision above 98.65% and macro-average recall of approximately 98.66%) on the 2015 Microsoft Malware Challenge dataset, and 91.93% accuracy on the SOREL-20M dataset. Call instructions are mostly used to invoke APIs, library functions, and user-defined functions, through which the behavior of malicious code is generally realized. Thus, selecting the blocks that contain call instructions as key blocks reduces the model training time. After key block selection, the number of opcodes on the Microsoft Malware Challenge dataset is reduced by 39.07% on average and the model reaches 98.18% accuracy, slightly lower than the result obtained with all opcodes. On the SOREL-20M dataset, the number of opcodes is reduced by 30.49% on average and the accuracy increases to 93.46%. The experimental results show that the proposed method works well and outperforms a byte n-gram representation.
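The key block idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define how blocks are delimited, so the sketch assumes a block ends at a control-transfer opcode, and the set `BLOCK_TERMINATORS` and all function names are hypothetical choices made for illustration.

```python
from typing import List

# Assumption for this sketch: a block ends at one of these control-transfer
# opcodes. The abstract does not specify the block boundaries actually used.
BLOCK_TERMINATORS = {"jmp", "jz", "jnz", "je", "jne", "ret", "retn"}


def split_into_blocks(opcodes: List[str]) -> List[List[str]]:
    """Split a flat opcode sequence into blocks ending at a terminator."""
    blocks: List[List[str]] = []
    current: List[str] = []
    for op in opcodes:
        current.append(op)
        if op in BLOCK_TERMINATORS:
            blocks.append(current)
            current = []
    if current:  # keep a trailing block that has no terminator
        blocks.append(current)
    return blocks


def select_key_blocks(blocks: List[List[str]]) -> List[List[str]]:
    """Keep only the blocks that contain at least one call instruction."""
    return [block for block in blocks if "call" in block]


def reduction_ratio(opcodes: List[str], key_blocks: List[List[str]]) -> float:
    """Fraction of opcodes removed by keeping only the key blocks."""
    if not opcodes:
        return 0.0
    kept = sum(len(block) for block in key_blocks)
    return 1.0 - kept / len(opcodes)


# Toy opcode sequence, invented for illustration only.
ops = ["push", "mov", "call", "jmp", "xor", "add", "jz", "mov", "call", "ret"]
key = select_key_blocks(split_into_blocks(ops))
print(key)
print(f"{reduction_ratio(ops, key):.0%} of opcodes removed")
```

In this toy sequence, the middle block (`xor`, `add`, `jz`) contains no call and is dropped, so three of the ten opcodes are removed. The retained blocks, concatenated in order, would then form the shortened "sentence" passed to the word-vector and textCNN stages.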
