SeMalBERT: Semantic-based malware detection with bidirectional encoder representations from transformers

Junming Liu, Yuntao Zhao, Yongxin Feng, Yutao Hu, Xiangyu Ma

doi:10.1016/j.jisa.2023.103690

Abstract

Machine learning models are widely used to identify malicious software. However, existing models represent polysemous tokens imprecisely and lack contextual semantic representations, so they fail to recognize certain types of malicious software. In this paper, we propose SeMalBERT, a semantic-based intelligent malware detection model for identifying malicious software on Windows systems. The model uses the API function sequences of malicious software as its learned features. First, BERT performs word representation and extracts semantic information from the sequences. Second, a hybrid discriminator based on a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) explores the chaining relationships between functions. Finally, an external attention mechanism added after the LSTM enables the model to focus on key information within the text. Experimental results demonstrate that SeMalBERT outperforms existing malware detection techniques in terms of accuracy, F1 score, and loss function value on a general dataset.
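To make the final stage of the pipeline concrete, the following is a minimal sketch of an external attention step applied to a sequence of per-step feature vectors, such as LSTM outputs. The abstract does not specify which formulation is used. This sketch assumes the two-memory form with double normalization described by Guo et al. for external attention. The function name `external_attention`, the memory matrices `mem_k` and `mem_v`, the sizes, and the random values are all hypothetical and chosen only for illustration. In a trained model the memories would be learned parameters.

```python
import math
import random


def external_attention(features, mem_k, mem_v):
    """Assumed external attention: features is N x d, mem_k and mem_v are S x d.

    Returns the re-weighted features (N x d) and the attention map (N x S).
    """
    # Similarity of every time step to every external memory slot (N x S).
    scores = [[sum(f * k for f, k in zip(row, slot)) for slot in mem_k]
              for row in features]
    n, s = len(scores), len(scores[0])

    # Normalization 1: softmax over the time-step axis, one memory slot at a time.
    for j in range(s):
        col_max = max(scores[i][j] for i in range(n))
        exps = [math.exp(scores[i][j] - col_max) for i in range(n)]
        total = sum(exps)
        for i in range(n):
            scores[i][j] = exps[i] / total

    # Normalization 2: L1-normalize over the memory axis, one time step at a time.
    attn = []
    for row in scores:
        row_sum = sum(row)
        attn.append([v / row_sum for v in row])

    # Each output vector is a mixture of the value-memory slots.
    dim = len(mem_v[0])
    out = [[sum(a * slot[c] for a, slot in zip(row, mem_v)) for c in range(dim)]
           for row in attn]
    return out, attn


# Hypothetical toy input: 5 time steps, 4 features per step, 3 memory slots.
random.seed(0)
features = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
mem_k = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
mem_v = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]

out, attn = external_attention(features, mem_k, mem_v)
print(len(out), len(out[0]))  # sequence length and feature size are preserved
```

Unlike self-attention, the keys and values here come from small memories shared across all samples rather than from the input sequence itself, so the cost grows linearly with sequence length. Whether SeMalBERT uses exactly this normalization is an assumption of the sketch.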

Full Text