Abstract

Machine learning models are widely used to identify malicious software. However, existing models represent polysemous tokens imprecisely and lack contextual semantic representations, so they fail to recognize certain types of malicious software. In this paper, we propose SeMalBERT, a semantic-based intelligent malware detection model for identifying malicious software on Windows systems. The model uses the API function sequences of malicious software as its learned features. First, BERT is applied to produce word representations and extract semantic information from the sequences. Second, a hybrid discriminator that combines a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network explores the chaining relationships between functions. Finally, an external attention mechanism placed after the LSTM helps the model focus on key information within the sequence. Experimental results show that SeMalBERT outperforms existing malware detection techniques in accuracy, F1 score, and loss value on a general dataset.
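To make the final stage more concrete, the following is a minimal sketch of external attention as it is commonly formulated: two small learnable memory matrices and a double normalization (softmax over tokens, then an L1 normalization over memory slots). The abstract does not specify the exact variant used in SeMalBERT, so the function name `external_attention`, the memory size, and the random values standing in for LSTM hidden states and learned memories are all assumptions made for illustration.

```python
import math
import random


def external_attention(features, mem_k, mem_v):
    """Apply external attention to a sequence of feature vectors.

    features: n x d list of lists (here: hypothetical LSTM hidden states,
              one per API call in the sequence).
    mem_k:    s x d key memory (learned in a real model; random here).
    mem_v:    s x d_out value memory (learned in a real model; random here).
    Returns (output n x d_out, attention map n x s).
    """
    n, s = len(features), len(mem_k)

    # Similarity between every position and every memory slot (n x s).
    scores = [[sum(f * k for f, k in zip(row, slot)) for slot in mem_k]
              for row in features]

    # Normalization 1: softmax over positions, computed per memory slot.
    attn = [[0.0] * s for _ in range(n)]
    for j in range(s):
        col_max = max(scores[i][j] for i in range(n))
        exps = [math.exp(scores[i][j] - col_max) for i in range(n)]
        total = sum(exps)
        for i in range(n):
            attn[i][j] = exps[i] / total

    # Normalization 2: L1-normalize each position over the memory slots.
    for i in range(n):
        row_sum = sum(attn[i])
        attn[i] = [a / row_sum for a in attn[i]]

    # Each output vector is an attention-weighted sum of value-memory rows.
    d_out = len(mem_v[0])
    out = [[sum(attn[i][j] * mem_v[j][c] for j in range(s))
            for c in range(d_out)]
           for i in range(n)]
    return out, attn


# Illustrative usage with made-up sizes: 5 API calls, 8-dim hidden states,
# 4 memory slots. Random numbers stand in for learned parameters.
random.seed(0)
hidden_states = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]
mem_k = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
mem_v = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
out, attn = external_attention(hidden_states, mem_k, mem_v)
```

Because the memories are shared across all inputs rather than computed from each sequence, the cost grows linearly with sequence length, which is one plausible reason to prefer this form over self-attention for long API call traces.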
