Research on the Construction of Malware Variant Datasets and Their Detection Method

Faming Lu,Zedong Lin,Yunxia Bao,Mengfan Tang,Zhaoyang Cai

doi:10.3390/app12157546

Abstract

Malware detection is of great significance for maintaining the security of information systems. Malware obfuscation techniques and malware variants are increasingly emerging, but their samples and API (application programming interface) sequences are difficult to obtain. This poses difficulties for the development of malware variant detection models. To address this issue in this paper, we first generated a malware variant dataset using the obfuscation technique based on the disassembly and decompilation of malware. Then, an API call dataset of these malware variants was constructed through sandboxing. Compared to similar work, the malware variants and their obfuscated API call sequences generated in this paper were all runnable. After that, taking a public API call sequence dataset of obfuscation-free malware as input, a BERT (bidirectional encoder representation from transformers) pretrained model for malware detection was constructed. To enhance the ability of this pretrained model to handle obfuscation and variants, in this paper, we used adversarial training to improve the robustness and generalization of the detection model under obfuscation. As the experimental results show, the proposed scheme can improve the classification performance of malware variants under obfuscation. The accuracy of the malware variant classification was close to that of the unobfuscated case.

Full Text