Malware Family Classification using LSTM with Attention

Qi Xie, Yongjun Wang, Zhiquan Qin

doi:10.1109/cisp-bmei51763.2020.9263499

Abstract

As the damage caused by malware grows more severe and reaches a wider range of fields, the ability to detect and classify malware becomes increasingly urgent and significant. Modern malware is usually equipped with metamorphic and polymorphic techniques, which means that samples from the same family might be modified. Dynamic analysis can partially counter these anti-static techniques. However, many malware authors have already realized that dynamic methods are increasingly employed in malware analysis, so the field still suffers heavily from anti-dynamic techniques. To address these problems, this paper first uses a static approach to disassemble the malware and obtain its assembly code. Harnessing the power of word embedding, our method then learns the relationships among instructions within each block, where blocks are split at jump instructions, and represents the instructions as vectors. These vectors are fed into an LSTM to obtain each block's representative features. We then incorporate an attention mechanism to reduce the influence of junk code, which is one of the anti-static analysis techniques, and of atypical features, so that our method obtains a significantly better feature representation of a malware file. Empirical experiments show that our method outperforms its competitors and achieves the best performance, with an accuracy of 94.25% and an F1 score of 0.95 on a dataset of 16,718 samples from 6 malware families.
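The two structural steps named in the abstract, splitting assembly code into blocks at jump instructions and attention-weighting the per-block features, can be illustrated with a minimal sketch. This is not the authors' implementation: the jump mnemonic set, the function names `split_blocks` and `attention_pool`, and the dot-product scoring against a query vector are all assumptions made for illustration, and the block vectors are taken as given (in the paper they would come from the LSTM over embedded instructions).

```python
import math

# Hypothetical set of x86 jump mnemonics; the abstract does not list the exact set used.
JUMP_MNEMONICS = {"jmp", "je", "jne", "jz", "jnz", "jg", "jge", "jl", "jle", "ja", "jb"}


def split_blocks(instructions):
    """Split a linear list of instruction strings into blocks.

    A block is closed right after each jump instruction; any trailing
    instructions form a final block.
    """
    blocks, current = [], []
    for ins in instructions:
        current.append(ins)
        parts = ins.split()
        if parts and parts[0].lower() in JUMP_MNEMONICS:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def attention_pool(block_vectors, query):
    """Combine block vectors into one file-level vector.

    Each block is scored by its dot product with `query` (an assumed,
    normally learned, vector). Scores are normalized with a softmax and
    used as weights in a weighted sum, so low-scoring blocks (for example
    junk code) contribute little to the result.
    """
    scores = [sum(b * q for b, q in zip(vec, query)) for vec in block_vectors]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(block_vectors[0])
    pooled = [sum(w * vec[i] for w, vec in zip(weights, block_vectors)) for i in range(dim)]
    return weights, pooled


if __name__ == "__main__":
    code = ["mov eax, 1", "cmp eax, 2", "jne L1", "add eax, 3", "jmp L2", "ret"]
    print(split_blocks(code))
    # Toy 2-dimensional block features standing in for LSTM outputs.
    print(attention_pool([[1.0, 0.0], [0.0, 1.0]], query=[2.0, 0.0]))
```

In a trained model the query vector and the block features would be learned jointly; here they are fixed toy values so that the weighting behaviour is easy to inspect.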

Full Text