Malware Classification with BERT

Joel Lawrence Alvares

doi:10.31979/etd.7n35-garb

Abstract

Malware classification is used to distinguish different types of malware from each other. This project aims to carry out malware classification using word embeddings, which are used in Natural Language Processing (NLP) to identify and evaluate the relationships between the words of a sentence. Word embeddings are generated by BERT and Word2Vec for malware samples and used to carry out multi-class classification. BERT is a transformer-based, pre-trained NLP model that can be used for a wide range of tasks such as question answering, paraphrase generation, and next sentence prediction. However, the attention mechanism of a pre-trained BERT model can also be used in malware classification, because it captures information about the relation between each opcode and every other opcode belonging to a malware family. Word2Vec generates word embeddings in which words with similar contexts lie closer together. The word embeddings generated by Word2Vec would help classify malware samples belonging to a certain family based on similarity. Classification will be carried out using classifiers such as Support Vector Machines (SVM), Logistic Regression, Random Forests, and Multi-Layer Perceptron (MLP). The classification accuracy obtained with BERT embeddings can then be compared with the accuracy obtained with Word2Vec embeddings, which establishes a baseline for the results.
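The embed-then-classify pipeline described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the method used in this project: co-occurrence counts stand in for Word2Vec or BERT embeddings, and a nearest-centroid rule with cosine similarity stands in for SVM, Logistic Regression, Random Forests, or MLP. The opcode sequences, the family names, and every function name below are hypothetical and chosen only for illustration.

```python
from math import sqrt

# Toy opcode sequences; families and opcodes are invented for illustration.
SAMPLES = [
    ("familyA", ["mov", "push", "call", "mov", "push", "call", "pop"]),
    ("familyA", ["push", "call", "mov", "push", "call", "pop", "mov"]),
    ("familyB", ["xor", "add", "jmp", "xor", "add", "jmp", "sub"]),
    ("familyB", ["add", "jmp", "xor", "add", "sub", "jmp", "xor"]),
]


def build_vocab(samples):
    """Collect the sorted set of opcodes seen in the training samples."""
    return sorted({op for _, seq in samples for op in seq})


def opcode_vectors(samples, vocab, window=2):
    """Stand-in for Word2Vec: one co-occurrence count vector per opcode."""
    index = {op: i for i, op in enumerate(vocab)}
    vectors = {op: [0.0] * len(vocab) for op in vocab}
    for _, seq in samples:
        for i, op in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[op][index[seq[j]]] += 1.0
    return vectors


def sample_vector(seq, vectors, dim):
    """Average the opcode vectors of one sample; unknown opcodes are skipped."""
    known = [vectors[op] for op in seq if op in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train_centroids(samples, vectors, dim):
    """Stand-in classifier: one mean sample vector per malware family."""
    grouped = {}
    for label, seq in samples:
        grouped.setdefault(label, []).append(sample_vector(seq, vectors, dim))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def classify(seq, vectors, centroids, dim):
    """Assign the family whose centroid is most similar to the sample."""
    vec = sample_vector(seq, vectors, dim)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


vocab = build_vocab(SAMPLES)
vectors = opcode_vectors(SAMPLES, vocab)
centroids = train_centroids(SAMPLES, vectors, len(vocab))
prediction = classify(["mov", "push", "call", "pop"], vectors, centroids, len(vocab))
```

In a fuller experiment, the `opcode_vectors` step would be replaced by a trained Word2Vec or BERT model and the centroid rule by one of the classifiers named in the abstract, with accuracy measured on held-out samples.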

Full Text