Malware Detection by Control-Flow Graph Level Representation Learning With Graph Isomorphism Network

Yun Gao, Yukiko Yamaguchi, Hajime Shimada, Hirokazu Hasegawa

doi:10.1109/access.2022.3215267

Open Access

Journal: IEEE Access
Publication Date: Jan 1, 2022
Citations: 6
License type: CC BY-NC-ND 4.0

Affiliation: Nagoya University, National Institute of Informatics

Abstract

With society’s increasing reliance on computer systems and network technology, the threat of malicious software is growing increasingly serious. In the field of information security, malware detection has long been a key problem that academia and industry are committed to solving. Machine learning methods, such as the Gradient Boosting Decision Tree (GBDT) and deep neural networks, are effective for processing large-scale data. Although these detection methods can deal with cyber threats, most feature extraction methods are based on statistical features of portable executable (PE) files and thus do not capture the decompiled code or execution flow structure of the PE samples. Therefore, we propose a malware classification system based on the Control-Flow Graph (CFG) and the Graph Isomorphism Network (GIN). The feature vectors of CFG basic blocks are generated using the large-scale pre-trained language model MiniLM, which helps the GIN further learn and compress the CFG-based representation; the resulting representation is classified with a multi-layer perceptron. In addition, we evaluated the effectiveness of the representation under different dimensions and classifiers. To evaluate our method, we built a CFG-based malware detection graph dataset from the PE files of the Blue Hexagon Open Dataset for Malware Analysis (BODMAS), which we call the Malware Geometric Binary Dataset (MGD-BINARY), and collected experimental results for CFG representations under different dimension and classifier settings. The evaluation results show that our proposal achieves an accuracy of 0.99160 and an Area Under the Curve (AUC) of 0.99148.
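
To make the GIN step in the abstract more concrete, the following is a minimal pure-Python sketch of one standard GIN update, h_v' = MLP((1 + eps) * h_v + sum of neighbor features), followed by a sum readout that produces a graph-level vector. It is not the authors' implementation: the toy three-block CFG, the identity weights, the choice to aggregate over predecessor blocks, and all function names are hypothetical. In the pipeline described above, the node features would be MiniLM embeddings of basic blocks, and the graph-level vector would be passed to a multi-layer perceptron classifier.

```python
# Minimal pure-Python sketch of one GIN layer plus a sum readout.
# All weights, sizes, and the toy graph are illustrative assumptions.

def linear(x, weights, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]

def gin_layer(features, neighbors, mlp, eps=0.0):
    """One GIN update: h_v' = MLP((1 + eps) * h_v + sum of neighbor h_u)."""
    updated = []
    for v, h_v in enumerate(features):
        agg = [(1.0 + eps) * x for x in h_v]
        for u in neighbors[v]:
            agg = [a + x for a, x in zip(agg, features[u])]
        updated.append(mlp(agg))
    return updated

def sum_readout(features):
    """Graph-level representation: element-wise sum over all nodes."""
    return [sum(col) for col in zip(*features)]

# Toy CFG with three basic blocks and edges 0 -> 1, 0 -> 2, 1 -> 2.
# Each block aggregates over its predecessors (an assumption of this sketch).
neighbors = {0: [], 1: [0], 2: [0, 1]}

# Stand-ins for basic-block embeddings; real ones would come from MiniLM.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical one-layer "MLP" with identity weights, to keep values readable.
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]

def toy_mlp(x):
    return relu(linear(x, W1, b1))

hidden = gin_layer(features, neighbors, toy_mlp)
graph_vector = sum_readout(hidden)
print(graph_vector)  # [4.0, 3.0]
```

The sum aggregator and sum readout are used here because they are the standard GIN choices; whether the paper uses the same readout or a learnable eps is not stated in the abstract.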

Full Text