Functions-based CFG Embedding for Malware Homology Analysis

Jieran Liu, Yuan Shen, Hanbing Yan

doi:10.1109/ict.2019.8798769

Abstract

Malware homology analysis aims to determine whether different pieces of malicious code originate from the same set of malicious code or were written by the same author or team, and whether they share intrinsic relevance and similarity. Homology analysis of malicious code is also an important part of studying the groups behind different APT (Advanced Persistent Threat) attacks. At present, homology identification in the anti-malware industry still relies on manual analysis and the experience of security experts, and research on automated large-scale homology analysis of malicious code remains insufficient. The method proposed in this paper addresses the problem of automated large-scale homology analysis of malicious code and aims to provide auxiliary information for discovering the group behind an APT attack. We collected samples of different APT groups from public threat intelligence and propose a novel approach that classifies these samples into APT groups in order to further analyze the homology of the malware. We combine the CFG (Control Flow Graph) of each malicious code function with the disassembled code of the stripped malware to generate an embedding, i.e., a numeric vector. These embeddings form a function feature database for each APT group, and we present a neural network model that uses them for APT group classification. We have implemented our approach in a prototype system called MCrab. Our extensive evaluation showed that MCrab produces high-accuracy results with few to no false positives. Our research also shows that deep learning can be successfully applied to malware homology analysis.
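As a rough illustration of what a function-level CFG embedding can look like, the following minimal sketch turns one function's basic blocks and control-flow edges into a fixed-length numeric vector. It is not the MCrab implementation: the opcode vocabulary `OPCODES`, the helpers `block_features` and `embed_function`, the 0.5 mixing weight, and the sample blocks are all hypothetical choices made for this example, and the abstract does not specify the actual features or network used.

```python
from collections import Counter

# Hypothetical opcode vocabulary (an assumption for this sketch only).
OPCODES = ["mov", "call", "cmp", "jmp", "xor"]


def block_features(instructions):
    """Count how often each vocabulary opcode appears in one basic block."""
    counts = Counter(ins.split()[0] for ins in instructions)
    return [float(counts[op]) for op in OPCODES]


def embed_function(blocks, edges, rounds=2):
    """Toy embedding: mix each block's vector with its successors', then sum.

    blocks maps a block name to its disassembled instructions;
    edges maps a block name to the names of its CFG successors.
    """
    state = {name: block_features(ins) for name, ins in blocks.items()}
    for _ in range(rounds):
        new_state = {}
        for name, vec in state.items():
            # Aggregate the current vectors of all successor blocks.
            agg = [0.0] * len(OPCODES)
            for succ in edges.get(name, []):
                agg = [a + x for a, x in zip(agg, state[succ])]
            # 0.5 is an arbitrary mixing weight chosen for illustration.
            new_state[name] = [v + 0.5 * a for v, a in zip(vec, agg)]
        state = new_state
    # Summing over blocks yields one vector per function, independent of CFG size.
    return [sum(column) for column in zip(*state.values())]


# Hypothetical two-block function: "entry" falls through to "exit".
blocks = {
    "entry": ["mov eax, 1", "cmp eax, ebx", "jmp exit"],
    "exit": ["xor eax, eax", "call cleanup"],
}
edges = {"entry": ["exit"], "exit": []}
embedding = embed_function(blocks, edges, rounds=1)
```

In a setup like the one the abstract describes, vectors of this kind would be stored per function and passed to a classifier that predicts the APT group. A learned model would replace the fixed mixing weight with trained parameters.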
