FUSION: Measuring Binary Function Similarity with Code-Specific Embedding and Order-Sensitive GNN

Hao Gao,Lina Wang,Tong Zhang,Songqiang Chen,Fajiang Yu

doi:10.3390/sym14122549

Abstract

Binary code similarity measurement is a popular research area in binary analysis with the recent development of deep learning-based models. Current state-of-the-art methods often use the pre-trained language model (PTLM) to embed instructions into basic blocks as representations of nodes within a control flow graph (CFG). These methods will then use the graph neural network (GNN) to embed the whole CFG and measure the binary similarities between these code embeddings. However, these methods almost directly treat the assembly code as a natural language text and ignore its code-specific features when training PTLM. Moreover, They barely consider the direction of edges in the CFG or consider it less efficient. The weaknesses of the above approaches may limit the performances of previous methods. In this paper, we propose a novel method called function similarity using code-specific PPTs and order-sensitive GNN (FUSION). Since the similarity of binary codes is a symmetric/asymmetric problem, we were guided by the ideas of symmetry and asymmetry in our research. They measure the binary function similarity with two code-specific PTLM training strategies and an order-sensitive GNN, which, respectively, alleviate the aforementioned weaknesses. FUSION outperforms the state-of-the-art binary similarity methods by up to 5.4% in accuracy, and performs significantly better.

Full Text