RCFG2Vec: Considering Long-Distance Dependency for Binary Code Similarity Detection

Abstract

Binary code similarity detection (BCSD), as a fundamental technique in software security, has various applications, including malware family detection, known vulnerability detection and code plagiarism detection. Recent deep learning-based BCSD approaches have demonstrated promising performance. However, they face two significant challenges that limit detection performance. First, most approaches that use sequence networks (like RNN and Transformer) utilize coarse-grained tokenization methods, which result in a large vocabulary size and a severe out-of-vocabulary (OOV) problem. Second, CFG-based methods typically use variants of graph convolutional networks, which only consider local structural information and discard long-distance dependencies between basic blocks.
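
The second challenge can be illustrated with a small sketch. It assumes a generic message-passing scheme in which each graph layer aggregates only from direct neighbours, so a stack of k layers sees at most the k-hop neighbourhood of a basic block. This is an assumption about typical GCN variants, not a description of RCFG2Vec itself, and the toy CFG and the names `hop_distances` and `receptive_field` are hypothetical.

```python
from collections import deque


def hop_distances(cfg, source):
    """Return BFS hop counts from `source`, treating CFG edges as undirected."""
    # Build an undirected adjacency view so information can flow both ways.
    neighbours = {node: set() for node in cfg}
    for node, successors in cfg.items():
        for succ in successors:
            neighbours[node].add(succ)
            neighbours.setdefault(succ, set()).add(node)

    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def receptive_field(cfg, source, num_layers):
    """Basic blocks whose features can reach `source` after `num_layers` layers."""
    return {n for n, d in hop_distances(cfg, source).items() if d <= num_layers}


# Hypothetical control flow graph: basic block -> successor blocks.
toy_cfg = {
    "entry": ["check"],
    "check": ["loop", "error"],
    "loop": ["body"],
    "body": ["update"],
    "update": ["exit"],
    "error": ["exit"],
    "exit": [],
}

# With two layers, "entry" never receives information from "update",
# which is four hops away in this toy graph.
two_layer_view = receptive_field(toy_cfg, "entry", 2)
```

Under this assumption, covering a dependency between distant blocks requires either many more layers or a mechanism that looks beyond the immediate neighbourhood.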

Similar Papers
  • Research Article
  • Cite Count Icon 2
  • 10.3390/electronics13091715
A Survey of Binary Code Similarity Detection Techniques
  • Apr 29, 2024
  • Electronics
  • Liting Ruan + 4 more

Binary Code Similarity Detection is a method that involves comparing two or more binary code segments to identify their similarities and differences. This technique plays a crucial role in areas such as software security, vulnerability detection, and software composition analysis. With the extensive use of binary code in software development and system optimization, binary code similarity detection has become an important area of research. Traditional methods of source code similarity detection face challenges when dealing with the unreadable and complex nature of binary code, necessitating specialized techniques and algorithms. This review compares and summarizes various techniques and methods of binary code similarity detection, highlighting their strengths and limitations in handling different characteristics of binary code. Additionally, the article suggests potential future research directions. As research and innovation in this technology continue to advance, binary code similarity detection is expected to play an increasingly significant role in fields like software security.

  • Research Article
  • Cite Count Icon 3
  • 10.1155/2022/4095481
SROBR: Semantic Representation of Obfuscation‐Resilient Binary Code
  • Jan 1, 2022
  • Wireless Communications and Mobile Computing
  • Ke Tang + 8 more

With the rapid development of information technology, the scale of software has increased exponentially. Binary code similarity detection technology plays an important role in many fields, such as detecting software plagiarism, vulnerabilities discovery, and copyright solution issues. Nevertheless, what cannot be ignored is that a variety of approaches to binary code semantic representation have been introduced recently, but few can catch up with existing code obfuscation techniques due to their maturing and extensive development. In order to solve this problem, we propose a new neural network model, named SROBR, which is a deep integration of natural language processing model and graph neural network. In SROBR, BERT is applied to capture sequence information of the binary code at the first place, and then GAT is utilized to capture the structural information. It combines natural language processing and graph neural network, which can capture the semantic information of binary programs while resisting obfuscation options in a more efficient way. Through binary code similarity detection task and obfuscated option classification task, the experimental results demonstrate that SROBR outperforms existing binary similarity detection methods in resisting obfuscation techniques.

  • Book Chapter
  • 10.1007/978-3-031-20738-9_4
Cross Architecture Function Similarity Detection with Binary Lifting and Neural Metric Learning
  • Jan 1, 2023
  • Zhenzhou Tian + 2 more

Binary code similarity detection has extensive and important applications in IoT device security, yet it suffers from the challenges posed by the differentiated underlying architectures of the diverse IoT devices. To this end, this paper presents XFSim (Cross-architecture Function-level binary code Similarity detection), through binary lifting and neural similarity metric learning. Firstly, to make the detection method architecture agnostic, the binaries to be analyzed are lifted to an intermediate code called LLVM-IR and normalized for a uniform representation, so as to alleviate the discrepancies between the raw assemblies of different instruction set architectures (ISAs). Secondly, we utilize FastText, a widely used word embedding algorithm, that learns on the functions’ normalized intermediate codes to obtain high quality token embeddings. Then, an efficient CNN-based model is utilized to encode the semantics of each function into numerical vectors, while a siamese neural network structure is used to supervise the whole model training, with the goal of minimizing the contrastive loss. Finally, the similarity of two binary code snippets can be measured by the cosine similarity of their encoded vectors. The experiments conducted on a public dataset show that the strategy of lifting and normalizing the assemblies to uniform representations greatly alleviates the semantic gaps between different ISAs, and XFSim outperforms two existing cross-architecture binary code similarity detectors.
Keywords: Binary code similarity detection; Instruction set architecture; Neural network; Binary lifting

  • Research Article
  • Cite Count Icon 4
  • 10.1109/access.2023.3259481
Binary Code Representation With Well-Balanced Instruction Normalization
  • Jan 1, 2023
  • IEEE Access
  • Hyungjoon Koo + 3 more

The recovery of contextual meanings on a machine code is required by a wide range of binary analysis applications, such as bug discovery, malware analysis, and code clone detection. To accomplish this, advancements on binary code analysis borrow the techniques from natural language processing to automatically infer the underlying semantics of a binary, rather than relying on manual analysis. One of the crucial pipelines in this process is instruction normalization, which helps to reduce the number of tokens and to avoid an out-of-vocabulary (OOV) problem. However, existing approaches often substitute operands with a common token (e.g., callee target → FOO), inevitably resulting in the loss of important information. In this paper, we introduce well-balanced instruction normalization (WIN), a novel approach that retains rich code information while minimizing the downsides of code normalization. With large swaths of binary code, our finding shows that the instruction distribution follows Zipf’s Law like a natural language, a function conveys contextually meaningful information, and the same instruction at different positions may require diverse code representations. To show the effectiveness of WIN, we present DeepSemantic that harnesses the BERT architecture with two training phases: pre-training for generic assembly code representation, and fine-tuning for building a model tailored to a specialized task. We define a downstream task of binary code similarity detection, which requires underlying code semantics. Our experimental results show that our binary similarity model with WIN outperforms two state-of-the-art binary similarity tools, DeepBinDiff and SAFE, with an average improvement of 49.8% and 15.8%, respectively.

  • Research Article
  • Cite Count Icon 9
  • 10.1007/s44196-023-00206-9
DeepDual-SD: Deep Dual Attribute-Aware Embedding for Binary Code Similarity Detection
  • Mar 17, 2023
  • International Journal of Computational Intelligence Systems
  • Jiabao Guo + 5 more

Binary code similarity detection (BCSD) is a task of detecting similarity of binary functions which are not available to the corresponding source code. It has been widely utilized to facilitate various kinds of crucial security analysis in software engineering. Because of the complexity of the program compilation process, identifying binary code similarity presents tough challenges. The most sensible binary similarity detector relies on a robust vector representation of binary code. However, few BCSD approaches are suitable to form vector representations for analyzing similarities between binaries, which may not only diverge in semantics but also in structures. And the existing solutions which only depend on hands-on feature engineering to form feature vectors, fail to take into consideration the relationships between instructions. To resolve these problems, we propose a novel and unified approach called DeepDual-SD that aims to combine the dual attributes (semantic and structural attribute). More specifically, DeepDual-SD consists of two branches, in which one text-based feature representation is driven by semantic attribute learning to exploit instruction semantics, another graph-based feature representation for structural attribute learning to investigate structural differences. Meanwhile deep embedding (DE) technology is utilized to map this information into low-dimensional vector representation. In addition, to get together the dual attributes, a fusion mechanism based on gate architecture is designed for learning to pay proper attention between the two attribute-aware embeddings. Experimental verifications are conducted on Openssl and Debian datasets for several tasks, including cross-compiler, cross-architecture and cross-version scenarios. The results demonstrate that our method outperforms the state-of-the-art BCSD methods in different scenarios in terms of detection accuracy.

  • Research Article
  • Cite Count Icon 47
  • 10.1145/3604611
Asteria-Pro: Enhancing Deep Learning-based Binary Code Similarity Detection by Incorporating Domain Knowledge
  • Nov 24, 2023
  • ACM Transactions on Software Engineering and Methodology
  • Shouguo Yang + 6 more

Widespread code reuse allows vulnerabilities to proliferate among a vast variety of firmware. There is an urgent need to detect these vulnerable codes effectively and efficiently. By measuring code similarities, AI-based binary code similarity detection is applied to detecting vulnerable code at scale. Existing studies have proposed various function features to capture the commonality for similarity detection. Nevertheless, the significant code syntactic variability induced by the diversity of IoT hardware architectures diminishes the accuracy of binary code similarity detection. In our earlier study and the tool Asteria, we adopted a Tree-LSTM network to summarize function semantics as function commonality, and the evaluation result indicates an advanced performance. However, it still has utility concerns due to excessive time costs and inadequate precision while searching for large-scale firmware bugs. To this end, we propose a novel deep learning-enhancement architecture by incorporating domain knowledge-based pre-filtration and re-ranking modules, and we develop a prototype named Asteria-Pro based on Asteria. The pre-filtration module eliminates dissimilar functions, thus reducing the subsequent deep learning-model calculations. The re-ranking module boosts the rankings of vulnerable functions among candidates generated by the deep learning model. Our evaluation indicates that the pre-filtration module cuts the calculation time by 96.9%, and the re-ranking module improves MRR and Recall by 23.71% and 36.4%, respectively. By incorporating these modules, Asteria-Pro outperforms existing state-of-the-art approaches in the bug search task by a significant margin. Furthermore, our evaluation shows that embedding baseline methods with pre-filtration and re-ranking modules significantly improves their precision. We conduct a large-scale real-world firmware bug search, and Asteria-Pro manages to detect 1,482 vulnerable functions with a high precision of 91.65%.

  • Research Article
  • Cite Count Icon 8
  • 10.3934/mbe.2021230
Cross-platform binary code similarity detection based on NMT and graph embedding.
  • Jan 1, 2021
  • Mathematical Biosciences and Engineering
  • Xiaodong Zhu + 2 more

Cross-platform binary code similarity detection determines whether a pair of binary functions coming from different platforms are similar, and plays an important role in many areas. Traditional methods focus on using platform-independent characteristic strands intersecting or control flow graph (CFG) matching to compute the similarity and have shortcomings in terms of efficiency and scalability. The existing deep-learning-based methods improve the efficiency but have a low accuracy and still use manually constructed features. Aiming at these problems, a cross-platform binary code similarity detection method based on neural machine translation (NMT) and graph embedding is proposed in this manuscript. We train an NMT model and a graph embedding model to automatically extract two parts of semantics of the binary code and represent it as a high-dimension vector, named an embedding. Then the similarity of two binary functions can be measured by the distance between their corresponding embeddings. We implement a prototype named SimInspector. Our comparative experiment result shows that SimInspector outperforms the state-of-the-art approach, Gemini, by about 6% with respect to similarity detection accuracy, and maintains a good efficiency.

  • Research Article
  • Cite Count Icon 38
  • 10.1016/j.eswa.2020.114348
BinDeep: A deep learning approach to binary code similarity detection
  • Dec 3, 2020
  • Expert Systems with Applications
  • Donghai Tian + 5 more

  • Conference Article
  • Cite Count Icon 1
  • 10.1109/iaecst57965.2022.10062142
Research and implementation of obfuscation binary code similarity detection
  • Dec 9, 2022
  • Yang Zhang + 3 more

The problem of binary code similarity detection has made significant progress in malware detection. The comparison of similarity by file bytecode, assembly code, control flow graph, and so on has been applied sufficiently. Nevertheless, the above method must be revised in practical application to judge the similarity of artificially obfuscating binary code. Therefore, this paper proposes a method based on deep learning for binary similarity comparison, which works directly on function disassembly instruction sequences without manual feature extraction. Through the experiment, the improved method can get a good effect on the similarity detection of the binary code which has been obfuscated.

  • Research Article
  • Cite Count Icon 5
  • 10.3390/sym14122549
FUSION: Measuring Binary Function Similarity with Code-Specific Embedding and Order-Sensitive GNN
  • Dec 2, 2022
  • Symmetry
  • Hao Gao + 4 more

Binary code similarity measurement is a popular research area in binary analysis with the recent development of deep learning-based models. Current state-of-the-art methods often use the pre-trained language model (PTLM) to embed instructions into basic blocks as representations of nodes within a control flow graph (CFG). These methods will then use the graph neural network (GNN) to embed the whole CFG and measure the binary similarities between these code embeddings. However, these methods almost directly treat the assembly code as a natural language text and ignore its code-specific features when training PTLM. Moreover, they barely consider the direction of edges in the CFG or consider it less efficiently. The weaknesses of the above approaches may limit the performances of previous methods. In this paper, we propose a novel method called function similarity using code-specific PPTs and order-sensitive GNN (FUSION). Since the similarity of binary codes is a symmetric/asymmetric problem, we were guided by the ideas of symmetry and asymmetry in our research. They measure the binary function similarity with two code-specific PTLM training strategies and an order-sensitive GNN, which, respectively, alleviate the aforementioned weaknesses. FUSION outperforms the state-of-the-art binary similarity methods by up to 5.4% in accuracy, and performs significantly better.

  • Research Article
  • Cite Count Icon 10
  • 10.3390/s23187789
IoTSim: Internet of Things-Oriented Binary Code Similarity Detection with Multiple Block Relations
  • Sep 11, 2023
  • Sensors (Basel, Switzerland)
  • Zhenhao Luo + 4 more

Binary code similarity detection (BCSD) plays a crucial role in various computer security applications, including vulnerability detection, malware detection, and software component analysis. With the development of the Internet of Things (IoT), there are many binaries from different instruction architecture sets, which require BCSD approaches robust against different architectures. In this study, we propose a novel IoT-oriented binary code similarity detection approach. Our approach leverages a customized transformer-based language model with disentangled attention to capture relative position information. To mitigate out-of-vocabulary (OOV) challenges in the language model, we introduce a base-token prediction pre-training task aimed at capturing basic semantics for unseen tokens. During function embedding generation, we integrate directed jumps, data dependency, and address adjacency to capture multiple block relations. We then assign different weights to different relations and use multi-layer Graph Convolutional Networks (GCN) to generate function embeddings. We implemented the prototype of IoTSim. Our experimental results show that our proposed block relation matrix improves IoTSim with large margins. With a pool size of , IoTSim achieves a recall@1 of 0.903 across architectures, outperforming the state-of-the-art approaches Trex, SAFE, and PalmTree.

  • Research Article
  • Cite Count Icon 2
  • 10.3390/app132312751
BlockMatch: A Fine-Grained Binary Code Similarity Detection Approach Using Contrastive Learning for Basic Block Matching
  • Nov 28, 2023
  • Applied Sciences
  • Zhenhao Luo + 4 more

Binary code similarity detection (BCSD) plays a vital role in computer security and software engineering. Traditional BCSD methods heavily rely on specific features and necessitate rich expert knowledge, which are sensitive to code alterations. To improve the robustness against minor code alterations, recent research has shifted towards machine learning-based approaches. However, existing BCSD approaches mainly focus on function-level matching and face challenges related to large batch optimization and high quality sample selection at the basic block level. To overcome these challenges, we propose BlockMatch, a novel fine-grained BCSD approach that leverages natural language processing (NLP) techniques and contrastive learning for basic block matching. We treat instructions of basic blocks as a language and utilize a DeBERTa model to capture relative position relations and contextual semantics for encoding instruction sequences. For various operands in binary code, we propose a root operand model pre-training task to mitigate semantic missing of unseen operands. We then employ a mean pooling layer to generate basic block embeddings for detecting binary code similarity. Additionally, we propose a contrastive training framework, including a block augmentation model to generate high-quality training samples, improving the effectiveness of model training. Inspired by contrastive learning, we adopt the NT-Xent loss as our objective function, which allows larger sample sizes for model training and mitigates the convergence issues caused by limited local positive/negative samples. By conducting extensive experiments, we evaluate BlockMatch and compare it against state-of-the-art approaches such as PalmTree and SAFE. The results demonstrate that BlockMatch achieves a recall@1 of 0.912 at the basic block level under the cross-compiler scenario (pool size = 10), which outperforms PalmTree (0.810) and SAFE (0.798). Furthermore, our ablation study shows that the proposed contrastive training framework and root operand model pre-training task help our model achieve superior performance.
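
For readers unfamiliar with the recall@1 metric quoted above, the following is a minimal sketch of how such a figure is commonly computed. It assumes each query embedding comes with a pool of candidate embeddings containing exactly one true match, and that candidates are ranked by cosine similarity; the function names, the convention that the true match sits at index 0, and the toy vectors are all hypothetical and are not taken from BlockMatch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_1(queries, pools):
    """Fraction of queries whose true match ranks first in its pool.

    Assumed convention: pools[i][0] is the true match for queries[i];
    the remaining pool entries are negatives.
    """
    hits = 0
    for query, pool in zip(queries, pools):
        scores = [cosine(query, candidate) for candidate in pool]
        best = max(range(len(pool)), key=scores.__getitem__)
        hits += best == 0
    return hits / len(queries)


# Toy 2-D embeddings with a pool size of 3 (the abstract uses a pool size of 10).
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
toy_pools = [
    [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]],  # true match ranks first: hit
    [[1.0, 0.2], [0.1, 1.0], [0.5, 0.5]],   # a negative ranks first: miss
]
toy_recall = recall_at_1(toy_queries, toy_pools)  # 0.5 for this toy data
```

A larger pool makes the task harder, because the single true match must outrank more negatives, which is why reported recall@1 values are only comparable at the same pool size.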

  • Research Article
  • Cite Count Icon 1
  • 10.32604/cmc.2023.028058
Deep Learning-Based Program-Wide Binary Code Similarity for Smart Contracts
  • Jan 1, 2023
  • Computers, Materials & Continua
  • Yuan Zhuang + 5 more

Recently, security issues of smart contracts are attracting great attention due to the enormous financial loss caused by vulnerability attacks. There is an increasing need to detect similar codes for hunting vulnerability with the increase of critical security issues in smart contracts. Binary similarity detection that quantitatively measures the given code diffing has been widely adopted to facilitate critical security analysis. However, due to the difference between common programs and smart contracts, such as diversity of bytecode generation and high code homogeneity, directly adopting existing graph matching and machine learning based techniques to smart contracts suffers from low accuracy, poor scalability and the limitation of binary similarity on function level. Therefore, this paper investigates graph neural networks to detect smart contract binary code similarity at the program level, where we conduct instruction-level normalization to reduce the noise code for smart contract pre-processing and construct contract control flow graphs to represent smart contracts. In particular, two improved Graph Convolutional Network (GCN) and Message Passing Neural Network (MPNN) models are explored to encode the contract graphs into quantitative vectors, which can capture the semantic information and the program-wide control flow information with temporal orders. Then we can efficiently accomplish the similarity detection by measuring the distance between two targeted contract embeddings. To evaluate the effectiveness and efficiency of our proposed method, extensive experiments are performed on two real-world datasets, i.e., smart contracts from Ethereum and Enterprise Operation System (EOS) blockchain-based platforms. The results show that our proposed approach outperforms three state-of-the-art methods by a large margin, achieving a great improvement up to 6.1% and 17.06% in accuracy.

  • Conference Article
  • Cite Count Icon 36
  • 10.1145/3433210.3437533
BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network
  • May 24, 2021
  • Yuede Ji + 2 more

Binary code similarity detection, which answers whether two pieces of binary code are similar, has been used in a number of applications, such as vulnerability detection and automatic patching. Existing approaches face two hurdles in their efforts to achieve high accuracy and coverage: (1) the problem of source-binary code similarity detection, where the target code to be analyzed is in the binary format while the comparing code (with ground truth) is in source code format. Meanwhile, the source code is compiled to the comparing binary code with either a random or fixed configuration (e.g., architecture, compiler family, compiler version, and optimization level), which significantly increases the difficulty of code similarity detection; and (2) the existence of different degrees of code similarity. Less similar code is known to be more, if not equally, important in various applications such as binary vulnerability study. To address these challenges, we design BugGraph, which performs source-binary code similarity detection in two steps. First, BugGraph identifies the compilation provenance of the target binary and compiles the comparing source code to a binary with the same provenance. Second, BugGraph utilizes a new graph triplet-loss network on the attributed control flow graph to produce a similarity ranking. The experiments on four real-world datasets show that BugGraph achieves 90% and 75% true positive rate for syntax equivalent and similar code, respectively, an improvement of 16% and 24% over state-of-the-art methods. Moreover, BugGraph is able to identify 140 vulnerabilities in six commercial firmware.

  • Research Article
  • Cite Count Icon 1
  • 10.7717/peerj-cs.2504
MSSA: multi-stage semantic-aware neural network for binary code similarity detection
  • Jan 17, 2025
  • PeerJ Computer Science
  • Bangrui Wan + 4 more

Binary code similarity detection (BCSD) aims to identify whether a pair of binary code snippets is similar, which is widely used for tasks such as malware analysis, patch analysis, and clone detection. Current state-of-the-art approaches are based on Transformer, which requires substantial computation resources. Learning-based approaches still leave room for optimization in learning the deeper semantics of binary code. In this paper, we propose MSSA, a multi-stage semantic-aware neural network for BCSD at the function level. It effectively integrates the semantic and structural information of assembly instructions within and between basic blocks, and across the entire function through four semantic-aware neural networks, achieving deep understanding of binary code semantics. MSSA is a lightweight model with only 0.38M parameters in its backbone network, suitable for deployment in CPU environments. Experimental results show that MSSA outperforms Gemini, Asm2Vec, SAFE, and jTrans in classification performance and ranks second only to the Transformer-based jTrans in retrieval performance.
