Abstract

Binary code similarity detection (BCSD) has been applied in many fields, such as malware detection, plagiarism detection and vulnerability search. Existing solutions to the BCSD problem usually either compare specific features of binaries based on the control flow graphs (CFGs) of their functions, or compute embedding vectors of binary functions and solve the problem with deep learning algorithms. In this paper, from another research perspective, we propose a new, lightweight method for the cross-version BCSD problem based on multiple features. It transforms binary functions into vectors and signals and computes a similarity coefficient and a correlation coefficient to decide whether two functions match. Without relying on the CFG of functions, deep learning algorithms or other related attributes, our method works directly on the raw bytes of each binary and can be used as an alternative method to cope with the various complex situations that exist in real-world environments. We implement the method and evaluate it on a custom dataset of about 423,282 samples. The results show that the method performs well on cross-version BCSD: its recall can reach 96.63%, which is almost the same as that of the state-of-the-art static solution.

Highlights

  • Evaluating whether two binary functions are similar or not is known as binary code similarity detection (BCSD), which has been applied in many fields, such as malware detection [1], [2], malware family analysis [3] and plagiarism detection [4], [5]

  • Major contributions of the study: to address the limitations of existing solutions, we propose a method for cross-version binary code similarity detection

  • Alongside existing solutions such as Bindiff, Gemini and Alpha-diff, our method can be used as an alternative solution to cope with the various complex situations that exist in real-world environments


Summary

INTRODUCTION

Evaluating whether two binary functions are similar is known as binary code similarity detection (BCSD), which has been applied in many fields, such as malware detection [1], [2], malware family analysis [3] and plagiarism detection [4], [5]. Need for this study: existing solutions rely on the CFG of functions and on expert knowledge to construct semantic features of binaries. These methods have achieved good results in multiple code similarity detection fields and have many important applications. The evaluation shows that the recall of our method can still reach 96.63%. This indicates that, from another research perspective, our method can perform well in cross-version binary code similarity detection without relying on CFGs, deep learning algorithms or other related attributes. We make the following contributions: 1) Without relying on the CFG of functions, deep learning algorithms or other related attributes, we propose a new and lightweight method that extracts features from the raw bytes of functions and solves cross-version BCSD problems, which could address some limitations of existing research.
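
To make the idea of comparing functions by their raw bytes more concrete, the following is a minimal illustrative sketch and not the method of the paper. It assumes a byte-frequency histogram as the vector view, the per-chunk mean byte value as the signal view (loosely inspired by the "divide into D parts" step in the outline), cosine similarity as the similarity coefficient and Pearson's r as the correlation coefficient. All function names, the `parts` parameter and both thresholds are hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def byte_histogram(code: bytes) -> list:
    """Vector view (assumed): normalised frequency of each of the 256 byte values."""
    counts = Counter(code)
    total = len(code) or 1
    return [counts.get(b, 0) / total for b in range(256)]


def chunk_means(code: bytes, parts: int) -> list:
    """Signal view (assumed): split the bytes into `parts` chunks, average each chunk."""
    n = len(code)
    signal = []
    for i in range(parts):
        chunk = code[i * n // parts:(i + 1) * n // parts]
        signal.append(sum(chunk) / len(chunk) if chunk else 0.0)
    return signal


def cosine_similarity(u, v) -> float:
    """Similarity coefficient (assumed): cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson_correlation(x, y) -> float:
    """Correlation coefficient (assumed): Pearson's r between two equal-length signals."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def looks_similar(f1: bytes, f2: bytes, parts: int = 8,
                  sim_threshold: float = 0.9, corr_threshold: float = 0.9) -> bool:
    """Hypothetical decision rule: both coefficients must exceed their thresholds."""
    sim = cosine_similarity(byte_histogram(f1), byte_histogram(f2))
    corr = pearson_correlation(chunk_means(f1, parts), chunk_means(f2, parts))
    return sim >= sim_threshold and corr >= corr_threshold
```

In this sketch, two functions with the same byte distribution but a different byte order receive a high similarity coefficient and a low correlation coefficient, which is one plausible reason for combining a vector view with a signal view.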

PROBLEM DEFINITION
8: Divide f1i into D parts
EVALUATION METRIC
PARAMETERS IN THE METHOD
DISCUSSION
LIMITATIONS AND FUTURE
CONCLUSION