Similarity Detection Techniques Research Articles

The binary code similarity detection (BCSD) technique can quantitatively measure the differences between two given binaries and give matching results at predefined granularity (e.g., function), and has been widely used in multiple scenarios including software vulnerability search, security patch analysis, malware detection, code clone detection, etc. With the help of deep learning, the BCSD techniques have achieved high accuracy in their evaluation. However, on the one hand, their high accuracy has become indistinguishable due to the lack of a standard dataset, thus being unable to reveal their abilities. On the other hand, since binary code can be easily changed, it is essential to gain a holistic understanding of the underlying transformations including default optimization options, non-default optimization options, and commonly used code obfuscations, thus assessing their impact on the accuracy and adaptability of the BCSD technique. This paper presents our observations regarding the diversity of BCSD datasets and proposes a comprehensive dataset for the BCSD technique. We employ and present detailed evaluation results of various BCSD works, applying different classifications for different types of BCSD tasks, including pure function pairing and vulnerable code detection. Our results show that most BCSD works are capable of adopting default compiler options but are unsatisfactory when facing non-default compiler options and code obfuscation. We take a layered perspective on the BCSD task and point to opportunities for future optimizations in the technologies we consider.

Read full abstract

Copying and pasting of source code is a common activity in software engineering. Often, the code is not copied as it is and it may be modified for various purposes; e.g. refactoring, bug fixing, or even software plagiarism. These code modifications could affect the performance of code similarity analysers including code clone and plagiarism detectors to some certain degree. We are interested in two types of code modification in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 code similarity detection techniques and tools using five experimental scenarios for Java source code. These are (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using ranked-based measures, and (5) local + global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer similar performance to specialised code similarity tools, whilst in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique. Its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are very sensitive to a specific data set. After directly applying optimal configurations derived from one data set to another, the tools perform poorly on the new data set. The code similarity analysers are thoroughly evaluated not only based on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code.

Read full abstract

Similarity Detection Techniques Research Articles

Articles published on Similarity Detection Techniques

BinCodex: A comprehensive and multi-level dataset for evaluating binary code similarity detection techniques

Transformer induced enhanced feature engineering for contextual similarity detection in text

Rapid detection of identity-by-descent tracts for mega-scale datasets

A Scalable Code Similarity Detection with Online Architecture and Focused Comparison for Maintaining Academic Integrity in Programming

Empirical comparison of text-based mobile apps similarity measurement techniques

A comparison of code similarity analysers

Event triggered intelligent video recording system using MS-SSIM for smart home security

Viewing functions as token sequences to highlight similarities in source code

A protocol-independent technique for eliminating redundant network traffic

A stochastic model of TCP/IP with stationary random losses

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Similarity Detection Techniques Research Articles

Articles published on Similarity Detection Techniques

BinCodex: A comprehensive and multi-level dataset for evaluating binary code similarity detection techniques

Transformer induced enhanced feature engineering for contextual similarity detection in text

Rapid detection of identity-by-descent tracts for mega-scale datasets

A Scalable Code Similarity Detection with Online Architecture and Focused Comparison for Maintaining Academic Integrity in Programming

Empirical comparison of text-based mobile apps similarity measurement techniques

A comparison of code similarity analysers

Event triggered intelligent video recording system using MS-SSIM for smart home security

Viewing functions as token sequences to highlight similarities in source code

A protocol-independent technique for eliminating redundant network traffic

A stochastic model of TCP/IP with stationary random losses