In-memory fuzzing for binary code similarity analysis

Shuai Wang,Dinghao Wu

doi:10.1109/ase.2017.8115645

Abstract

Detecting similar functions in binary executables serves as a foundation for many binary code analysis and reuse tasks. By far, recognizing similar components in binary code remains a challenge. Existing research employs either static or dynamic approaches to capture program syntax or semantics-level features for comparison. However, there exist multiple design limitations in previous work, which result in relatively high cost, low accuracy and scalability, and thus severely impede their practical use. In this paper, we present a novel method that leverages in-memory fuzzing for binary code similarity analysis. Our prototype tool IMF-SIM applies in-memory fuzzing to launch analysis towards every function and collect traces of different kinds of program behaviors. The similarity score of two behavior traces is computed according to their longest common subsequence. To compare two functions, a feature vector is generated, whose elements are the similarity scores of the behavior trace-level comparisons. We train a machine learning model through labeled feature vectors; later, for a given feature vector by comparing two functions, the trained model gives a final score, representing the similarity score of the two functions. We evaluate IMF-SIM against binaries compiled by different compilers, optimizations, and commonly-used obfuscation methods, in total over one thousand binary executables. Our evaluation shows that IMF-SIM notably outperforms existing tools with higher accuracy and broader application scopes.

Full Text