Abstract

The binary-binary function matching problem serves as a foundation for many reverse engineering techniques, such as binary diffing, malware analysis, and code plagiarism detection. In the literature, function matching is performed by first extracting function features (syntactic and semantic) and then using these features as selection criteria to formulate an approximate 1:1 correspondence between binary functions. The accuracy of the approximation depends on selecting effective features. Although substantial research has been conducted on this topic, we have identified two major drawbacks in previous work: (i) the features are optimized for a single architecture, and their matching efficiency drops for other architectures; (ii) function matching algorithms mainly focus on the structural properties of a function, which are not inherently resilient against compiler optimizations. To address architecture dependency and compiler optimizations, we use the intermediate representation (IR) of function assembly and propose a set of syntactic and semantic (embedding-based) features that are efficient across multiple architectures and sensitive to compiler-based optimizations. The proposed function matching algorithm employs one-shot encoding that is flexible to small changes and uses a KNN-based approach to map similar functions effectively. We evaluated the proposed features and algorithms on various binaries compiled for x86 and ARM architectures, and compared the prototype implementation with Diaphora (an industry-standard tool) and other baseline research. The prototype achieved a matching accuracy of approximately 96%, which is higher than that of the compared tools and consistent across optimizations and multi-architecture binaries.
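The KNN-based matching step can be pictured with a minimal sketch. It is not the paper's algorithm: the cosine similarity measure, the greedy 1:1 assignment, the `k` and `threshold` parameters, and the toy vectors are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_match(source, target, k=3, threshold=0.8):
    """Greedy 1:1 matching over each source function's k nearest targets.

    source, target: dict mapping function name -> feature vector.
    Returns (matches, unmatched); matches maps source name -> target name.
    """
    candidates = []
    for s_name, s_vec in source.items():
        scored = sorted(
            ((cosine_similarity(s_vec, t_vec), t_name)
             for t_name, t_vec in target.items()),
            reverse=True,
        )
        # Keep only the k nearest neighbours that are similar enough.
        for score, t_name in scored[:k]:
            if score >= threshold:
                candidates.append((score, s_name, t_name))

    matches, used_targets = {}, set()
    # The most similar pairs claim their partner first (hypothetical policy).
    for score, s_name, t_name in sorted(candidates, reverse=True):
        if s_name not in matches and t_name not in used_targets:
            matches[s_name] = t_name
            used_targets.add(t_name)

    unmatched = [name for name in source if name not in matches]
    return matches, unmatched


# Toy vectors standing in for function embeddings (made-up values).
x86_funcs = {"parse": [3.0, 1.0, 0.0], "hash": [0.0, 2.0, 5.0], "init": [0.0, 1.0, 0.0]}
arm_funcs = {"sub_100": [0.0, 2.1, 4.8], "sub_200": [2.9, 1.2, 0.1]}
matches, unmatched = knn_match(x86_funcs, arm_funcs)
```

In this toy run `parse` and `hash` each find a close neighbour, while `init` has no candidate above the threshold and is reported as unmatched rather than forced into a pairing.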

Highlights

  • An assembly function can have either a graphical representation, such as its control flow graph (CFG), call graph (CG), or data flow graph (DFG), or a textual representation, such as its assembly code, intermediate representation (IR) code, or embedding vectors.

  • The evaluation parameters are the counts of true positives and false positives among the matched functions, and the count of unmatched functions.

  • The improved efficiency of the proposed model is linked to two factors: (i) it uses the Evaluable Strings Intermediate Language (ESIL) as its IR, which filters assembly noise and splits IR statements into tokens that are finite in number and can be learned efficiently from a smaller dataset; (ii) the function vector computation based on Algorithm 3 respects the structural properties of functions, and the model output vector captures diverse properties and efficient
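To make the ESIL tokenization point concrete, here is a minimal sketch of splitting ESIL statements into a small, closed token set. ESIL statements are comma-separated; replacing numeric constants with a `<CONST>` placeholder is an assumption made here to keep the vocabulary finite, and the sample statements are illustrative rather than taken from the paper.

```python
import re

# Matches hexadecimal (0x...) or decimal integer literals.
_NUMBER = re.compile(r"^-?(0x[0-9a-fA-F]+|\d+)$")


def tokenize_esil(statement, const_token="<CONST>"):
    """Split one ESIL statement on commas and normalize numeric literals."""
    tokens = []
    for raw in statement.split(","):
        token = raw.strip()
        if not token:
            continue
        tokens.append(const_token if _NUMBER.match(token) else token)
    return tokens


def build_vocabulary(statements):
    """Assign each distinct token an integer id in first-seen order."""
    vocab = {}
    for statement in statements:
        for token in tokenize_esil(statement):
            vocab.setdefault(token, len(vocab))
    return vocab


# Illustrative ESIL strings (stack adjustment, memory load, register move).
sample = ["0x10,rsp,-=", "8,rbp,+,[8],rax,=", "rbx,rax,="]
vocab = build_vocabulary(sample)
```

Because constants collapse to one token, statements that differ only in an offset or immediate share the same token sequence, which is one plausible reason a model could learn from a smaller dataset.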

Summary

Introduction

An assembly function can have either a graphical representation, such as its control flow graph (CFG), call graph (CG), or data flow graph (DFG), or a textual representation, such as its assembly code, intermediate representation (IR) code, or embedding vectors. Because graph isomorphism has no known polynomial-time algorithm, most existing function matching research [4]–[8] extracts syntactic features from the structural representation and formulates an approximate solution based on the extracted features. As a diverse feature set can boost the accuracy of function matching, research works [6], [9], [10] extend their feature set by extracting semantic features from the function assembly code. Research works such as FOSSIL [11] adopt an alternative approach to graph isomorphism: they aim to boost the efficiency of graph-based CFG matching. However, these techniques are computationally expensive (NP class), have not been evaluated for the multi-architecture binary matching problem, and are outside the scope of this research.
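The feature-based approximation described above can be sketched as follows. The specific features (block count, edge count, cyclomatic complexity, exit-block count) and the L1 distance are assumptions for illustration, not the feature set of any cited work.

```python
def cfg_features(cfg):
    """Compute simple structural features of a control flow graph.

    cfg: dict mapping basic-block id -> list of successor block ids.
    """
    nodes = set(cfg)
    for successors in cfg.values():
        nodes.update(successors)
    num_nodes = len(nodes)
    num_edges = sum(len(successors) for successors in cfg.values())
    # Cyclomatic complexity of a single connected CFG: E - N + 2.
    cyclomatic = num_edges - num_nodes + 2
    # Exit blocks are blocks without successors.
    num_exits = sum(1 for node in nodes if not cfg.get(node))
    return (num_nodes, num_edges, cyclomatic, num_exits)


def feature_distance(f1, f2):
    """L1 distance between two feature tuples (smaller means more similar)."""
    return sum(abs(a - b) for a, b in zip(f1, f2))


# Hypothetical CFGs: an if/else diamond and a simple loop.
diamond = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
loop = {"entry": ["body"], "body": ["body", "exit"], "exit": []}

print(cfg_features(diamond))  # (4, 4, 2, 1)
print(cfg_features(loop))     # (3, 3, 2, 1)
print(feature_distance(cfg_features(diamond), cfg_features(loop)))  # 2
```

Comparing feature tuples takes time linear in the graph size, which is what makes this an inexpensive stand-in for an exact isomorphism test. The trade-off is that structurally different graphs can share a tuple, so the result is only approximate.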

