Binary analysis is useful in many practical applications, such as the detection of malware or vulnerable software components. However, our survey of the literature shows that most existing binary analysis tools and frameworks rely on assumptions about specific compilers and compilation settings. It is well known that techniques such as refactoring and light obfuscation can significantly alter the structure of code, even for simple programs. Applying such techniques or changing the compiler and compilation settings can significantly affect the accuracy of available binary analysis tools, which severely limits their practicability, especially when applied to malware. To address these issues, we propose a novel technique that extracts the semantics of binary code in terms of both data and control flow. Our technique allows more robust binary analysis because the extracted semantics of the binary code is generally immune from light obfuscation, refactoring, and varying the compilers or compilation settings. Specifically, we apply data-flow analysis to extract the semantic flow of the registers as well as the semantic components of the control flow graph, which are then synthesized into a novel representation called the semantic flow graph (SFG). Subsequently, various properties, such as reflexive, symmetric, antisymmetric, and transitive relations, are extracted from the SFG and applied to binary analysis. We implement our system in a tool called BinGold and evaluate it against thirty binary code applications. Our evaluation shows that BinGold successfully determines the similarity between binaries, yielding results that are highly robust against light obfuscation and refactoring. In addition, we demonstrate the application of BinGold to two important binary analysis tasks: binary code authorship attribution, and the detection of clone components across program executables. The promising results suggest that BinGold can be used to enhance existing techniques, making them more robust and practical.
Read full abstract