Function-Level Compilation Provenance Identification with Multi-Faceted Neural Feature Distillation and Fusion

Yang Gao,Yifei Li,Yu Wang,Lunjin Liang,Rui Li

doi:10.3390/electronics13091692

Yang Gao, Yifei Li + Show 3 more

Open Access

https://doi.org/10.3390/electronics13091692

Copy DOI

Abstract

In the landscape of software development, the selection of compilation tools and settings plays a pivotal role in the creation of executable binaries. This diversity, while beneficial, introduces significant challenges for reverse engineers and security analysts in deciphering the compilation provenance of binary code. To this end, we present MulCPI, short for Multi-representation Fusion-based Compilation Provenance Identification, which integrates the features collected from multiple distinct intermediate representations of the binary code for better discernment of the fine-grained function-level compilation details. In particular, we devise a novel graph-oriented neural encoder improved upon the gated graph neural network by subtly introducing an attention mechanism into the neighborhood nodes’ information aggregation computation, in order to better distill the more informative features from the attributed control flow graph. By further integrating the features collected from the normalized assembly sequence with an advanced Transformer encoder, MulCPI is capable of capturing a more comprehensive set of features manifesting the multi-faceted lexical, syntactic, and structural insights of the binary code. Extensive evaluation on a public dataset comprising 854,858 unique functions demonstrates that MulCPI exceeds the performance of current leading methods in identifying the compiler family, optimization level, compiler version, and the combination of compilation settings. It achieves average accuracy rates of 99.3%, 96.4%, 90.7%, and 85.3% on these tasks, respectively. Additionally, an ablation study highlights the significance of MulCPI’s core designs, validating the efficiency of the proposed attention-enhanced gated graph neural network encoder and the advantages of incorporating multiple code representations.

Full Text