Return Instruction Classification in Binary Code Using Machine Learning

Jing Qiu, Feng Dong, Xiaoxu Geng

doi:10.1142/s0218194022500565

Abstract

Binary code analysis is vital when source code is unavailable, as in malware analysis and software vulnerability mining. A common first step is function identification. Most function identification methods rely on function prologs and epilogs. However, functions may not have standard prologs or epilogs, so other methods are needed to identify them. One approach is to identify return instructions first and then locate the start of the function. Existing work uses a multi-layer perceptron model to identify and validate a return instruction at a specific location. Building on this, a new approach is proposed to improve accuracy and provide more detail. Specifically, a return instruction is classified into one of three classes: (1) false return instruction, (2) true return instruction inside a function but not its last instruction, and (3) true return instruction at the end of a function. The evaluation is performed on 5782 real-world binaries. Common classifiers, including a fully connected neural network, Two-layer Bidirectional Recurrent Neural Network (TBRNN), Two-layer Bidirectional Gated Recurrent Unit (TBGRU), Two-layer Bidirectional Long Short-Term Memory Network (TBLSTM), Decision Tree, Random Forest, XGBoost, and Support Vector Machine (SVM), are evaluated on the same data set. The results show that TBLSTM achieves an accuracy of 99.78%, which is higher than that of every other classifier in the evaluation and of the state-of-the-art tool IDA Pro 7.7.
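To make the three-class formulation concrete, the following is a minimal sketch of how candidate return instructions might be collected before classification. It assumes x86 code, where the return opcodes are C3, C2, CB, and CA; the abstract does not state the target architecture or the feature encoding, so the names ReturnClass and candidate_windows, the window radius, and the zero padding are all hypothetical and chosen for illustration only.

```python
from enum import IntEnum


class ReturnClass(IntEnum):
    """The three classes named in the abstract (numeric values are arbitrary)."""
    FALSE_RETURN = 0   # byte matches a return opcode but is not a return
    INNER_RETURN = 1   # true return inside a function, not its last instruction
    FINAL_RETURN = 2   # true return at the end of a function


# Assumption: x86 return opcodes (ret, ret imm16, retf, retf imm16).
RET_OPCODES = frozenset({0xC3, 0xC2, 0xCB, 0xCA})


def candidate_windows(code: bytes, radius: int = 4, pad: int = 0x00):
    """Yield (offset, window) for each byte that matches a return opcode.

    The window holds `radius` bytes on each side of the candidate, padded
    with `pad` where it runs past the buffer. A classifier would take such
    a window as input and predict one ReturnClass value for it.
    """
    for offset, byte in enumerate(code):
        if byte not in RET_OPCODES:
            continue
        window = bytes(
            code[i] if 0 <= i < len(code) else pad
            for i in range(offset - radius, offset + radius + 1)
        )
        yield offset, window


if __name__ == "__main__":
    # mov eax, 0xC3 ; ret
    # The first 0xC3 is an immediate operand (a false return); the second is real.
    sample = bytes.fromhex("b8c3000000c3")
    for offset, window in candidate_windows(sample, radius=2):
        print(offset, window.hex())
```

In this sample both C3 bytes are yielded as candidates, which is the situation the first class is meant to handle: a byte scan alone cannot tell an immediate operand from a real return, so the surrounding bytes are passed to the classifier.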
