Bin2vec: learning representations of binary executable programs for security tasks

Shushan Arakelyan,Erik Kline,Christophe Hauser,Aram Galstyan,Sima Arasteh

doi:10.1186/s42400-021-00088-4

Abstract

Tackling binary program analysis problems has traditionally implied manually defining rules and heuristics, a tedious and time-consuming task for human analysts. To improve automation and scalability, we propose an alternative direction based on distributed representations of binary programs, with applicability to a number of downstream tasks. We introduce Bin2vec, a new approach that leverages Graph Convolutional Networks (GCN) along with computational program graphs to learn a high-dimensional representation of binary executable programs. We demonstrate the versatility of this approach by using our representations to solve two semantically different binary analysis tasks: functional algorithm classification and vulnerability discovery. We compare the proposed approach to our own strong baseline as well as to published results, and demonstrate improvement over state-of-the-art methods for both tasks. We evaluated Bin2vec on 49191 binaries for the functional algorithm classification task, and on 30 different CWE-IDs including at least 100 CVE entries each for the vulnerability discovery task. We set a new state-of-the-art result by reducing the classification error by 40% compared to the source-code-based inst2vec approach, while working on binary code. For almost every vulnerability class in our dataset, our prediction accuracy is over 80% (and over 90% in multiple classes).

Highlights

For many security problems, researchers rely on binary code analysis, as they need to inspect binary executable program files without access to any source code.
Our main contributions are: (i) to the best of our knowledge, we are the first to suggest a distributed representation learning approach for binary executable programs that is demonstrated to work for different downstream tasks; (ii) to this end, we present a deep learning model for modelling the structure and computations of binary executable programs and for learning their representations; (iii) to prove the concept that distributed representations of binary executable programs can be applied to downstream program analysis tasks, we evaluate our approach on two distinct problems, functional algorithm classification and vulnerability discovery across multiple vulnerability classes, and show improvement over current state-of-the-art approaches on both.
Instead of enhancing the basic blocks in Control Flow Graphs (CFGs) with a few attributes, we suggest enriching them by expanding the computations in each basic block into a computational tree, relying on the graph embedding model to capture attributes such as the number of instructions if necessary.
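To make the last highlight more concrete, the following is a minimal sketch of how the computations in one basic block might be expanded into a graph. It is not the authors' implementation: the three-address instruction format `(dest, op, src1, src2)`, the `Graph` class, and the `expand_block` function are all assumptions introduced for illustration. Because this sketch reuses the node that last defined a value, it strictly builds a DAG rather than a tree.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Hypothetical container: node labels plus (parent, child) edges."""
    labels: list = field(default_factory=list)  # node id -> label
    edges: list = field(default_factory=list)   # (parent id, child id)

    def add_node(self, label):
        self.labels.append(label)
        return len(self.labels) - 1


def expand_block(instructions):
    """Expand a basic block into a computation graph.

    `instructions` is an assumed format: a list of
    (dest, op, src1, src2) tuples in program order.
    """
    g = Graph()
    block = g.add_node("BLOCK")  # root node standing for the basic block
    latest = {}                  # value name -> node id of its latest definition

    def operand(name):
        # Unknown names are block inputs or constants: create a leaf node.
        if name not in latest:
            latest[name] = g.add_node(name)
        return latest[name]

    for dest, op, a, b in instructions:
        left, right = operand(a), operand(b)
        node = g.add_node(op)
        g.edges += [(node, left), (node, right)]  # operation -> its operands
        g.edges.append((block, node))             # block -> each instruction
        latest[dest] = node                       # later uses point at this node
    return g


block_graph = expand_block([
    ("t0", "add", "eax", "ebx"),
    ("t1", "mul", "t0", "4"),
])
# Number of instructions, recovered from the graph structure alone.
num_instructions = sum(1 for parent, _ in block_graph.edges if parent == 0)
```

In this sketch the number of instructions is not stored as an attribute; it is recoverable as the out-degree of the `BLOCK` node, which illustrates why a graph embedding model could in principle learn such attributes from structure.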

Summary

Introduction

Researchers rely on binary code analysis when they need to inspect binary executable program files without access to any source code. This is often needed when analyzing commercial code that is protected by intellectual property and whose source code is not available, but it is also useful in other scenarios. These include dealing with unsupported or legacy executables, where the information about the exact version of the source code is lost, or where the original source code itself may be lost. Binary analysis is also frequently used in testing to improve the security of a system, as in black-box penetration testing, where the goal is to check the binary for any weaknesses or vulnerabilities that could be abused. When dealing with stripped binaries, even reconstructing function entry points can be challenging.

Objectives

Results

Conclusion