Pattern-Based Vulnerability Discovery

Fabian Yamaguchi

doi:10.53846/goediss-5356

Abstract

With our increasing reliance on the correct functioning of computer systems, identifying and eliminating vulnerabilities in program code is gaining in importance. To date, the vast majority of these flaws are found by tedious manual auditing of code conducted by experienced security analysts. Unfortunately, a single missed flaw can suffice for an attacker to fully compromise a system, and thus, the sheer amount of code plays into the attacker’s hands. On the defender’s side, this creates a persistent demand for methods that assist in the discovery of vulnerabilities at scale. This thesis introduces pattern-based vulnerability discovery, a novel approach to identifying vulnerabilities that combines techniques from static analysis, machine learning, and graph mining to augment the analyst’s abilities rather than trying to replace her. The main idea of this approach is to leverage patterns in the code to narrow in on potential vulnerabilities, where these patterns may be formulated manually, derived from the security history, or inferred from the code directly. We base our approach on a novel architecture for robust analysis of source code that enables large amounts of code to be mined for vulnerabilities via traversals in a code property graph, a joint representation of a program’s syntax, control flow, and data flow. While the platform is useful in its own right for identifying occurrences of manually defined patterns, we proceed to show that it also offers a rich data source for automatically discovering and exposing patterns in code. To this end, we develop different vectorial representations of source code based on symbols, trees, and graphs, allowing it to be processed with machine learning algorithms. Ultimately, this enables us to devise three unique pattern-based techniques for vulnerability discovery, each of which addresses a different task encountered in day-to-day auditing by exploiting a different one of the three main capabilities of unsupervised learning methods.
In particular, we present a method to identify vulnerabilities similar to a known vulnerability, a method to uncover missing checks linked to security-critical objects, and finally, a method that closes the loop by automatically generating traversals for our code analysis platform to explicitly express and store vulnerable programming patterns. We empirically evaluate our methods on the source code of popular and widely used open source projects, both in controlled settings and in real-world code audits. In controlled settings, we find that all methods considerably reduce the amount of code that needs to be inspected. In real-world audits, our methods allow us to expose many previously unknown and often critical vulnerabilities, including vulnerabilities in the VLC media player, the instant messenger Pidgin, and the Linux kernel.
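To make the idea of a traversal over a code property graph more concrete, the following is a minimal sketch in Python. It is not the thesis's actual platform or query language; the class `CodePropertyGraph`, the edge labels `AST`, `CFG`, and `DDG`, the node properties, and the two example functions are all assumptions chosen for illustration. The sketch stores syntax, control flow, and data flow edges in a single graph and then flags calls to a sink function whose argument is never tested by a condition on the way from its definition.

```python
from collections import defaultdict


class CodePropertyGraph:
    """Toy graph: nodes carry properties, edges carry a label (AST, CFG, DDG)."""

    def __init__(self):
        self.nodes = {}                 # node id -> property dict
        self.edges = defaultdict(list)  # (source id, label) -> [target ids]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, label, dst):
        self.edges[(src, label)].append(dst)

    def out(self, node_id, label):
        return self.edges.get((node_id, label), [])


def cfg_reaches(g, src, dst):
    """Depth-first search along control flow edges only."""
    seen, stack = set(), [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(g.out(node, "CFG"))
    return False


def unchecked_sinks(g, sink_name):
    """Return sink calls using a value that no condition tests beforehand."""
    findings = []
    for def_id, props in g.nodes.items():
        if props["type"] != "Assignment":
            continue
        uses = g.out(def_id, "DDG")  # statements that read the assigned value
        checks = [u for u in uses if g.nodes[u]["type"] == "Condition"]
        for use in uses:
            node = g.nodes[use]
            if node["type"] == "Call" and node.get("callee") == sink_name:
                if not any(cfg_reaches(g, c, use) for c in checks):
                    findings.append(use)
    return findings


def enclosing_function(g, node_id):
    """Follow syntax edges to find the function that contains a statement."""
    for fid, props in g.nodes.items():
        if props["type"] == "Function" and node_id in g.out(fid, "AST"):
            return props["name"]
    return None


g = CodePropertyGraph()

# Hypothetical function 1: n = read_len(); if (n < 64) memcpy(buf, src, n);
g.add_node("fa", type="Function", name="copy_checked")
g.add_node("a1", type="Assignment", code="n = read_len()")
g.add_node("a2", type="Condition", code="n < 64")
g.add_node("a3", type="Call", callee="memcpy", code="memcpy(buf, src, n)")
for child in ("a1", "a2", "a3"):
    g.add_edge("fa", "AST", child)
g.add_edge("a1", "CFG", "a2")
g.add_edge("a2", "CFG", "a3")
g.add_edge("a1", "DDG", "a2")
g.add_edge("a1", "DDG", "a3")

# Hypothetical function 2: n = read_len(); memcpy(buf, src, n);
g.add_node("fb", type="Function", name="copy_unchecked")
g.add_node("b1", type="Assignment", code="n = read_len()")
g.add_node("b2", type="Call", callee="memcpy", code="memcpy(buf, src, n)")
for child in ("b1", "b2"):
    g.add_edge("fb", "AST", child)
g.add_edge("b1", "CFG", "b2")
g.add_edge("b1", "DDG", "b2")

for sink in unchecked_sinks(g, "memcpy"):
    print(enclosing_function(g, sink), "->", g.nodes[sink]["code"])
```

The query needs all three edge types at once: data flow edges connect the definition to its uses, control flow edges decide whether a condition precedes the sink, and syntax edges map the finding back to its function. A real analysis would also have to handle aliasing, interprocedural flow, and whether the condition actually bounds the value, all of which this sketch ignores.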
