Abstract

Exploitable vulnerabilities in software systems are a major security concern. To date, machine learning (ML) based solutions have been proposed to automate and accelerate the detection of vulnerabilities. Most ML techniques aim to isolate a unit of source code, be it a line or a function, as vulnerable. We argue that a code segment is vulnerable only within certain semantic contexts, such as control flow and data flow; therefore, it is important for the detection to be context aware. In this paper, we evaluate the performance of mainstream word embedding techniques in the scenario of software vulnerability detection. Based on this evaluation, we propose a supervised framework that leverages pre-trained context-aware embeddings from language models (ELMo) to capture deep contextual representations, which are further summarized by a bidirectional long short-term memory (Bi-LSTM) layer to learn long-range code dependencies. The framework directly takes a source code function as input and produces a corresponding function embedding, which can be treated as a feature set for conventional ML classifiers. Experimental results showed that the proposed framework yielded the best performance in its downstream detection tasks. Using the feature representations generated by our framework, random forest and support vector machine classifiers outperformed four baseline systems on our data sets, demonstrating that the framework, incorporating ELMo, can effectively capture vulnerable data flow patterns and facilitate the vulnerability detection task.
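The final stage of the pipeline described above, feeding per-function embeddings into a conventional ML classifier, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random vectors below merely stand in for the ELMo + Bi-LSTM function embeddings, and all sizes and parameters are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder for the framework's output: one fixed-length embedding per
# source-code function. In the paper these come from ELMo + Bi-LSTM; here
# random vectors stand in, with a small shift added to the "vulnerable"
# class so that the two classes are separable.
n_functions, embed_dim = 400, 64
labels = rng.integers(0, 2, n_functions)          # 0 = benign, 1 = vulnerable
embeddings = rng.normal(size=(n_functions, embed_dim)) + labels[:, None] * 0.8

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.25, random_state=0)

# Conventional ML classifier consuming the function embeddings as features.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(f"test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2f}")
```

A support vector machine (the paper's other downstream classifier) could be swapped in by replacing the `RandomForestClassifier` with `sklearn.svm.SVC` on the same feature matrix.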

Highlights

  • The rapid increase in the number of disclosed software vulnerabilities has posed a huge security threat to the cyberspace worldwide [1,2,3,4,5]

  • To combat the potential cyber threats caused by exploitable vulnerabilities in software, machine learning (ML) based and data-driven approaches have been proposed for bug/vulnerability detection [6,7,8]

  • The application of deep learning for code analysis requires the transformation of software code to vector representations recognizable by DL algorithms; this is still challenging [15]


Summary

Introduction

The rapid increase in the number of disclosed software vulnerabilities has posed a huge security threat to cyberspace worldwide [1,2,3,4,5]. Source code can be encoded at the token level [18,19] or at the character level [20,21] and processed as text. These techniques generally represent individual words as atomic units, ignoring the similarity between words and the relationships among them, which makes it difficult for downstream algorithms to learn expressive and context-rich semantics. These issues were tackled by distributed word embedding techniques such as Word2Vec [22], GloVe (global vectors for word representation) [23], and FastText [24]. Techniques such as Word2Vec learn word probabilities from contextual information, are capable of capturing word similarities, and form the foundation for many existing studies that require learning code semantics for code analysis tasks such as vulnerability detection [5,14,25,26,27,28]
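The core idea behind such distributed embeddings, that tokens used in similar contexts receive similar vectors, can be illustrated with a small sketch. This is not Word2Vec itself but a related count-based construction (a token co-occurrence matrix factorized by a low-rank SVD), and the toy code-token sequences below are made up for illustration.

```python
import numpy as np

# Toy corpus: whitespace-tokenized code snippets. `malloc` and `calloc`
# appear in identical contexts, so their embeddings should coincide.
functions = [
    "ptr = malloc ( size ) ;",
    "buf = malloc ( len ) ;",
    "ptr = calloc ( size ) ;",
    "buf = calloc ( len ) ;",
    "return ptr ;",
]
tokens = sorted({t for f in functions for t in f.split()})
index = {t: i for i, t in enumerate(tokens)}

# Count co-occurrences within a +/-2 token window.
cooc = np.zeros((len(tokens), len(tokens)))
for f in functions:
    seq = f.split()
    for i, t in enumerate(seq):
        for j in range(max(0, i - 2), min(len(seq), i + 3)):
            if i != j:
                cooc[index[t], index[seq[j]]] += 1

# A low-rank factorization of the co-occurrence matrix yields dense,
# low-dimensional token embeddings (LSA-style).
u, s, _ = np.linalg.svd(cooc, full_matrices=False)
emb = u[:, :4] * s[:4]

def cos(a, b):
    """Cosine similarity between the embeddings of two tokens."""
    va, vb = emb[index[a]], emb[index[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(cos("malloc", "calloc"), cos("malloc", "return"))
```

Word2Vec learns comparable vectors with a predictive (skip-gram or CBOW) objective rather than explicit counting, but the resulting geometry, where contextually interchangeable tokens sit close together, is what the vulnerability detection studies cited above build on.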

