Abstract

Software metrics are widely used indicators of software quality, and several studies have shown that such metrics can be used to estimate the presence of vulnerabilities in the code. In this paper, we present a comprehensive experiment to study how effective software metrics can be at distinguishing vulnerable code units from non-vulnerable ones. To this end, we use several machine learning algorithms (Random Forest, Extreme Boosting, Decision Tree, SVM Linear, and SVM Radial) to extract vulnerability-related knowledge from software metrics collected from the source code of several representative software projects developed in C/C++ (Mozilla Firefox, Linux Kernel, Apache HTTPd, Xen, and Glibc). We consider different combinations of software metrics and diverse application scenarios with different security concerns (e.g., highly critical or non-critical systems). This experiment contributes to understanding whether software metrics can effectively be used to distinguish vulnerable code units in different application scenarios, and how machine learning algorithms can help in this regard. The main observation is that using machine learning algorithms on top of software metrics helps to indicate vulnerable code units with a relatively high level of confidence for security-critical software systems (where the focus is on detecting the maximum number of vulnerabilities, even if false positives are reported), but they are not helpful for low-criticality or non-critical systems due to the high number of false positives (which bring an additional development cost that is frequently not affordable).

Highlights

  • Several research studies show that software defects/vulnerabilities (e.g., Buffer overflow, SQL injection) are a central and critical source of security breaches [1]–[3] in computer systems

  • This study considers several commonly used machine learning (ML) algorithms (Random Forest, Extreme Boosting, Decision Tree, Support Vector Machine (SVM) Linear, and SVM Radial) that are applied to software metrics of various types (e.g., Cyclomatic Complexity, Lines of Code, and Coupling Between Objects), collected at different levels from the source code of several widely used and representative software projects developed in C/C++ (Mozilla Firefox, Linux Kernel, Apache HTTPd, Xen, and Glibc)

  • This paper presents a comprehensive study on the use of software metrics and machine learning algorithms for the detection/prediction of vulnerable code
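The approach the highlights summarize, feeding code-level software metrics to ML classifiers such as Random Forest, can be sketched as follows. This is a minimal illustration on synthetic data: the metric values, the label model, and the class weights are all hypothetical, not taken from the paper's experiment.

```python
# Hedged sketch: classifying code units as vulnerable/non-vulnerable from
# software metrics with a Random Forest. All data below is synthetic; the
# actual study extracts real metrics from projects such as Mozilla Firefox
# and the Linux Kernel, labeled using known vulnerability reports.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(42)
n = 1000
complexity = rng.integers(1, 50, n)    # McCabe cyclomatic complexity
loc = rng.integers(10, 2000, n)        # lines of code
coupling = rng.integers(0, 20, n)      # coupling between objects
X = np.column_stack([complexity, loc, coupling]).astype(float)

# Hypothetical ground truth: larger, more complex units are assumed more
# likely to be vulnerable (a simplification for illustration only).
logit = 0.08 * complexity + 0.002 * loc + 0.1 * coupling - 5.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# For highly critical systems the study's focus is recall (detect the
# maximum number of vulnerabilities, tolerating false positives), so the
# positive class is weighted more heavily here; a non-critical system
# would instead prioritize precision to keep review costs affordable.
clf = RandomForestClassifier(n_estimators=100,
                             class_weight={0: 1, 1: 5},
                             random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
precision = precision_score(y_te, pred)
recall = recall_score(y_te, pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Tilting `class_weight` toward the vulnerable class mirrors the paper's observation that the usefulness of such models depends on the application scenario: the same classifier can be tuned for high recall (security-critical systems) or high precision (cost-sensitive systems).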


Introduction

Several research studies show that software defects/vulnerabilities (e.g., buffer overflow, SQL injection) are a central and critical source of security breaches [1]–[3] in computer systems. Organizations and critical infrastructures are backed by software systems that execute critical operations and transactions, provide services, and handle huge amounts of sensitive data to support effective decisions and constant business/system adaptation. This has tremendously increased security concerns, driving researchers and businesses to come up with tools, techniques, standards, and regulations that help developers ensure security in software systems [13], [14]. Sensei [30] is another example that tries to enforce secure coding guidelines in the integrated development environment. It is still very difficult for developers, if not impossible, to build software without vulnerabilities. This has led to many works trying to mitigate the damage that such vulnera-
