Abstract

In this paper, we address the problem of automatic repair of software vulnerabilities with deep learning. The major problem with data-driven vulnerability repair is that the few existing datasets of known, confirmed vulnerabilities contain only a few thousand examples, whereas training a deep learning model often requires hundreds of thousands of examples. In this work, we leverage the intuition that the bug fixing task and the vulnerability fixing task are related, and that the knowledge learned from bug fixes can be transferred to fixing vulnerabilities; in the machine learning community, this technique is called transfer learning. We propose VRepair, an approach for repairing security vulnerabilities based on transfer learning. VRepair is first trained on a large bug fix corpus and is then tuned on a vulnerability fix dataset, which is an order of magnitude smaller. In our experiments, we show that a model trained only on a bug fix corpus can already fix some vulnerabilities. We then demonstrate that transfer learning improves the ability to repair vulnerable C functions. We also show that the transfer learning model performs better than a model trained with a denoising task and fine-tuned on the vulnerability fixing task. In summary, this paper shows that transfer learning works well for repairing security vulnerabilities in C compared to learning on a small dataset.
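
The two-stage recipe described above can be illustrated with a short sketch: pre-train a sequence-to-sequence model on the large bug fix corpus, then continue training the same weights on the much smaller vulnerability fix dataset. The tiny GRU encoder-decoder, the toy data, and the hyperparameters below are illustrative assumptions, not the actual VRepair implementation.

```python
import torch
import torch.nn as nn

VOCAB = 1000  # assumed toy vocabulary size for tokenized C functions


class TinySeq2Seq(nn.Module):
    """Stand-in encoder-decoder; the real system would use a larger model."""

    def __init__(self, vocab=VOCAB, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src, tgt):
        _, h = self.encoder(self.embed(src))       # encode the buggy/vulnerable function
        dec, _ = self.decoder(self.embed(tgt), h)  # decode the fix (teacher forcing)
        return self.out(dec)                       # logits over the vocabulary


def train(model, pairs, epochs, lr):
    """One generic training loop, reused for both stages."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for src, tgt in pairs:
            logits = model(src, tgt[:, :-1])       # predict each next target token
            loss = loss_fn(logits.reshape(-1, VOCAB), tgt[:, 1:].reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()


def toy_pairs(n):
    """Stand-in for tokenized (buggy function, fixed function) pairs."""
    return [(torch.randint(0, VOCAB, (8, 12)), torch.randint(0, VOCAB, (8, 12)))
            for _ in range(n)]


model = TinySeq2Seq()
# Stage 1: pre-train on the large, generic bug fix corpus.
train(model, toy_pairs(100), epochs=1, lr=1e-3)
# Stage 2: fine-tune the same weights on the small vulnerability fix dataset,
# typically with a lower learning rate so the transferred knowledge is preserved.
train(model, toy_pairs(10), epochs=5, lr=1e-4)
```

The key point is that the second stage starts from the weights produced by the first stage rather than from a random initialization, which is what allows knowledge learned from generic bug fixes to transfer to the much smaller vulnerability fixing task.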

Highlights

  • We empirically demonstrate that on the vulnerability fixing task, the transfer learning VRepair model performs better than the alternatives: 1) VRepair is better than a model trained only on the small vulnerability fix dataset; 2) VRepair is better than a model trained on a large generic bug fix corpus; 3) VRepair is better than a model pre-trained with a denoising task (a sketch of such a denoising objective is shown after this list).

  • The transfer learning model reported in our results is first trained on the large bug fix corpus and then tuned with the vulnerability fix examples.
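
For context, the denoising baseline mentioned above is pre-trained to reconstruct original code from a corrupted version of it. The sketch below shows one common way of producing such (corrupted, original) training pairs by masking random token spans; the masking scheme and the <mask> sentinel are illustrative assumptions, not the exact setup used in the paper's experiments.

```python
import random

MASK = "<mask>"  # assumed sentinel token


def corrupt(tokens, mask_ratio=0.15, max_span=3):
    """Replace a few random spans of tokens with a single <mask> sentinel."""
    noisy = list(tokens)
    target_masked = max(1, int(len(tokens) * mask_ratio))
    masked = 0
    while masked < target_masked and noisy:
        start = random.randrange(len(noisy))
        span = random.randint(1, max_span)
        removed = noisy[start:start + span]
        del noisy[start:start + span]
        noisy.insert(start, MASK)
        masked += len(removed)
    return noisy


source = "if ( len > size ) return - 1 ;".split()
noisy_input = corrupt(source)
# The model is pre-trained to map noisy_input back to source.
print(noisy_input, "->", source)
```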

Summary

Introduction

On the code hosting platform GitHub, the number of newly created code repositories increased by 35% in 2020 compared to 2019, reaching 60 million new repositories during 2020 [1]. While source code is abundant, the existing datasets of confirmed vulnerability fixes contain only a few thousand examples. By contrast, neural models for a translation task (English to French) have been trained on over 41 million sentence pairs [10], and the popular summarization dataset CNN-DM [11] contains 300 thousand text pairs. One common kind of vulnerability is a buffer overflow, which allows an attacker to write past a buffer's boundary and inject malicious code. Another example is an SQL injection, where malicious SQL statements are inserted into executable queries. Each vulnerability with a CVE ID is assigned to a Common Weakness Enumeration (CWE) category representing the generic type of problem.
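
To make the buffer overflow case concrete, the snippet below shows the kind of (vulnerable function, fixed function) pair that a learned repair model is trained on: an unbounded strcpy into a fixed-size buffer and its bounded replacement. The C code inside the strings is a textbook illustration written for this example, not a sample from the paper's dataset.

```python
# A hypothetical training example for the vulnerability fixing task.
vulnerable = """
void copy_name(const char *input) {
    char buf[16];
    strcpy(buf, input);   /* CWE-120: no bounds check, input can overflow buf */
    printf("%s\\n", buf);
}
"""

fixed = """
void copy_name(const char *input) {
    char buf[16];
    strncpy(buf, input, sizeof(buf) - 1);  /* copy at most 15 bytes */
    buf[sizeof(buf) - 1] = '\\0';           /* always NUL-terminate */
    printf("%s\\n", buf);
}
"""

# One (source, target) pair as it would be fed to the seq2seq model.
training_pair = (vulnerable.strip(), fixed.strip())
```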
