Abstract

Identifying SBRs (security bug reports) is crucial for eliminating security issues during software development. Machine learning is a promising approach for SBR prediction. However, the effectiveness of state-of-the-art machine learning models depends on high-quality datasets, while gathering large-scale datasets is expensive and tedious. To solve this issue, we propose an automated data labeling approach based on iterative voting classification. It starts with a small group of ground-truth training samples, which can be labeled with the help of authoritative vulnerability records hosted in CVE (Common Vulnerabilities and Exposures). The accuracy of the prediction model is improved with an iterative voting strategy. Using this approach, we label over 80k bug reports from OpenStack and 40k bug reports from Chromium. The correctness of these labels is then manually reviewed by three experienced security testing members. Finally, we construct a large-scale SBR dataset with 191 SBRs and 88,472 NSBRs (non-security bug reports) from OpenStack, and improve the quality of the existing Chromium SBR dataset by identifying 64 new SBRs among previously labeled NSBRs and filtering out 173 noisy bug reports. These shared datasets, together with the proposed dataset construction method, help to promote progress in the SBR prediction research domain.
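
The sketch below illustrates, under our assumptions, how an iterative voting labeling loop of the kind described above could be organized: a small CVE-grounded seed set trains several classifiers, reports on which all voters agree are added to the labeled set, and the process repeats. The specific classifiers, features, thresholds, and stopping criteria are illustrative choices, not the exact configuration used in this work.

```python
# Minimal sketch of iterative voting classification for labeling bug reports.
# Assumes scikit-learn; the voter models, TF-IDF features, unanimity rule, and
# round limit are illustrative assumptions, not the authors' exact settings.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier


def iterative_voting_label(seed_texts, seed_labels, unlabeled_texts, n_rounds=5):
    """Expand a small ground-truth seed set by repeatedly adding unlabeled
    bug reports on which all voting classifiers agree."""
    labeled_texts, labels = list(seed_texts), list(seed_labels)
    pool = list(unlabeled_texts)
    voters = [
        LogisticRegression(max_iter=1000),
        MultinomialNB(),
        RandomForestClassifier(n_estimators=100),
    ]

    for _ in range(n_rounds):
        if not pool:
            break
        # Re-fit features and voters on the current (growing) labeled set.
        vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
        X_train = vectorizer.fit_transform(labeled_texts)
        X_pool = vectorizer.transform(pool)
        votes = np.array(
            [clf.fit(X_train, labels).predict(X_pool) for clf in voters]
        )  # shape: (n_voters, n_pool)

        unanimous = (votes == votes[0]).all(axis=0)  # all voters agree
        if not unanimous.any():
            break
        # Promote unanimously labeled reports from the pool to the training set.
        labeled_texts += [t for t, u in zip(pool, unanimous) if u]
        labels += votes[0][unanimous].tolist()
        pool = [t for t, u in zip(pool, unanimous) if not u]

    return labeled_texts, labels, pool
```

In this sketch, reports left in the pool after the final round are the ones the voters never agreed on; those would be candidates for the manual review step mentioned above.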
