Effective vulnerability detection is essential to the security of software systems. However, class imbalance in large datasets often causes models to overfit to non-vulnerable code and to perform poorly on vulnerable code. Traditional methods of collecting vulnerable data frequently fail to capture the complexities of real-world scenarios. This paper proposes a mutation-based data enhancement approach to address this challenge, focusing on capturing the essential traits of vulnerable source code. Our approach systematically extracts traits from a large body of vulnerable source code and applies mutation operators to introduce high-level alterations. We use a diversity index to evaluate whether successive mutation rounds converge, so that the enhancements remain consistent. By selecting the most effective mutation operators for subsequent model training, our approach achieves substantial accuracy improvements across diverse datasets and deep neural network models. This work represents the initial version of our approach, with continuous refinements underway to facilitate practical implementation and address real-world security challenges.
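The idea of running mutation rounds until a diversity index stabilizes can be illustrated with a minimal sketch. The abstract does not say which index or which operators are used, so everything here is an assumption made for illustration: the Shannon index over distinct variants stands in for the diversity index, `swap_operands` is a hypothetical token-level operator, and `run_rounds`, `tol`, and `max_rounds` are invented names rather than the authors' implementation.

```python
import math
import random
from collections import Counter


def shannon_diversity(samples):
    """Shannon index over distinct variants (assumed stand-in for the paper's index)."""
    counts = Counter(tuple(s) for s in samples)
    total = sum(counts.values())
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def swap_operands(tokens, rng):
    """Hypothetical mutation operator: swap the operands around one '+' token."""
    positions = [i for i, t in enumerate(tokens) if t == "+" and 0 < i < len(tokens) - 1]
    mutated = list(tokens)
    if positions:
        i = rng.choice(positions)
        mutated[i - 1], mutated[i + 1] = mutated[i + 1], mutated[i - 1]
    return mutated


def run_rounds(seeds, operator, max_rounds=20, tol=1e-2, seed=0):
    """Add one mutant per seed each round; stop once the index changes by less than tol."""
    rng = random.Random(seed)
    pool = [list(s) for s in seeds]
    history = [shannon_diversity(pool)]
    for _ in range(max_rounds):
        # Each mutant is derived from a randomly chosen member of the current pool.
        pool += [operator(rng.choice(pool), rng) for _ in seeds]
        history.append(shannon_diversity(pool))
        if abs(history[-1] - history[-2]) < tol:
            break
    return pool, history


# Toy "vulnerable" snippets, already tokenized (illustrative data only).
SEEDS = [
    ["buf", "[", "i", "+", "n", "]"],
    ["memcpy", "(", "dst", ",", "src", ",", "len", "+", "off", ")"],
]

if __name__ == "__main__":
    pool, history = run_rounds(SEEDS, swap_operands)
    print(len(pool), [round(h, 3) for h in history])
```

In this sketch the augmented pool grows by one mutant per seed in each round, and the loop ends when the index stops moving by more than `tol` or when `max_rounds` is reached. Comparing the final index across candidate operators would be one plausible way to rank them before model training, though the abstract does not describe the actual selection criterion.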