Software Fault Prediction for Imbalanced Data: A Survey on Recent Developments

Sanchita Pandey, Kuldeep Kumar

doi:10.1016/j.procs.2023.01.159

Abstract

Software fault prediction is the task of identifying fault-prone modules in a software system. Predicting faults at an early stage helps manage the resources and time required for software testing and maintenance: the identified modules can be fixed ahead of time, saving time and money near the end of the software development process. Over the years, various supervised machine learning techniques for fault prediction have been proposed. The accuracy of these models depends on the training datasets. The models are built and trained on a labeled dataset consisting of multiple independent variables, such as lines of code, software complexity, and software size, and a binary dependent variable (true or false). However, a fault dataset may suffer from problems such as class overlap, class imbalance, and null values, so recent research in software fault prediction focuses on data quality. An imbalanced dataset is one in which one class (the majority class) has far more instances than the other (the minority class). Models built on imbalanced datasets are biased toward the majority class, which results in inaccurate predictions; balancing the dataset is therefore important. This paper discusses the most recent software fault prediction algorithms that address the class imbalance problem. It also presents a comparison that can help researchers select the most suitable fault prediction techniques for different datasets and algorithms. According to the survey, SMOTE is the most commonly used data sampling technique for dealing with data quality issues.
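As an illustration of the oversampling idea behind SMOTE, the following is a minimal sketch, assuming numeric features and Euclidean distance. The function name `smote`, its parameters, and the module metric values are hypothetical and are not taken from the surveyed studies; a real study would more likely rely on an established library implementation.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical module metrics: (lines of code, cyclomatic complexity)
non_faulty = [(120, 4), (80, 2), (200, 6), (150, 5),
              (95, 3), (60, 2), (110, 4), (175, 5)]
faulty = [(900, 30), (750, 25), (1100, 40)]

# Oversample the minority (faulty) class until both classes are the same size
new_faulty = smote(faulty, n_new=len(non_faulty) - len(faulty))
balanced_faulty = faulty + new_faulty
```

Each synthetic sample lies on the line segment between two existing faulty modules, so the minority class grows without exact duplication of existing rows. In practice, oversampling of this kind is applied to the training split only, so that evaluation data remains untouched.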

Full Text