Abstract

Identifying defective software entities is essential to ensuring software quality during development. However, the high dimensionality and class imbalance of software defect data seriously degrade defect prediction performance. To address this problem, this paper proposes an Ensemble MultiBoost based on RIPPER classifier for prediction of imbalanced Software Defect data, called EMR_SD. First, the algorithm uses principal component analysis (PCA) to extract the most effective features from the original features of the data set, reducing dimensionality and removing redundancy. Next, a combined sampling method, adaptive synthetic sampling (ADASYN) together with random sampling without replacement, is applied to address class imbalance. The RIPPER base classifier then builds rules relating attributes to classes, and MultiBoost is used to reduce bias and variance and thereby lower the classification error. The proposed prediction model is evaluated experimentally on the public NASA MDP data sets and compared with existing similar algorithms. The results show that EMR_SD outperforms DNC, CEL and other defect prediction techniques on most evaluation indicators, which demonstrates the effectiveness of the algorithm.

Highlights

  • Software quality [1] is considered to be extremely important in the field of software engineering

  • This paper considers both the data view and the algorithm view. First, the data are processed with principal component analysis (PCA) and a combined sampling method (adaptive synthetic sampling, ADASYN, plus random sampling without replacement) to address redundancy and class imbalance in software defect data. The rule-based RIPPER algorithm is then used as the base classifier of MultiBoost ensemble learning to build a software defect prediction model with improved prediction performance and efficiency

  • Data sets: the experiments use the MDP data set from NASA, which is widely used in software defect prediction research [35]

Summary

INTRODUCTION

He et al.: Ensemble MultiBoost Based on RIPPER Classifier for Prediction of Imbalanced Software Defect Data

Software quality [1] is considered to be extremely important in the field of software engineering. Imbalanced data sets reduce the ability of machine learning algorithms to predict the minority class [3]. This paper considers both the data view and the algorithm view. First, the data are processed with PCA and a combined sampling method (ADASYN plus random sampling without replacement) to address redundancy and class imbalance in software defect data. The rule-based RIPPER algorithm is then used as the base classifier of MultiBoost ensemble learning to build a software defect prediction model with improved prediction performance and efficiency.
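The combined sampling step can be illustrated with a small standard-library sketch. It is not the authors' implementation: this summary does not say how the two samplers are ordered or parameterized, so the code assumes the majority class is first reduced by random sampling without replacement and the minority class is then grown with a simplified ADASYN-style interpolation. The function names `adasyn_like` and `combined_sampling`, and the parameters `k`, `beta`, and `keep_majority`, are hypothetical.

```python
import math
import random


def adasyn_like(minority, majority, k=3, beta=1.0, rng=None):
    """Simplified ADASYN-style oversampling (illustrative, not the paper's code).

    Minority points whose neighbourhood contains more majority points
    receive a larger share of the synthetic samples.
    """
    rng = rng or random.Random(0)
    total = int((len(majority) - len(minority)) * beta)  # synthetic points wanted
    if total <= 0 or len(minority) < 2:
        return []
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    ratios = []
    for i, p in enumerate(minority):
        others = labelled[:i] + labelled[i + 1:]  # every point except p itself
        nearest = sorted(others, key=lambda qy: math.dist(p, qy[0]))[:k]
        # share of majority points among the nearest neighbours of p
        ratios.append(sum(1 for _, y in nearest if y == 0) / len(nearest))
    norm = sum(ratios)
    if norm == 0:  # no minority point borders the majority: spread evenly
        ratios, norm = [1.0] * len(minority), float(len(minority))
    synthetic = []
    for i, (p, r) in enumerate(zip(minority, ratios)):
        count = round(r / norm * total)  # rounding may shift the total slightly
        peers = sorted(minority[:i] + minority[i + 1:],
                       key=lambda q: math.dist(p, q))[:k]
        for _ in range(count):
            q = rng.choice(peers)  # a nearby minority point
            lam = rng.random()     # interpolation factor in [0, 1)
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(p, q)))
    return synthetic


def combined_sampling(minority, majority, keep_majority, k=3, seed=0):
    """Undersample the majority, then oversample the minority (assumed order)."""
    rng = random.Random(seed)
    kept = rng.sample(majority, keep_majority)  # sampling without replacement
    synthetic = adasyn_like(minority, kept, k=k, rng=rng)
    return minority + synthetic, kept


# Toy data: 4 "defective" modules against 20 "non-defective" ones.
toy_minority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2)]
toy_majority = [(1.0 + 0.1 * i, 1.0 + 0.05 * i) for i in range(20)]
toy_new_min, toy_new_maj = combined_sampling(toy_minority, toy_majority, 10)
print(len(toy_new_min), len(toy_new_maj))
```

In practice a maintained implementation, such as the ADASYN sampler in the imbalanced-learn package, would be a more robust choice than this hand-written version; the sketch only makes the mechanics of the two sampling steps visible.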

RELATED WORK
DATA PRE-PROCESSING
BASE CLASSIFIER OF RIPPER
MULTIBOOST ENSEMBLE CLASSIFICATION
EXPERIMENTAL DESIGN AND ANALYSIS OF RESULT
THREATS OF VALIDATION
CONCLUSION
