Abstract

The effectiveness (prediction accuracy) of a classification model is affected by the quality of the training data. High dimensionality and class imbalance are two main problems that can degrade training-data quality, making data preprocessing a very important step in a classification problem. Feature (software metric) selection and data sampling are frequently used to overcome these problems. Feature selection (FS) is the process of selecting the most important attributes from the original dataset. Data sampling copes with class imbalance by adding instances to, or removing instances from, the training dataset. Another method, boosting (building multiple models, with each model tuned to work better on instances misclassified by previous models), has also been found effective for addressing the class imbalance problem. In this study, we investigate two types of FS approaches: individual FS and repetitive sampled FS. Following feature selection, models are built either with a plain learner or with a boosting algorithm in which random undersampling is integrated into AdaBoost. We focus on the impact of the two FS methods (individual FS vs. repetitive sampled FS) and the two model-building processes (boosting vs. plain learner) on software quality prediction. Six feature ranking techniques are examined in the experiment. The results demonstrate that repetitive sampled FS generally outperforms individual FS when a plain learner is used for the subsequent learning process, and that boosting is more effective at improving classification performance than not using boosting.
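The sketch below illustrates the general idea of repetitive sampled feature selection followed by a boosted learner, as described above: a feature ranking technique is applied to several randomly undersampled (class-balanced) copies of the training data, the per-run ranks are aggregated, and a boosting algorithm is trained on the top-ranked metrics. The concrete choices here (chi-square as the ranker, mean-rank aggregation, scikit-learn's AdaBoost, and the toy data) are illustrative assumptions, not the paper's exact setup; in particular, the undersampling-within-boosting variant studied in the paper is not reproduced in this sketch.

```python
# Hypothetical sketch of repetitive sampled feature selection + boosting.
# Assumptions: chi-square ranking, mean-rank aggregation, plain AdaBoost.
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.ensemble import AdaBoostClassifier


def balanced_undersample(X, y, rng):
    """Randomly undersample each class down to the minority-class size."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]


def repetitive_sampled_ranking(X, y, n_runs=10, seed=0):
    """Average chi-square feature ranks over several undersampled copies."""
    rng = np.random.default_rng(seed)
    rank_sum = np.zeros(X.shape[1])
    for _ in range(n_runs):
        Xs, ys = balanced_undersample(X, y, rng)
        scores, _ = chi2(Xs, ys)              # chi2 needs non-negative features
        rank_sum += np.argsort(np.argsort(-scores))  # rank 0 = best this run
    return np.argsort(rank_sum)               # feature indices, best first


# Toy imbalanced data standing in for software metrics and fault labels.
rng = np.random.default_rng(1)
X = np.abs(rng.normal(size=(200, 20)))
y = (rng.random(200) < 0.15).astype(int)

top_k = repetitive_sampled_ranking(X, y)[:5]   # keep the 5 best-ranked metrics
model = AdaBoostClassifier().fit(X[:, top_k], y)
```

Individual FS would correspond to a single call of the ranker on the full training set; the repetitive sampled variant trades extra computation for rankings that are less sensitive to the class imbalance in any one sample.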
