Abstract

Software defect prediction can assist developers in finding potential bugs and reducing maintenance costs. Traditional approaches usually utilize software metrics (Lines of Code, Cyclomatic Complexity, etc.) as features to build classifiers and identify defective software modules. However, software metrics often fail to capture programs’ syntax and semantic information. In this paper, we propose Seml, a novel framework that combines word embedding and deep learning methods for defect prediction. Specifically, for each program source file, we first extract a token sequence from its abstract syntax tree. Then, we map each token in the sequence to a real-valued vector using a mapping table, which is trained with an unsupervised word embedding model. Finally, we use the vector sequences and their labels (defective or non-defective) to build a Long Short-Term Memory (LSTM) network. The LSTM model can automatically learn the semantic information of programs and perform defect prediction. The evaluation results on eight open source projects show that Seml outperforms three state-of-the-art defect prediction approaches on most of the datasets for both within-project defect prediction and cross-project defect prediction.
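The first stage of the pipeline described above extracts a token sequence from a source file's abstract syntax tree. The paper targets Java projects; the sketch below uses Python's standard `ast` module purely to illustrate the mechanics (which node types to keep, and how declarations versus control-flow nodes are tokenized, are assumptions, not the paper's exact rules):

```python
import ast

def extract_token_sequence(source: str) -> list[str]:
    """Walk the AST and keep the kinds of nodes AST-based defect
    prediction approaches typically retain: declarations (kept by
    name) and control-flow / invocation nodes (kept by node type)."""
    tokens = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            tokens.append(node.name)            # declarations keep their name
        elif isinstance(node, (ast.If, ast.For, ast.While,
                               ast.Try, ast.Return, ast.Call)):
            tokens.append(type(node).__name__)  # control flow kept by type
    return tokens

code = """
def read_config(path):
    for line in open(path):
        if line.startswith('#'):
            continue
    return path
"""
seq = extract_token_sequence(code)
print(seq)
```

The resulting token sequence is what the embedding stage consumes; the choice of which node types to keep is a design decision that directly shapes the vocabulary.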

Highlights

  • Software defect prediction techniques are proposed to improve software reliability and reduce software development cost

  • Several machine learning models have been adopted as defect prediction classifiers, such as Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT), Neural Network (NN), etc

  • We present a preprocessing method for tokens extracted from programs’ Abstract Syntax Trees (ASTs) and train a word embedding model in an unsupervised way to map tokens to real-valued vectors, in order to capture semantic similarities of tokens for both within-project defect prediction (WPDP) and cross-project defect prediction (CPDP)
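The mapping table the highlight refers to is, mechanically, a lookup from token to vector. The sketch below illustrates only that lookup, including a shared vector for out-of-vocabulary tokens; in Seml the vectors come from an unsupervised word-embedding model, whereas here they are random stand-ins (the `<UNK>` convention and the dimension are assumptions for illustration):

```python
import random

def build_mapping_table(vocab, dim=8, seed=0):
    """Stand-in for a trained embedding table: token -> real-valued vector.
    Random vectors here only illustrate the lookup mechanics; a real table
    would be trained with a word2vec-style unsupervised model."""
    rng = random.Random(seed)
    table = {tok: [rng.uniform(-1, 1) for _ in range(dim)] for tok in vocab}
    table["<UNK>"] = [0.0] * dim   # shared vector for out-of-vocabulary tokens
    return table

def embed_sequence(tokens, table):
    """Map a token sequence to the vector sequence fed to the LSTM."""
    return [table.get(tok, table["<UNK>"]) for tok in tokens]

table = build_mapping_table(["If", "For", "Call", "Return"])
vectors = embed_sequence(["If", "Call", "unseenToken"], table)
print(len(vectors), len(vectors[0]))
```

Handling unseen tokens matters especially for cross-project prediction, where the target project's vocabulary may differ from the training projects'.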


Summary

INTRODUCTION

Software defect prediction techniques are proposed to improve software reliability and reduce software development cost. Most previous studies leverage manually designed software metrics to build classifiers, and such traditional approaches have made progress in both within-project and cross-project defect prediction. However, they face a challenge: manually designed metrics fail to capture programs’ rich syntax and semantic information, which may limit the performance of defect prediction. For example, two source files with different semantics can share exactly the same metric values (Lines of Code, Cyclomatic Complexity, etc.), so traditional defect prediction approaches cannot tell the difference between them. To capture programs’ syntax and semantic information, Wang et al. [11] proposed a deep learning approach, which leverages a Deep Belief Network (DBN) [12] to learn semantic features from token sequences extracted from programs’ ASTs.
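The limitation of metric-based features can be illustrated with a hypothetical pair of functions (not taken from the paper): both have the same line count and the same cyclomatic complexity, so a metrics-based classifier sees identical feature vectors, yet only one is defective.

```python
# Both functions have identical static metrics (same line count, one
# branch each, cyclomatic complexity 2), so metric-based features
# cannot distinguish them -- but only the second one is defective.

def close_safe(handle):
    if handle is not None:     # guard before use
        handle.close()
    return True

def close_buggy(handle):
    if handle is None:         # inverted check: calls close() on None
        handle.close()
    return True
```

The token sequences of the two functions differ (the condition and the guarded call are swapped), which is exactly the kind of syntactic and semantic signal an AST-token-based model can exploit.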

DEFECT PREDICTION
WORD EMBEDDING
APPROACH
PARSING SOURCE CODE AND EXTRACTING FEATURES
TOKEN EMBEDDING
BUILDING LSTM MODEL AND PERFORMING DEFECT PREDICTION
EVALUATION
DATASETS
EVALUATION METRICS
BASELINES
PARAMETER TUNING
SOFTWARE DEFECT PREDICTION
DEEP LEARNING AND SOFTWARE ENGINEERING
CONCLUSION