Learning Stretch-Shrink Latent Representations With Autoencoder and K-Means for Software Defect Prediction

Viet Anh Phan

doi:10.1109/access.2022.3219589

Abstract

Detecting defective source code so that bugs can be localized and fixed is important for reducing software development effort. Although deep learning models have made a breakthrough in this field, many issues remain unresolved, such as the shortage of labeled data and the small size of defective elements: given two similar programs that differ by only a single operator or statement, one may be clean while the other is defective. To address these issues, this study proposes a new deep learning model that facilitates the learning of distinguishing features. The model comprises three main components: 1) a convolutional neural network-based classifier, 2) an autoencoder, and 3) a k-means clustering component. In our model, the autoencoder assists the classifier in generating latent representations of programs, and the k-means component provides penalty functions that increase the distinguishability among these latent representations. We evaluated the effectiveness of the model in terms of both performance metrics and latent representation quality. Experimental results on four defect prediction datasets show that the proposed model outperforms the baselines, which we attribute to the more discriminative features it generates.
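As a rough illustration of how the three components could interact, the sketch below combines a classification loss, an autoencoder reconstruction loss, and a centroid-based penalty into one objective for a single program. This is not the formulation used in the paper: the abstract does not specify the penalty functions, so the "shrink" term (squared distance to the nearest centroid), the "stretch" term (a hinge that pushes the latent vector away from the other centroids), and the names `cluster_penalty`, `total_loss`, `margin`, `alpha`, and `beta` are all assumptions made for illustration.

```python
import math


def cross_entropy(prob_defective, label):
    """Binary cross-entropy for one program (label: 1 = defective, 0 = clean)."""
    eps = 1e-12
    p = min(max(prob_defective, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reconstruction_loss(original, reconstructed):
    """Mean squared error between the autoencoder input and its output."""
    return squared_distance(original, reconstructed) / len(original)


def cluster_penalty(latent, centroids, margin=1.0):
    """Hypothetical stretch-shrink penalty (an assumption, not the paper's).

    shrink:  squared distance from the latent vector to its nearest centroid.
    stretch: hinge term that is positive whenever another centroid lies
             within `margin` (in squared distance) of the latent vector.
    """
    dists = [squared_distance(latent, c) for c in centroids]
    nearest = min(range(len(centroids)), key=dists.__getitem__)
    shrink = dists[nearest]
    stretch = sum(max(0.0, margin - d)
                  for i, d in enumerate(dists) if i != nearest)
    return shrink + stretch


def total_loss(prob_defective, label, original, reconstructed,
               latent, centroids, alpha=1.0, beta=0.1):
    """Weighted sum of the three terms; alpha and beta are assumed weights."""
    return (cross_entropy(prob_defective, label)
            + alpha * reconstruction_loss(original, reconstructed)
            + beta * cluster_penalty(latent, centroids))
```

In a real training loop the centroids would be re-estimated by k-means over the current latent vectors, and the combined loss would be minimized by gradient descent over the classifier and autoencoder parameters; both steps are omitted here.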

Full Text