Review Sharing via Deep Semi-Supervised Code Clone Detection

Chenkai Guo, Dengrong Huang, Hui Yang, Jingwen Zhu, Naipeng Dong, Jing Xu, Jianwen Zhang

doi:10.1109/access.2020.2966532

Open Access

https://doi.org/10.1109/access.2020.2966532

Abstract

Code review, as a typical type of user feedback, has recently drawn increasing attention for improving code quality. Research on code review normally requires sufficient review data. As a result, recent efforts commonly focus on projects with sufficient reviews (called “s-projects”) rather than projects with extremely few reviews (called “f-projects”). However, statistics from public platforms show that the latter dominate open source software, so novel approaches are needed to support review-based code improvement for them. In this paper, we try to address the problem by building a review sharing channel through which informative reviews can be reasonably delivered from s-projects to f-projects. To ensure the accuracy of shared reviews, we introduce a novel code clone detection model based on a Convolutional Neural Network (CNN), and build suitable “s-projects, f-projects” pairs through the clone detection. In particular, to alleviate the dataset heterogeneity between training and testing, an autoencoder-based semi-supervised learning strategy is employed. Furthermore, to improve the sharing experience, heuristic filtering tactics are applied to reduce the time cost, and an LDA (Latent Dirichlet Allocation)-based ranking algorithm is used to present diverse review themes. We have implemented the sharing channel as a prototype system, RSharer+, which contains three representative modules: data preprocessing, code clone detection and review presentation. The collected datasets are first transformed into context-sensitive numerical vectors in the data preprocessing module. Then, in the clone detection module, the data vectors are trained and tested on BigCloneBench and real code-review pairs. Finally, the presentation module provides review classification and theme extraction for a better sharing experience. Extensive comparative experiments on hundreds of real labelled code fragments demonstrate the precision of clone detection and the effectiveness of review sharing.
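The sharing channel described in the abstract can be illustrated with a minimal sketch. This is not the RSharer+ implementation: the CNN clone detector is replaced here by a simple Jaccard token-overlap score as a stand-in, and the size-ratio filter is only one hypothetical example of a heuristic filtering tactic. The names `Fragment`, `size_filter`, `similarity`, `share_reviews`, and the thresholds are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """A code fragment with the reviews attached to it (hypothetical type)."""
    project: str
    code: str
    reviews: list = field(default_factory=list)


def tokens(code):
    # Crude lexer: identifiers, integers, and single punctuation characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def size_filter(a, b, max_ratio=2.0):
    # Hypothetical heuristic filter: skip pairs whose token counts differ
    # too much, so the (expensive) detector is not run on them.
    la, lb = len(tokens(a)), len(tokens(b))
    if min(la, lb) == 0:
        return False
    return max(la, lb) / min(la, lb) <= max_ratio


def similarity(a, b):
    # Stand-in for the CNN clone detector: Jaccard overlap of token sets.
    ta, tb = set(tokens(a)), set(tokens(b))
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def share_reviews(f_fragment, s_fragments, threshold=0.6):
    # Deliver reviews from s-project fragments that look like clones of the
    # f-project fragment, highest-scoring matches first.
    shared = []
    for s in s_fragments:
        if not size_filter(f_fragment.code, s.code):
            continue
        score = similarity(f_fragment.code, s.code)
        if score >= threshold:
            shared.extend((score, review) for review in s.reviews)
    shared.sort(key=lambda item: item[0], reverse=True)
    return [review for _, review in shared]
```

In this sketch, an f-project fragment receives only the reviews of s-project fragments that pass both the cheap filter and the similarity threshold; in the paper's setting the similarity score would come from the trained semi-supervised CNN model instead.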

Highlights

Code review, as a typical type of user feedback, has lately drawn increasing attention in the research field of code improvement
We randomly select 20k code pairs across the code fragments to participate in the semi-supervised Convolutional Neural Network (CNN) model training
We investigated state-of-the-art code clone detection works, including SourcererCC [29], NiCad [28], Deckard [33], CCLearner [34], and Oreo [66]

Summary

Introduction

Code review, as a typical type of user feedback, has lately drawn increasing attention in the research field of code improvement. When focusing on review-based applications, researchers are prone to overlooking a latent but critical threat: not all software projects have sufficient reviews. The lack of reviews for the majority of code projects seriously limits the usability and effectiveness of existing review-based code analysis. Some review generation solutions have been proposed recently. Xia et al. [10], [11] proposed DeepCom, which adopts a deep encoder-decoder mechanism to automatically generate code comments for Java

The associate editor coordinating the review of this manuscript and approving it for publication was Xiaobing Sun.

Objectives

Methods

Conclusion