Automatic Identification of Similar Pull-Requests in GitHub’s Repositories Using Machine Learning

Hamzeh Eyal Salman, Abdelhak-Djamel Seriai, Zakarea Alshara

doi:10.3390/info13020073

Abstract

Context: On social coding platforms such as GitHub, contributors frequently use the pull-request mechanism to submit code changes to the reviewers of a given repository. In general, these changes either add a new feature or fix an existing bug. However, this mechanism is distributed, so different contributors can unintentionally submit similar pull-requests that perform similar development activities. Similar pull-requests may then be reviewed in parallel by different reviewers, which wastes reviewing time and effort and complicates the collaboration process. Objective: It is therefore useful to assign similar pull-requests to the same reviewer, who can then decide which pull-request to accept with less time and effort. In this article, we propose grouping similar pull-requests into clusters so that each cluster is assigned to the same reviewer or reviewing team, which saves reviewing time and effort. Method: We first extract descriptive textual information from the content of pull-requests. We then use the extracted information to compute similarities among pull-requests. Finally, machine learning algorithms (K-Means clustering and agglomerative hierarchical clustering) group similar pull-requests together. Results: To validate our proposal, we applied it to twenty popular repositories from a public dataset. The experimental results show that the proposed approach achieves promising results according to the well-known metrics for this task, precision and recall, and that it helps save reviewer time and effort.
Conclusion: According to the obtained results, the K-Means algorithm achieves average precision and recall of 94% and 91%, respectively, over all considered repositories, while agglomerative hierarchical clustering achieves average precision and recall of 93% and 98%, respectively. Moreover, the proposed approach saves between 67% and 91% of reviewing time and effort on average with the K-Means algorithm and between 67% and 83% with the agglomerative hierarchical clustering algorithm.
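The three steps named under Method (extract text, compute similarities, cluster) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how text is weighted or compared, so the TF-IDF weighting, the cosine similarity, the average-linkage merge rule, and the four sample pull-request titles are all assumptions made for illustration. Only the agglomerative variant is shown, and the number of clusters stands in for the number of available reviewers.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a PR title/description and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Assumption: smoothed IDF, chosen only for this illustration.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of PR indices."""
    sim = [[cosine(u, v) for v in vectors] for u in vectors]
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None  # (similarity, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                s = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    return clusters


# Hypothetical PR titles, invented for this example.
prs = [
    "Fix null pointer crash in login handler",
    "Fix crash in login handler when user is null",
    "Add dark mode theme to settings page",
    "Add dark theme option to the settings page",
]
vectors = tfidf_vectors(prs)
clusters = agglomerative(vectors, n_clusters=2)  # e.g. two reviewers
print(clusters)  # [[0, 1], [2, 3]]
```

Each resulting cluster would then be routed to a single reviewer or reviewing team, so that overlapping submissions such as the two login-crash fixes are compared side by side rather than reviewed independently.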

Highlights

Our proposal makes the following contributions: 1. An automatic approach to cluster similar pull-requests (PRs) together using two supervised and unsupervised machine learning (ML) algorithms, considering the number of reviewers or the repository owner's preferences.

Summary

Introduction

On social coding platforms such as GitHub, contributors (developers) frequently use the Pull-Request (PR) mechanism to submit their code changes to the reviewers or owners of a given software project (repository) [1,2]. These changes include development activities (e.g., adding new functional features) [3,4], fixes for errors in an existing project [5], and improvements (in terms of performance, usability, reliability, and so on). Each contributor independently clones or forks the original repository and makes their changes in that fork [6]. The contributor then creates a PR to package and submit those changes to the core reviewers of the original repository. The content of a submitted PR passes through several rounds of code review and discussion. PRs let contributors share their changes and additional information (e.g., build and test results) with other team members, who can review and discuss the changes before accepting them, commenting on them, or rejecting them. Once the changes are approved, they are merged into the main repository [17].
