DupHunter: Detecting Duplicate Pull Requests in Fork-Based Development

He Jiang, Rong Chen, Shikai Guo, Hui Li, Yulong Li, Xiaochen Li, Tao Zhang

doi:10.1109/tse.2023.3235942

Abstract

The emergence of numerous fork-based development platforms facilitates the development of Open-Source Software (OSS) projects. Developers across the world can fork software projects and submit their Pull Requests (PRs) to the projects. However, as the number of forks increases, numerous duplicate PRs might be submitted. These duplicate PRs may cause extra code review workload and frustrate developers working on the projects. To detect duplicate PRs, many approaches have been proposed, which analyze the similarity of different elements in PRs. However, previous approaches still suffer from unsatisfactory detection accuracy due to two challenges: they ignore the syntactic structural information of text elements in PRs, and they lack joint reasoning between different elements of two PRs. In this study, we propose an automated duplicate PRs detector named DupHunter (Duplicate PRs Hunter), which includes a graph embedding component and a duplicate PRs detection component to address the above challenges. The graph embedding component uses a feature graph to represent a PR. It encodes the syntactic structure and semantics of text elements (e.g., the title and the description), as well as the knowledge of non-text elements (e.g., the submission time), to address the syntactic structural information challenge. The duplicate PRs detection component tackles the joint reasoning challenge using a graph matching network, which enables information exchange and matching across different elements of two feature graphs with an attention coefficient mechanism.
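The idea of exchanging information across two feature graphs through attention can be illustrated with a small sketch. This is a generic cross-graph attention step in the style of graph matching networks, not DupHunter's actual model: the two-dimensional node embeddings, the dot-product similarity, and the names `cross_graph_attention`, `pr_a`, and `pr_b` are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_graph_attention(nodes_a, nodes_b):
    """For each node of graph A, attend over all nodes of graph B.

    Returns one (weights, difference) pair per node of A, where
    `weights` are the attention coefficients over B's nodes and
    `difference` is the node embedding minus its attention-weighted
    counterpart in B (small when a close match exists).
    """
    results = []
    for h_i in nodes_a:
        weights = softmax([dot(h_i, h_j) for h_j in nodes_b])
        attended = [
            sum(w * h_j[k] for w, h_j in zip(weights, nodes_b))
            for k in range(len(h_i))
        ]
        difference = [a - b for a, b in zip(h_i, attended)]
        results.append((weights, difference))
    return results


# Hypothetical toy embeddings for the elements of two PRs.
pr_a = {"title": [1.0, 0.0], "description": [0.0, 1.0]}
pr_b = {"title": [0.9, 0.1], "description": [0.1, 0.9], "time": [0.5, 0.5]}

matches = cross_graph_attention(list(pr_a.values()), list(pr_b.values()))
for name, (weights, difference) in zip(pr_a, matches):
    print(name, [round(w, 3) for w in weights], [round(d, 3) for d in difference])
```

In this toy setup, each element of the first PR places its largest attention coefficient on the most similar element of the second PR, which is the kind of element-to-element matching the abstract describes.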
Experiments on 26 open-source projects show that DupHunter achieves an average F1-score@1 value of 0.650, significantly outperforming the state-of-the-art approaches by 3.2% to 48.1%. DupHunter can accurately detect duplicate PRs, with an average Precision@1 value of 0.922 and an average Recall@1 value of 0.502.
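As a quick sanity check on the reported numbers, the sketch below computes the harmonic mean of the average Precision@1 and Recall@1. It assumes the standard F1 definition; the function name `f1_score` is illustrative. An average of per-project F1 scores need not equal the F1 of the averaged precision and recall, so agreement here is only approximate.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Average values reported in the abstract.
f1 = f1_score(0.922, 0.502)
print(round(f1, 3))
```

The result lands within rounding distance of the reported average F1-score@1 of 0.650, and shows that the score is held down by recall rather than precision.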

Journal: IEEE Transactions on Software Engineering
Publication Date: Apr 1, 2023
Citations: 1

Similar Papers

Code Reviewer Intelligent Prediction in Open Source Industrial Software Project
Zhifang Liao ... Bolin Zhang
Computer Modeling in Engineering & Sciences | VOL. 137
01 Jan 2023

Core-reviewer recommendation based on Pull Request topic model and collaborator social network
Zhifang Liao ... Xiaoping Fan
Soft Computing | VOL. 24
15 Jul 2019

Opportunities and Challenges in Repeated Revisions to Pull-Requests: An Empirical Study
Zhixing Li ... Huaimin Wang
Proceedings of the ACM on Human-Computer Interaction | VOL. 6
07 Nov 2022

Studying the impact of adopting continuous integration on the delivery time of pull requests
João Helis Bernardo ... Daniel Alencar Da Costa
28 May 2018
