Abstract

Automatically identifying the differences between two versions of a file is a common and basic task in many applications that mine code repositories. Git, a version control system, provides a diff utility, and users can choose among diff algorithms ranging from the default Myers algorithm to the more advanced Histogram algorithm. From our systematic mapping, we identified three popular applications of diff in recent studies. Regarding the impact on code churn metrics in 14 Java projects, we obtained different values in 1.7% to 8.2% of commits depending on the diff algorithm used. Regarding bug-introducing change identification, we found that 6.0% and 13.3% of the identified bug-fix commits in 10 Java projects had different results of bug-introducing changes. For patch application, our manual analysis found that Histogram is more suitable than Myers for presenting code changes. Thus, we strongly recommend using the Histogram algorithm when mining Git repositories to analyze differences in source code.

Highlights

  • The diff utility calculates and displays the differences between two files, and is typically used to investigate the changes between two versions of the same file

  • Since understanding and measuring changes in software artifacts is essential in empirical software engineering research, diff is commonly used in various topics, such as defect prediction where code churn (Nagappan and Ball 2005; Shin et al 2011) and process metrics (Madeyski and Jureczko 2015; Kamei and Shihab 2016) are used

  • Following previous related studies, we investigate code changes in the files of 14 open source software (OSS) projects that employ Continuous Integration (for metrics collection) and 10 Apache projects (for bug-introduction identification) to quantify how the diff outputs of the two algorithms, Myers and Histogram, differ

Summary

Introduction

The diff utility calculates and displays the differences between two files, and is typically used to investigate the changes between two versions of the same file. Since understanding and measuring changes in software artifacts is essential in empirical software engineering research, diff is commonly used in various topics, such as defect prediction where code churn (Nagappan and Ball 2005; Shin et al 2011) and process metrics (Madeyski and Jureczko 2015; Kamei and Shihab 2016) are used, code authorship (Rahman and Devanbu 2011; Meng et al 2013), clone genealogy (Kim et al 2005; Duala-Ekoko and Robillard 2007), and empirical studies of changes (Barr et al 2014; Ray et al 2015). Git, a version control system, offers a diff utility that lets users select the diff algorithm. Git offers four diff algorithms, namely, Myers, Minimal, Patience, and Histogram. Myers is used as the default algorithm.
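To make the algorithm choice concrete, the following is a minimal Python sketch of how added and deleted line counts (a simple form of code churn) could be collected under different diff algorithms. It relies on Git's `--diff-algorithm` and `--numstat` options; the function names `numstat_command`, `parse_numstat`, and `churn_by_algorithm` are hypothetical helpers introduced for illustration and are not the tooling used in the study.

```python
import subprocess

# The four values Git accepts for --diff-algorithm.
ALGORITHMS = ("myers", "minimal", "patience", "histogram")


def numstat_command(algorithm, old_rev, new_rev):
    """Build a `git diff --numstat` command for one diff algorithm."""
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown diff algorithm: {algorithm}")
    return ["git", "diff", f"--diff-algorithm={algorithm}",
            "--numstat", old_rev, new_rev]


def parse_numstat(output):
    """Sum added and deleted line counts from `git diff --numstat` output.

    Each line has the form "<added>\t<deleted>\t<path>"; binary files
    are reported as "-\t-\t<path>" and are skipped here.
    """
    added = deleted = 0
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3 or parts[0] == "-":
            continue
        added += int(parts[0])
        deleted += int(parts[1])
    return added, deleted


def churn_by_algorithm(repo_dir, old_rev, new_rev,
                       algorithms=("myers", "histogram")):
    """Run git once per algorithm and return {algorithm: (added, deleted)}.

    Assumes `git` is on PATH and `repo_dir` is a Git working tree.
    """
    results = {}
    for algorithm in algorithms:
        completed = subprocess.run(
            numstat_command(algorithm, old_rev, new_rev),
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
        results[algorithm] = parse_numstat(completed.stdout)
    return results
```

Calling `churn_by_algorithm(".", "HEAD~1", "HEAD")` inside a repository would return one `(added, deleted)` pair per algorithm; a commit whose pairs differ is one where the choice of diff algorithm changes the measured churn.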
