Context: In Empirical Software Engineering, it is crucial to work with representative samples that reflect the current state of the software industry. An important consideration, especially in rapidly changing fields like software development, is that a sample collected years ago must still represent the same population today if it is to produce generalizable results. However, a software sample built several years ago seldom depicts the current state of the development industry accurately. Nevertheless, many recent studies rely on rather old datasets (seven or more years old) to conduct their investigations.
Objective: To analyze the evolution of a population of open-source projects, determine the likelihood of detecting significant differences over time, and study the activity history of the projects.
Method: We performed a longitudinal study with 72 snapshots of quality projects from GitHub, covering the period between July 1st, 2017 and June 1st, 2023. We recorded monthly values of seven repository metrics (contributors, commits, closed pull requests, merged pull requests, closed issues, stars, and forks), encompassing data from a total of 1991 repositories.
Results: We observed significant changes in all the metrics evaluated, with most cases showing negligible to small effect sizes. Notably, merged pull requests registered medium effect sizes. The evolution was not uniform across metrics; however, after five years a sample of projects was unlikely to remain representative for any of the analyzed metrics, with probabilities below 25%.
Conclusion: Although the temporal validity of a sample depends on the specific data being studied, employing datasets created several years ago does not appear to be a sound strategy if the aim is to produce results that can be extrapolated to the current state of the population.
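To illustrate how the difference between two snapshots of a metric could be graded as negligible, small, or medium, the following is a minimal sketch. It assumes Cliff's delta as the effect size measure and the commonly cited thresholds of 0.147, 0.33, and 0.474; the abstract names neither, so both are assumptions, and the function names and sample values are hypothetical.

```python
from itertools import product


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all (x, y) pairs.

    Assumption: the abstract does not state which effect size was used.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))


def magnitude(delta):
    """Label |delta| using commonly cited thresholds (an assumption here)."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"


# Hypothetical values of one metric for the same projects in two snapshots.
older_snapshot = [1, 2, 3, 4]
newer_snapshot = [3, 4, 5, 6]

delta = cliffs_delta(newer_snapshot, older_snapshot)
print(delta, magnitude(delta))
```

Cliff's delta is non-parametric, which suits skewed repository metrics such as stars or forks, but any other effect size could be substituted in `cliffs_delta` without changing the rest of the sketch.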