Abstract

Background. Software testing is a critical activity for ensuring the quality and reliability of software systems. To evaluate the effectiveness of different test suites, researchers have developed a variety of metrics.

Problem. Comparing these metrics is challenging because there is no standardized evaluation framework that accounts for the relevant factors comprehensively. As a result, researchers often focus on a single factor (e.g., test suite size), which leads to divergent or even contradictory conclusions. After comparing dozens of studies in detail, we found the two problems that trouble our community most: (1) researchers tend to oversimplify the description of the ground truth they use, and (2) data involving real defects is not suitable for analysis with traditional statistical indicators.

Objective. We aim to scrutinize the whole process of comparing test suites for our community.

Method. To this end, we propose ASSENT (evAluating teSt Suite EffectiveNess meTrics), a framework that guides follow-up research on evaluating test suite effectiveness metrics. ASSENT consists of three fundamental components: ground truth, benchmark test suites, and an agreement indicator. It works as follows. First, users clarify the ground truth that determines the real order in effectiveness among test suites. Second, users generate a set of benchmark test suites and derive their ground-truth order in effectiveness. Third, users apply the metric under evaluation to derive an order in effectiveness for the same test suites. Finally, users calculate the agreement indicator between the ground-truth order and the metric-derived order.

Result. With ASSENT, we are able to compare the accuracy of different test suite effectiveness metrics. We apply ASSENT to evaluate representative metrics, including mutation score and code coverage metrics. Our results show that, based on real faults, mutation score and subsuming mutation score are the best metrics for quantifying test suite effectiveness. Meanwhile, using mutants instead of real faults overestimates test effectiveness by more than 20%.

Conclusion. We recommend the standardized evaluation framework ASSENT for evaluating and comparing test effectiveness metrics in future work.
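
The agreement indicator in ASSENT's final step quantifies how closely the metric-derived order matches the ground-truth order. The abstract does not name a specific indicator, so the sketch below assumes Kendall's tau as one plausible choice; the suite names and effectiveness scores are hypothetical, for illustration only.

```python
# Minimal sketch of ASSENT's final step, assuming Kendall's tau as the
# agreement indicator. Suite names and scores are hypothetical.
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Agreement between two effectiveness orderings of the same suites.

    scores_a, scores_b: dicts mapping test-suite id -> effectiveness score
    (e.g., real-fault detection rate for the ground truth, and mutation
    score or coverage for the metric under evaluation).
    Returns a value in [-1, 1]; 1 means the metric ranks the suites
    exactly as the ground truth does.
    """
    suites = list(scores_a)
    concordant = discordant = 0
    for s, t in combinations(suites, 2):
        da = scores_a[s] - scores_a[t]
        db = scores_b[s] - scores_b[t]
        if da * db > 0:
            concordant += 1   # both orderings rank the pair the same way
        elif da * db < 0:
            discordant += 1   # the orderings disagree on this pair
        # ties in either ordering contribute to neither count
    pairs = len(suites) * (len(suites) - 1) // 2
    return (concordant - discordant) / pairs

# Hypothetical benchmark suites: ground-truth order from real faults vs.
# the order induced by a candidate metric (e.g., statement coverage).
ground_truth = {"T1": 0.90, "T2": 0.75, "T3": 0.40}
metric_order = {"T1": 0.85, "T2": 0.88, "T3": 0.50}
print(kendall_tau(ground_truth, metric_order))  # agreement indicator
```

In this toy example the candidate metric swaps the top two suites relative to the ground truth, so the indicator drops below 1, signaling an imperfect metric.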
