Abstract

Many educators now include software testing activities in programming assignments, so there is a growing demand for appropriate methods of assessing the quality of student-written software tests. While tests can be hand-graded, some educators also use objective performance metrics to assess software tests. The most common metrics used at present are code coverage measures—tracking how much of the student’s code (in terms of statements, branches, or some combination) is exercised by the corresponding software tests. Code coverage has limitations, however, and it sometimes overestimates the true quality of the tests. Some researchers have suggested that mutation analysis may provide a better indication of test quality, while some educators have experimented with simply running every student’s test suite against every other student’s program—an “all-pairs” strategy that gives more insight into the quality of the tests. However, it is still unknown which of these measures most closely predicts the true bug-revealing capability of a given test suite. This paper directly compares all three methods of measuring test quality in terms of how well they predict the observed bug-revealing capabilities of student-written tests when run against a naturally occurring collection of student-produced defects. Experimental results show that all-pairs testing—running each student’s tests against every other student’s solution—is the most effective predictor of the underlying bug-revealing capability of a test suite. Further, no strong correlation was found between bug-revealing capability and either code coverage or mutation analysis scores.
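To make the “all-pairs” strategy concrete, the sketch below shows one way it could be automated. It is a minimal illustration only, not the paper’s actual infrastructure: it assumes a hypothetical directory layout (submissions/<student>/ containing solution.py and a pytest-style test_solution.py) and Python/pytest tooling, and it scores each test suite by the fraction of other students’ solutions it causes to fail.

```python
import os
import subprocess
import sys
from pathlib import Path

# Hypothetical layout (illustrative names, not taken from the paper):
#   submissions/<student>/solution.py       -- the student's program
#   submissions/<student>/test_solution.py  -- the student's test suite
SUBMISSIONS = Path("submissions")


def suite_passes(tester: str, subject: str) -> bool:
    """Run tester's test suite against subject's solution in a subprocess.

    Returns True if every test passes, i.e. the suite reveals no bug in
    that solution. Assumes the tests import the module named `solution`,
    which PYTHONPATH resolves to the subject's directory.
    """
    env = {**os.environ, "PYTHONPATH": str(SUBMISSIONS / subject)}
    result = subprocess.run(
        [sys.executable, "-m", "pytest",
         str(SUBMISSIONS / tester / "test_solution.py")],
        env=env,
        capture_output=True,
    )
    return result.returncode == 0


def all_pairs_scores(students: list[str]) -> dict[str, float]:
    """Score each suite by the fraction of *other* solutions it fails."""
    scores: dict[str, float] = {}
    for tester in students:
        others = [s for s in students if s != tester]
        revealed = sum(not suite_passes(tester, other) for other in others)
        scores[tester] = revealed / len(others) if others else 0.0
    return scores


if __name__ == "__main__":
    students = sorted(p.name for p in SUBMISSIONS.iterdir() if p.is_dir())
    for student, score in all_pairs_scores(students).items():
        print(f"{student}: reveals bugs in {score:.0%} of other solutions")
```

Running each pairing in a separate subprocess keeps one crashing or non-terminating suite from corrupting the rest of the matrix; a production grader would also add per-run timeouts and sandboxing.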
