Abstract

Automated program assessment systems have been widely adopted in many universities. Many of these systems judge the correctness of student programs by comparing their actual outputs with predefined expected outputs for selected test inputs. A common weakness of such systems is that student programs are marked as incorrect whenever their outputs deviate from the predefined ones, even if the deviations are minor, insignificant, and would be accepted by a human assessor as satisfying the specification. This critical weakness causes undue frustration to students and undesirable pedagogical consequences that undermine these systems' benefits. To address this issue, we developed an improved mechanism for program output comparison that serves as a versatile test oracle and brings the results of automated assessment much closer to those of human assessors. We evaluated the new mechanism in real programming classes using an existing automated program assessment system. The new mechanism achieved zero false-positive error (it did not wrongly accept any incorrect output) and very low (0%–0.02%) false-negative error (wrongly rejecting correct outputs), with very high accuracy (99.8%–100%) in correctly recognizing outputs deemed acceptable by instructors. This is a major improvement over an existing assessment mechanism, which had a 56.4%–64.1% false-negative error and an accuracy of only 25.4%–40.9%. Moreover, about 67%–96% of students achieved their best results on their first attempt, which can encourage them and reduce their frustration. Furthermore, students generally welcomed the new assessment mechanism and agreed that it was beneficial to their learning.
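The abstract does not describe the comparison mechanism's implementation; as a minimal sketch of the general idea of a tolerant output comparator (normalizing whitespace and letter case and allowing a small numeric tolerance, rather than requiring an exact textual match), one might write something like the following. The function name `outputs_match` and the `float_tol` parameter are illustrative assumptions, not the authors' actual design.

```python
import re


def outputs_match(expected: str, actual: str, float_tol: float = 1e-6) -> bool:
    """Return True if the actual output matches the expected output,
    ignoring insignificant deviations: blank lines, extra whitespace,
    letter case, and small differences in numeric values."""

    def normalize(text: str) -> list[list[str]]:
        # Split into non-blank lines, then into lowercase tokens.
        return [re.split(r"\s+", line.strip().lower())
                for line in text.strip().splitlines() if line.strip()]

    exp_lines, act_lines = normalize(expected), normalize(actual)
    if len(exp_lines) != len(act_lines):
        return False
    for exp_tokens, act_tokens in zip(exp_lines, act_lines):
        if len(exp_tokens) != len(act_tokens):
            return False
        for e, a in zip(exp_tokens, act_tokens):
            try:
                # Numeric tokens are compared with a small tolerance.
                if abs(float(e) - float(a)) > float_tol:
                    return False
            except ValueError:
                # Non-numeric tokens must match exactly (case-insensitive).
                if e != a:
                    return False
    return True


# Example: accepted despite spacing, case, and rounding differences.
print(outputs_match("Sum = 3.14159\n", "  sum =  3.141590 "))  # True
print(outputs_match("Sum = 3.14159\n", "sum = 3.20000"))       # False
```

A comparator along these lines would accept outputs that a human assessor would judge equivalent to the expected output while still rejecting genuinely incorrect results.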
