[This paper is part of the Focused Collection in Artificial Intelligence Tools in Physics Teaching and Physics Education Research.] Using a high-stakes thermodynamics exam as the sample (252 students, four multipart problems), we investigate the viability of four workflows for AI-assisted grading of handwritten student solutions. We find that the greatest challenge lies in converting handwritten answers into a machine-readable format. The granularity of the grading criteria also influences grading performance: applying a fine-grained rubric to an entire problem often leads to errors and grading failures, as the model appears unable to keep track of scores for more than a handful of rubric items, whereas grading problems part by part is more reliable but tends to miss nuances. We also find that grading hand-drawn graphics, such as process diagrams, is less reliable than grading mathematical derivations because of the difficulty of distinguishing essential details from extraneous information. Although the system is precise in identifying exams that meet passing criteria, exams with failing grades still require human grading. We conclude with recommendations for overcoming some of the encountered challenges.
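The abstract contrasts grading an entire problem against one long, fine-grained rubric with grading it part by part against short rubrics. As an illustration only, not the workflows evaluated in the paper, the sketch below grades each scanned part separately with a small rubric; the model name, rubric text, file names, and the grade_part helper are hypothetical, and the OpenAI Python SDK is assumed as the client.

```python
"""Illustrative sketch (not the authors' implementation): part-by-part
grading of a handwritten solution with a vision-capable LLM, one short
rubric per part. Model choice, rubrics, and file names are hypothetical."""
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_part(image_path: str, rubric: str, max_points: int) -> str:
    """Send one scanned solution part plus a short rubric; return the model's scoring text."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice of a vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"Grade this handwritten solution part out of {max_points} points "
                          f"using the rubric below. Report points per rubric item.\n\n{rubric}")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Grading part by part keeps each rubric to a handful of items, rather than
# asking the model to track a long, fine-grained rubric for the whole problem.
rubrics = {
    "part_a.png": ("First-law setup correct (2 pts); sign convention (1 pt)", 3),
    "part_b.png": ("Isothermal work integral set up (2 pts); evaluated correctly (2 pts)", 4),
}
for image, (rubric, points) in rubrics.items():
    print(image, "->", grade_part(image, rubric, points))
```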