Assessing students' answers is a labor-intensive task that can place a significant burden on educators. Technological advances have led to the development of automatic scoring systems. However, handwriting recognition errors significantly impede the deployment of automatic scoring in paper-pencil assessment, which remains the most widely used format for academic exams. Large language models (LLMs) have proven effective in scoring tasks and are promising for automatic short answer grading. Nevertheless, LLMs' capability to grade short answers containing handwriting recognition errors is underexplored. The current study addressed this issue in the context of paper-pencil Chinese tests in elementary schools. The LLM used was ERNIE 4.0, chosen for its outstanding capability in Chinese language comprehension. We compared the model's grading accuracy on raw data extracted from handwritten answers by optical character recognition with its accuracy on preprocessed data in which the recognition errors had been corrected. We found a substantial accuracy difference between the raw and preprocessed data, indicating that LLMs have not yet achieved the precision needed to grade short answers containing handwriting recognition errors. Nevertheless, the LLM exhibited an interesting characteristic during grading: awarding points to incorrect answers as an acknowledgment of students' effort.