Applying large language models and chain-of-thought for automatic scoring

Gyeong-Geon Lee, Ehsan Latif, Xuansheng Wu, Ninghao Liu, Xiaoming Zhai

doi:10.1016/j.caeai.2024.100213

Abstract

This study investigates the application of large language models (LLMs), specifically GPT-3.5 and GPT-4, with Chain-of-Thought (CoT) in the automatic scoring of student-written responses to science assessments. We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools among researchers and educators. With a testing dataset comprising six assessment tasks (three binomial and three trinomial) with 1,650 student responses, we employed six prompt engineering strategies to automatically score student responses. The six strategies combined zero-shot or few-shot learning with CoT, either alone or alongside item stem and scoring rubrics, and were developed using a novel approach, WRVRT (prompt writing, reviewing, validating, revising, and testing). Results indicated that few-shot learning (acc = 0.67) outperformed zero-shot learning (acc = 0.60), a 12.6% increase. CoT, when used without item stem and scoring rubrics, did not significantly affect scoring accuracy (acc = 0.60). However, CoT prompting paired with contextual item stems and rubrics proved to be a significant contributor to scoring accuracy (13.44% increase for zero-shot; 3.7% increase for few-shot). We found a more balanced accuracy across different proficiency categories when CoT was used with a scoring rubric, highlighting the importance of domain-specific reasoning in enhancing the effectiveness of LLMs in scoring tasks. We also found that GPT-4 demonstrated superior performance over GPT-3.5 in various scoring tasks when combined with the single-call greedy sampling or ensemble voting nucleus sampling strategy, showing an 8.64% difference. In particular, the single-call greedy sampling strategy with GPT-4 outperformed other approaches. This study also demonstrates the potential of LLMs in facilitating explainable and interpretable automatic scoring, emphasizing that CoT enhances accuracy and transparency, particularly when used with item stem and scoring rubrics.
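
To make the strategies named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It shows how a scoring prompt might be assembled from the optional components (few-shot examples, CoT instruction, item stem, rubric), how ensemble voting could combine several sampled scores, and how accuracy against human scores is computed. All function names and the prompt wording are hypothetical; the actual WRVRT-developed prompts are not given in the abstract, and no model API is called here.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def build_prompt(
    response: str,
    examples: Sequence[Tuple[str, int]] = (),
    use_cot: bool = False,
    item_stem: Optional[str] = None,
    rubric: Optional[str] = None,
) -> str:
    """Assemble a scoring prompt from optional parts (wording is hypothetical)."""
    parts = []
    if item_stem:
        parts.append(f"Item: {item_stem}")
    if rubric:
        parts.append(f"Scoring rubric: {rubric}")
    # Few-shot: include scored examples; zero-shot: leave `examples` empty.
    for text, score in examples:
        parts.append(f"Student response: {text}\nScore: {score}")
    parts.append(f"Student response: {response}")
    if use_cot:
        parts.append("Reason step by step, then give the final score.")
    else:
        parts.append("Give the final score only.")
    return "\n\n".join(parts)


def majority_vote(scores: Sequence[int]) -> int:
    """Ensemble voting: most frequent score across several sampled calls.

    Ties go to the score that appeared first in `scores`.
    """
    return Counter(scores).most_common(1)[0][0]


def accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of responses whose predicted score matches the human score."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Example: a zero-shot CoT prompt with item stem and rubric (placeholder text).
prompt = build_prompt(
    "The ice melts because heat moves into it.",
    use_cot=True,
    item_stem="Explain why the ice melts.",
    rubric="1 = mentions energy transfer; 0 = does not.",
)
voted = majority_vote([1, 0, 1])  # three hypothetical sampled scores
```

In this sketch, single-call greedy sampling would correspond to taking one model output directly, while the ensemble strategy would pass several sampled outputs through `majority_vote`.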

Journal: Computers and Education: Artificial Intelligence
Publication Date: Feb 27, 2024
Citations: 12
License type: cc-by-nc-nd

Similar Papers

Improving the use of LLMs in radiology through prompt engineering: from precision prompts to zero-shot learning.
Fabian Bamberg ... Alexander Rau
RöFo: Fortschritte auf dem Gebiete der Röntgenstrahlen und der Nuklearmedizin | VOL. 196
26 Feb 2024

Enhancing Information Retrieval in the Drilling Domain: Zero-Shot Learning with Large Language Models for Question-Answering
F J Pacis ... T Wiktorski
27 Feb 2024

Improving deep learning with prior knowledge and cognitive models: A survey on enhancing explainability, adversarial robustness and zero-shot learning
Fuseini Mumuni ... Alhassan Mumuni
Cognitive Systems Research | VOL. 84
30 Nov 2023

Exploring Large Language Models for Detecting Online Vaccine Reactions.
Sedigh Khademi ... Jim Buttery
Studies in health technology and informatics | VOL. 318
24 Sep 2024
