Abstract

Recent advancements in natural language processing, computational linguistics, and Artificial Intelligence (AI) have propelled the use of Large Language Models (LLMs) in Automated Essay Scoring (AES), offering efficient and unbiased writing assessment. This study assesses the reliability of LLMs in AES tasks, focusing on scoring consistency and alignment with human raters. We explore the impact of prompt engineering, temperature settings, and multi-level rating dimensions on the scoring performance of LLMs. Results indicate that prompt engineering significantly affects LLM reliability, with GPT-4 showing marked improvement over GPT-3.5 and Claude 2, achieving 112% and 114% increases in scoring accuracy under the criteria- and sample-referenced justification prompt. Temperature settings also influence the output consistency of LLMs, with lower temperatures producing scores more closely aligned with human evaluations, which is essential for maintaining fairness in large-scale assessment. Regarding multi-dimensional writing assessment, GPT-4 performs well on the Ideas (QWK = 0.551) and Organization (QWK = 0.584) dimensions under well-crafted prompt engineering. These findings pave the way for a comprehensive exploration of LLMs' broader educational implications, offering insights into their capacity to refine and potentially transform writing instruction, assessment, and the delivery of diagnostic and personalized feedback in the AI-powered educational age. While this study focused on the reliability and human alignment of LLM-powered multi-dimensional AES, future research should broaden its scope to encompass diverse writing genres and a more extensive sample drawn from varied backgrounds.
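
The agreement values reported above are Quadratic Weighted Kappa (QWK) scores between LLM-assigned and human-assigned ratings. As a minimal illustration of how such a value can be computed, the Python sketch below uses scikit-learn's cohen_kappa_score with quadratic weights; the score lists are invented for demonstration and this is not necessarily the exact implementation used in the study.

    # Minimal sketch: Quadratic Weighted Kappa (QWK) between human and LLM scores.
    # The score lists below are hypothetical; in practice, QWK measures agreement
    # between LLM-assigned and human-assigned rubric scores on a given dimension.
    from sklearn.metrics import cohen_kappa_score

    human_scores = [3, 4, 2, 5, 3, 4, 1, 2, 4, 3]   # human rater scores (illustrative)
    llm_scores   = [3, 4, 3, 4, 3, 4, 2, 2, 5, 3]   # LLM scores for the same essays (illustrative)

    qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
    print(f"QWK = {qwk:.3f}")

Quadratic weighting penalizes large disagreements (e.g., a 1 versus a 5) more heavily than near-misses, which is why QWK is a common choice for ordinal rubric scales in AES evaluation.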
