Utilizing large language models for EFL essay grading: An examination of reliability and validity in rubric‐based assessments

Fatih Yavuz,Gamze Yavaş Çelik,Özgür Çelik

doi:10.1111/bjet.13494


Open Access

https://doi.org/10.1111/bjet.13494


Journal: British Journal of Educational Technology	Publication Date: Jun 4, 2024
Citations: 1	License type: CC BY-NC-ND 4.0

Abstract

This study investigates the validity and reliability of generative large language models (LLMs), specifically ChatGPT and Google's Bard, in grading student essays in higher education based on an analytical grading rubric. A total of 15 experienced English as a foreign language (EFL) instructors and two LLMs were asked to evaluate three student essays of varying quality. The grading scale comprised five domains: grammar, content, organization, style & expression and mechanics. The results revealed that the fine‐tuned ChatGPT model demonstrated a very high level of reliability with an intraclass correlation (ICC) score of 0.972, the default ChatGPT model exhibited an ICC score of 0.947 and Bard showed a substantial level of reliability with an ICC score of 0.919. Additionally, a significant overlap was observed in certain domains when comparing the grades assigned by LLMs and human raters. In conclusion, the findings suggest that while LLMs demonstrated notable consistency and potential for grading competency, further fine‐tuning and adjustment are needed for a more nuanced understanding of non‐objective essay criteria. The study not only offers insights into the potential use of LLMs in grading student essays but also highlights the need for continued development and research.

Practitioner notes

What is already known about this topic

Large language models (LLMs), such as OpenAI's ChatGPT and Google's Bard, are known for their ability to generate text that mimics human‐like conversation and writing.

LLMs can perform various tasks, including essay grading.

Intraclass correlation (ICC) is a statistical measure used to assess the reliability of ratings given by different raters (in this case, EFL instructors and LLMs).

What this paper adds

The study makes a unique contribution by directly comparing the grading performance of expert EFL instructors with two LLMs, ChatGPT and Bard, using an analytical grading scale.

It provides robust empirical evidence showing high reliability of LLMs in grading essays, supported by high ICC scores.

It specifically highlights that the overall efficacy of LLMs extends to certain domains of essay grading.

Implications for practice and/or policy

The findings open up potential new avenues for utilizing LLMs in academic settings, particularly for grading student essays, thereby possibly alleviating the workload of educators.

The paper's insistence on the need for further fine‐tuning of LLMs underlines the continual interplay between technological advancement and its practical applications.

The results lay down a footprint for future research in advancing the use of AI in essay grading.
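
To make the intraclass correlation mentioned in the abstract more concrete, the following is a minimal sketch of one common variant, ICC(2,1) (two-way random effects, absolute agreement, single rater). The abstract does not say which ICC form the authors used, so the choice of variant is an assumption made for illustration. The function name `icc_2_1` and the `ratings` table are hypothetical, and the scores are illustrative values, not data from the study.

```python
from statistics import mean


def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` is a list of rows: one row per essay, one column per rater.
    Assumption: this ICC variant is chosen for illustration only.
    """
    n = len(scores)       # number of essays (rated subjects)
    k = len(scores[0])    # number of raters
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(col) for col in zip(*scores)]

    # Two-way ANOVA sums of squares (essays as rows, raters as columns).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    # Mean squares.
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical scores: 6 essays (rows) rated by 4 raters (columns).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(ratings), 3))
```

Values near 1 indicate that raters assign nearly identical scores to each essay. Because this variant measures absolute agreement, a rater who is consistently stricter or more lenient than the others lowers the coefficient even when the ranking of essays is the same.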

Full Text