Accuracy and reliability of large language models in assessing learning outcomes achievement across cognitive domains.

Swapna Haresh Teckwani, Amanda Huee-Ping Wong, Nathasha Vihangi Luke, Ivan Cherh Chiet Low

doi:10.1152/advan.00137.2024

Abstract

The advent of artificial intelligence (AI), particularly large language models (LLMs) like ChatGPT and Gemini, has significantly impacted the educational landscape, offering unique opportunities for learning and assessment. Grading written assessments is traditionally viewed as a laborious and subjective process. This study evaluated the accuracy and reliability of LLMs in assessing the achievement of learning outcomes across different cognitive domains in a scientific inquiry course on sports physiology. Human graders and three LLMs (GPT-3.5, GPT-4o, and Gemini) scored submitted student assignments according to a set of rubrics aligned with various cognitive domains, namely "Understand," "Analyze," and "Evaluate" from the revised Bloom's taxonomy, and "Scientific Inquiry Competency." Our findings revealed that while the LLMs demonstrated some level of competency, they do not yet meet the assessment standards of human graders. Specifically, interrater reliability (percentage agreement and correlation analysis) between human graders was superior to the reliability between the two grading rounds of each LLM. Furthermore, concordance and correlation between human and LLM graders were mostly moderate to poor, both in overall scores and across the prespecified cognitive domains. The results suggest a future where AI could complement human expertise in educational assessment, but they underscore the importance of adaptive learning by educators and continuous improvement in current AI technologies to fully realize this potential.

NEW & NOTEWORTHY The advent of large language models (LLMs) such as ChatGPT and Gemini has offered new learning and assessment opportunities to integrate artificial intelligence (AI) with education. This study evaluated the accuracy of LLMs in assessing an assignment from a course on sports physiology. Concordance and correlation between human graders and LLMs were mostly moderate to poor. The findings suggest AI's potential to complement human expertise in educational assessment alongside the need for adaptive learning by educators.
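
The two reliability measures named in the abstract, percentage agreement and correlation, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the exact-match definition of agreement (with an optional tolerance), the choice of Pearson's coefficient, and the sample scores, none of which come from the study itself.

```python
import math


def percentage_agreement(scores_a, scores_b, tolerance=0):
    """Percent of items on which two graders agree within `tolerance` points.

    Hypothetical helper: the study's exact agreement criterion is not
    stated in the abstract, so exact match (tolerance=0) is assumed.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= tolerance)
    return 100.0 * matches / len(scores_a)


def pearson_r(scores_a, scores_b):
    """Pearson correlation between two graders' scores (assumed metric)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    var_a = sum((a - mean_a) ** 2 for a in scores_a)
    var_b = sum((b - mean_b) ** 2 for b in scores_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when a grader's scores are constant")
    return cov / math.sqrt(var_a * var_b)


# Made-up rubric scores for four assignments (not data from the study).
human_grader = [3, 4, 2, 5]
llm_grader = [3, 4, 3, 5]

agreement = percentage_agreement(human_grader, llm_grader)
correlation = pearson_r(human_grader, llm_grader)
```

The same two functions could compare two human graders, two grading rounds of one LLM, or a human grader against an LLM, which are the three comparisons the abstract describes.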

Journal: Advances in Physiology Education