Abstract

The quality of writing in a second language (L2) is one of the indicators of proficiency that many college students must demonstrate to be eligible for departmental studies. Although software programs such as Intelligent Essay Assessor and IntelliMetric have been introduced to evaluate second-language writing quality, overall assessment of writing proficiency is still largely carried out by trained human raters. The question that now needs to be addressed is whether generative artificial intelligence (AI) based on large language models (LLMs) could assist, and possibly replace, human raters in the burdensome task of assessing student-written academic work. For this purpose, first-year college students (n = 43) were given a paragraph-writing task, which was evaluated against the same writing criteria by the generative pre-trained transformer ChatGPT-3.5 and by five human raters. The scores assigned by the five human raters showed statistically significant positive correlations ranging from low to high. A slight to fair, but statistically significant, level of agreement was observed between the scores assigned by ChatGPT-3.5 and those of two of the human raters. The findings suggest that reliable results can be obtained when the scores of an automated application and multiple human raters are considered together, and that ChatGPT may potentially assist human raters in assessing L2 college writing.
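The abstract does not name the statistical tools behind the reported correlation and agreement figures, but rater comparisons of this kind are commonly carried out with Pearson correlation and Cohen's kappa. The sketch below is a minimal illustration under that assumption, not the study's actual analysis: the 0-10 rubric scale, the score values, and the use of scipy and scikit-learn are all hypothetical.

# Illustrative sketch (hypothetical data): comparing an automated rater's
# scores with one human rater's scores for the same set of paragraphs.
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Hypothetical rubric scores on a 0-10 scale.
chatgpt_scores = [7, 6, 8, 5, 9, 6, 7, 8, 4, 7]
human_scores   = [8, 6, 7, 5, 9, 7, 7, 8, 5, 6]

# Strength of the linear relationship between the two sets of scores.
r, p_value = pearsonr(chatgpt_scores, human_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")

# Agreement beyond chance, treating rubric scores as ordinal categories;
# quadratic weights penalise large disagreements more than adjacent ones.
kappa = cohen_kappa_score(chatgpt_scores, human_scores, weights="quadratic")
print(f"Weighted Cohen's kappa = {kappa:.2f}")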
