This study compares the performance of two prominent AI language models, ERNIE Bot 4.0 Turbo and ChatGPT 4o, in evaluating first-year undergraduate persuasive essays within the social sciences domain. Drawing on the Louvain Corpus of Native English Essays, a comprehensive collection of academic writing by British and American university students, the study examines the models’ capabilities in assessing the grammatical correctness, vocabulary usage, coherence, content depth, and writing style of the essays. A structured evaluation framework based on IELTS writing criteria is adopted to assess the models’ performance. Forty persuasive essays from the Louvain Corpus were evaluated by both AI models, and their assessments were compared with human raters’ evaluations to ensure validity. The findings reveal distinct differences in the assessment styles of the two models. ChatGPT 4o exhibits a more critical approach, pinpointing areas for improvement such as insufficient argument development, coherence issues, and grammatical errors. Conversely, ERNIE Bot 4.0 Turbo offers a more balanced assessment, acknowledging essays’ strengths while suggesting areas for improvement. Notably, ERNIE Bot’s evaluations highlight potential biases in AI-based assessment systems, particularly an unequal emphasis on certain viewpoints. This comparative examination offers valuable insight into the strengths and limitations of AI models in assessing academic writing, underscoring the importance of combining complementary AI capabilities to build more comprehensive and effective feedback systems for learners. By understanding these differences, researchers and educators can better utilize AI-assisted essay evaluation systems to enhance student learning experiences.