Abstract

ChatGPT, released in 2022, has garnered attention for its adaptability through prompt engineering, which enables users to guide its responses. The extent to which users can modify ChatGPT in this way is limited, however, since prompt engineering leaves the model's underlying parameters unaltered. This study therefore evaluates the effectiveness of ChatGPT, and of a fine-tuned version of the model, in essay evaluation relative to human raters. A total of 904 essays from the YELC 2011, all on the subject of physical punishment, were selected: 723 were used to fine-tune ChatGPT, and the remaining 181 were reserved for testing the model. An additional set of 200 essays on other topics, such as driving and medical issues, was included to evaluate the model's performance across themes. Inter-rater reliability indices, including correlation, agreement, Cohen's kappa, and Krippendorff's alpha, together with many-facet Rasch measurement analysis, collectively indicated that the current version of ChatGPT (gpt-3.5-turbo-0613) is not yet ready to fully supplant human raters in essay scoring. Nevertheless, after fine-tuning, the model showed a substantial level of agreement with human raters and a marked degree of consistency.
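As a concrete illustration of the agreement indices named above, the following is a minimal sketch for comparing one human rater's scores with the model's scores on the same essays. The score arrays are hypothetical placeholders, and the choice of the scikit-learn, SciPy, and krippendorff packages is our assumption, not tooling named in the study.

```python
# Sketch of the inter-rater reliability indices named in the abstract.
# Score vectors are hypothetical; library choices are assumptions.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score
import krippendorff

# Hypothetical integer band scores for the same ten essays.
human = np.array([3, 4, 2, 5, 3, 4, 4, 2, 3, 5])
model = np.array([3, 4, 3, 5, 3, 3, 4, 2, 4, 5])

r, _ = pearsonr(human, model)                # correlation
agreement = np.mean(human == model)          # exact agreement rate
kappa = cohen_kappa_score(human, model)      # Cohen's kappa
alpha = krippendorff.alpha(                  # Krippendorff's alpha
    reliability_data=np.vstack([human, model]),  # raters x essays
    level_of_measurement="ordinal",
)

print(f"r={r:.2f}  agreement={agreement:.2f}  "
      f"kappa={kappa:.2f}  alpha={alpha:.2f}")
```

For ordinal essay scores, a weighted kappa (e.g., passing weights="quadratic" to cohen_kappa_score) is often preferred over the unweighted version, since it gives partial credit for near-miss ratings.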
