Abstract

Previous work has shown that automated essay scoring systems, in particular machine learning-based systems, are not capable of assessing the quality of essays but instead rely on essay length, a factor irrelevant to writing proficiency. In this work, we first show that state-of-the-art systems, namely recent neural essay scoring systems, might also be influenced by the correlation between essay length and scores in a standard dataset. In our evaluation, a very simple neural model achieves state-of-the-art performance on the standard dataset. To consider essay content without taking essay length into account, we introduce a simple neural model that assesses the similarity of content between an input essay and essays assigned different scores. This neural model achieves performance comparable to the state of the art on the standard dataset as well as on a second dataset. Our findings suggest that neural essay scoring systems should consider the characteristics of datasets in order to focus on text quality.

Highlights

  • Automated essay scoring (AES) is the task of assigning a score for a given essay, aiming to replicate human scoring results

  • ... of English as a Foreign Language dataset (TOEFL, Blanchard et al. (2013)), which has a lower correlation between essay length and scores. Second, we demonstrate that considering essay ...

  • AES systems are not capable of assessing the quality of essays (Winerip, 2005; Ben-Simon and Bennett, 2007; Wolfe et al., 2016), but work ...

  • We demonstrate that this neural model achieves performance comparable to the state of the art on both datasets

  • Automated Student Assessment Prize (ASAP), we view this as evidence that the performance of previous neural models might be influenced by the correlation of essay length and scores in the target dataset


Summary

Introduction

Automated essay scoring (AES) is the task of assigning a score for a given essay, aiming to replicate human scoring results. ... of English as a Foreign Language dataset (TOEFL, Blanchard et al. (2013)), which has a lower correlation between essay length and scores. Second, we demonstrate that considering essay ... essay length and scores in the standard dataset leads to top performance. Recent neural essay scoring systems, which do not employ a feature capturing essay length explicitly, achieve state-of-the-art performance.
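
The dependence between essay length and score that this summary refers to can be measured directly on any collection of scored essays. The following is a minimal sketch, not the authors' procedure: it assumes length is counted in whitespace-separated tokens and uses the Pearson coefficient, and the function names `pearson` and `length_score_correlation` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def length_score_correlation(essays, scores):
    """Correlate essay length with score.

    Assumption: length is the number of whitespace-separated tokens.
    """
    lengths = [len(essay.split()) for essay in essays]
    return pearson(lengths, scores)
```

A coefficient close to 1 on a dataset would indicate that a scorer could do well by tracking length alone, which is the concern raised for the standard dataset; a lower coefficient, as reported for the TOEFL data, leaves less room for that shortcut.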
