Abstract

From their earliest origins, automated essay scoring systems have strived to emulate human essay scores, treating them as their ultimate validity criterion. Consequently, the importance (or weight), and even the identity, of the computed essay features in the composite machine score were determined by statistical techniques that sought to predict human scores from essay features as accurately as possible. However, machine evaluation of essays is fundamentally different from human evaluation and is therefore unlikely to measure the same set of writing skills. Feature weights derived from human-prediction machine scores (which reflect the features' importance in the composite score) are thus bound to reflect statistical artifacts. This article suggests alternative feature weighting schemes based on the premise of maximizing the reliability and internal consistency of the composite score. In the context of a large-scale writing assessment, the article shows that these alternative weighting schemes differ significantly from human-prediction weights and give rise to comparable or even superior reliability and validity coefficients.
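The contrast between the two weighting approaches can be illustrated with a small sketch. The Python example below is not taken from the article: it uses synthetic feature data, ordinary least squares as a stand-in for human-prediction weighting, and the standard generalized-eigenvalue solution for weights that maximize coefficient alpha of the weighted composite; the paper's exact weighting schemes and estimation details may differ.

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

# Synthetic data: n essays, k machine-computed features, one human score.
n, k = 500, 6
skill = rng.normal(size=(n, 1))                     # shared writing-skill factor
X = skill + rng.normal(scale=1.0, size=(n, k))      # features share common variance
human = skill[:, 0] + rng.normal(scale=0.8, size=n) # noisy human rating

# 1) Human-prediction weights: least-squares regression of the human score
#    on the (centered) features.
Xc = X - X.mean(axis=0)
yc = human - human.mean()
w_ols, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

# 2) Reliability-maximizing weights: maximize coefficient alpha of the
#    weighted composite, alpha(w) = k/(k-1) * (1 - w'Dw / w'Sw), where S is
#    the feature covariance matrix and D = diag(S).  The maximizer is the
#    leading generalized eigenvector of S w = lambda D w.
S = np.cov(X, rowvar=False)
D = np.diag(np.diag(S))
_, eigvecs = eigh(S, D)                 # eigenvalues in ascending order
w_alpha = eigvecs[:, -1]
if w_alpha.sum() < 0:                   # fix sign for readability
    w_alpha = -w_alpha

def coefficient_alpha(w, S):
    """Coefficient alpha of the composite score formed with weights w."""
    m = S.shape[0]
    total_var = w @ S @ w
    item_var = w @ np.diag(np.diag(S)) @ w
    return m / (m - 1) * (1 - item_var / total_var)

print("OLS weights:      ", np.round(w_ols / np.abs(w_ols).sum(), 3))
print("Alpha-max weights:", np.round(w_alpha / np.abs(w_alpha).sum(), 3))
print("alpha (OLS):      ", round(coefficient_alpha(w_ols, S), 3))
print("alpha (alpha-max):", round(coefficient_alpha(w_alpha, S), 3))

With equally informative, equally correlated features the two weight vectors nearly coincide; when features differ in their variance or in how strongly they predict the human score, the regression weights drift toward the best single predictors while the alpha-maximizing weights favor internal consistency of the composite, which is the kind of divergence the abstract describes.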
