Abstract

The growth of distance education has increased the demand for virtual learning environments, and studies on the automated evaluation of essays have already produced promising results. However, when dealing with short answers, replicating the decisions of a human grader is still a challenge, as porting essay evaluation techniques to short answers has not yielded the same level of accuracy. In this sense, the present paper aims to foster the development of studies in the field of automated evaluation of short discursive answers. Related work presents three main approaches: text-to-text similarity; knowledge-based similarity, which relies on a synonym dictionary; and corpus-based similarity, which relies on a related corpus. The present study employed an n-gram based similarity and a categorization process applied to three sets of answers to questions in Portuguese: two of them (Biology and Geography) obtained from a higher-education admission process and the third (Philosophy) from a virtual learning environment. The method comprised a five-stage pipeline: corpus selection, preprocessing, variable generation, classification and accuracy validation. In these three corpora, several similarity measurements and distances resulting from unigram/bigram combinations were explored. During the classification stage, two methods were used: multiple linear regression and K-Nearest Neighbors (KNN). At the same time, some research questions were revisited, leading to meaningful findings. Regarding system efficiency, for the Biology corpus the system vs. human accuracy was 84.01, compared with 93.85 for human vs. human; for the Geography corpus, the system vs. human accuracy was 86.29, compared with 84.93 for human vs. human; and for the Philosophy corpus, the system vs. human accuracy was 81.59. Compared with recent results obtained with other techniques, these figures indicate the advantage of a simpler method combined with good accuracy.
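
To make the described pipeline concrete, the sketch below is a minimal illustration (not the authors' implementation, and all names are illustrative): it computes cosine similarity between the unigram and bigram counts of a student answer and a reference answer, and grades a new answer by majority vote among its k nearest neighbors in that feature space.

```python
# Minimal sketch (illustrative only): unigram/bigram cosine similarity features
# between a student answer and a reference answer, followed by KNN grading.
from collections import Counter
import math

def ngrams(text, n):
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cosine_similarity(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[g] * cb[g] for g in set(ca) & set(cb))
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def answer_features(student_answer, reference_answer):
    # One similarity feature per n-gram order (unigrams and bigrams).
    return [cosine_similarity(ngrams(student_answer, n), ngrams(reference_answer, n)) for n in (1, 2)]

def knn_grade(features, training_set, k=3):
    # training_set: list of (feature_vector, human_grade) pairs.
    by_distance = sorted(training_set, key=lambda item: math.dist(item[0], features))
    nearest = [grade for _, grade in by_distance[:k]]
    return max(set(nearest), key=nearest.count)  # majority vote among the k neighbors
```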

Highlights

  • Evaluations of discursive answers are of great relevance, as they assess learning outcomes, emphasizing students’ performance in writing, including higher-order thinking skills, such as synthesis and analysis (Magnini et al., 2005; Zupanc and Bosnic, 2017; Shermis et al., 2002)

  • Research on the automated evaluation of written texts has been underway since the 1960s (Page, 1966; Hearst, 2000; Noorbehbahani and Kardan, 2011), producing a variety of systems, especially for scoring essays, as in the examples below: E-rater (Burstein et al., 1998) relies on statistical surface-feature models as well as on Natural Language Processing (NLP) techniques; its adjacent agreement ranges from 0.87 to 0.94

  • In the related work on the automatic assessment of short answers, we find three main approaches: text-to-text similarity; knowledge-based similarity, relying on a synonym dictionary (e.g., WordNet) to expand the vocabulary; and corpus-based similarity, relying on a related corpus to expand the vocabulary (a sketch of the knowledge-based approach follows this list)

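As an illustration of the knowledge-based approach mentioned in the last highlight (not taken from the paper), the sketch below expands each answer's vocabulary with WordNet synonyms before measuring set overlap. It assumes NLTK with the English WordNet corpus installed (`nltk.download('wordnet')`); the paper itself deals with answers in Portuguese.

```python
# Illustrative sketch of knowledge-based similarity: expand both answers with
# WordNet synonyms, then compute the Jaccard overlap of the expanded vocabularies.
from nltk.corpus import wordnet as wn

def expand_with_synonyms(tokens):
    expanded = set(tokens)
    for token in tokens:
        for synset in wn.synsets(token):
            expanded.update(lemma.lower() for lemma in synset.lemma_names())
    return expanded

def knowledge_based_overlap(student_answer, reference_answer):
    student = expand_with_synonyms(student_answer.lower().split())
    reference = expand_with_synonyms(reference_answer.lower().split())
    union = student | reference
    return len(student & reference) / len(union) if union else 0.0
```
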

Summary

Introduction

Evaluations of discursive answers are of great relevance, as they assess learning outcomes, emphasizing students’ performance in writing, including higher-order thinking skills, such as synthesis and analysis (Magnini et al., 2005; Zupanc and Bosnic, 2017; Shermis et al., 2002). Considering these aspects, automated evaluation may represent an essential tool in learning environments. Research on the automated evaluation of written texts has been underway since the 1960s (Page, 1966; Hearst, 2000; Noorbehbahani and Kardan, 2011), producing a variety of systems, especially for scoring essays (long discursive answers), as in the examples below: E-rater (Burstein et al., 1998) relies on statistical surface-feature models (word frequency/sentences, grammar mistakes, readability, etc.) as well as on Natural Language Processing (NLP) techniques; its adjacent agreement ranges from 0.87 to 0.94 (a difference of at most 1 point on a six-point scale)
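
For illustration only, the toy function below computes the kind of statistical surface features mentioned above (it is not E-rater, whose actual feature set is far richer): average sentence length and the type/token ratio of an essay.

```python
# Toy illustration of surface features used by essay-scoring systems:
# average words per sentence and type/token ratio (vocabulary diversity).
import re

def surface_features(essay):
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    tokens = re.findall(r"\w+", essay.lower())
    return {
        "avg_words_per_sentence": len(tokens) / len(sentences) if sentences else 0.0,
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
    }
```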

