Corpus-based Similarity Research Articles

Along with the development of distance education, emerges the demand for virtual environments as the automated evaluation studies of essays that has already produced promising results. However, when dealing with short answers, replicating the decisions of a human grader is still a challenge, as the portability of essay evaluation techniques to short answers has not produced results with the same level of accuracy. In this sense, the present paper aims to foster the development of studies in the field of automated evaluation of short discursive answers. The related works presented three main approaches: text-to-text similarity, knowledge-based similarity that rely on synonym dictionary and corpus-based similarity that rely on a related corpus. The present study has employed an n-gram based similarity and a categorization process applied to three sets of answers to questions in Portuguese language: two of them (Biology and Geography) obtained from an admission process to higher education and the third (Philosophy) from a virtual learning environment. The employed method was comprised of a five-stage pipeline architecture: corpus selection, preprocessing, variable generation, classification and accuracy validation. In these three corpora, several similarity measurements and distances resulting from the unigrams/bigrams combination were explored. During the classification stage, two methods were used: multiple linear regression and K-Nearest Neighbors (KNN). At the same time some research questions were revised leading to meaningful findings. As for the system efficiency regarding the Biology corpus, the accuracy was 84.01 system vs. human compared to 93.85 human vs. human; for the Geography corpus, the accuracy was 86.29 system vs. human compared to 84.93 human vs. human; and for the Philosophy corpus, findings revealed 81.59 accuracy system vs. human. These results, when compared with those obtained from recent experiments produced by other techniques indicate advantages in terms of a simpler method added to good accuracy.

Read full abstract

ABSTRACT Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity through partitioning them into three approaches; String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combination between these similarities are presented. General Terms Text Mining, Natural Language Processing. Keywords BasedText Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity. NeedlemanWunsch 1. INTRODUCTION Text similarity measures play an increasingly important role in text related research and applications in tasks Nsuch as information retrieval, text classification, document clustering, topic detection, topic tracking, questions generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding similarity between words is a fundamental part of text similarity which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they have the same thing, are opposite of each other, used in the same way, used in the same context and one is a type of another. DistanceLexical similarity is introduced in this survey though different String-Based algorithms, Semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular for each type will be presented briefly. This paper is organized as follows: Section two presents String-Based algorithms by partitioning them into two types character-based and term-based measures. Sections three and four introduce Corpus-Based and knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five and finally section six presents conclusion of the survey.

Read full abstract

Corpus-based Similarity Research Articles

Related Topics

Articles published on Corpus-based Similarity

Automated Evaluation of Short Answers Using Text Similarity for the Portuguese Language

Organizational Principles of Abstract Words in the Human Brain

Semantic Similarity Assessment Using Differential Evolution Algorithm in Continuous Vector Space

Arabic Short Answer Scoring with Effective Feedback for Students

A Survey of Text Similarity Approaches

Short Answer Grading Using String Similarity And Corpus-Based Similarity

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Corpus-based Similarity Research Articles

Related Topics

Articles published on Corpus-based Similarity

Automated Evaluation of Short Answers Using Text Similarity for the Portuguese Language

Organizational Principles of Abstract Words in the Human Brain

Semantic Similarity Assessment Using Differential Evolution Algorithm in Continuous Vector Space

Arabic Short Answer Scoring with Effective Feedback for Students

A Survey of Text Similarity Approaches

Short Answer Grading Using String Similarity And Corpus-Based Similarity