A Survey of Text Similarity Approaches

Wael H. Gomaa, Aly A. Fahmy

doi:10.5120/11638-7118

Abstract

Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing work on text similarity by partitioning it into three approaches: String-Based, Corpus-Based and Knowledge-Based similarities. Furthermore, samples of combinations of these similarities are presented.

General Terms: Text Mining, Natural Language Processing.

Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION

Text similarity measures play an increasingly important role in text-related research and applications, in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding the similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of another. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms.

String-Based measures operate on string sequences and character composition. A string metric measures the similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type will be presented briefly.

This paper is organized as follows: Section two presents String-Based algorithms, partitioned into two types: character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations of similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
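
The split of String-Based measures into character-based and term-based types can be illustrated with a short sketch. The two measures used here, Levenshtein edit distance (character-based) and Jaccard similarity over word sets (term-based), are chosen only as common representatives of each type; the passage above does not single them out. The function names and the whitespace tokenization are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-based measure: minimum number of single-character
    insertions, deletions, or substitutions turning `a` into `b`."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def jaccard(a: str, b: str) -> float:
    """Term-based measure: shared words divided by all distinct words.
    Tokenization is a naive lowercase whitespace split (an assumption)."""
    terms_a = set(a.lower().split())
    terms_b = set(b.lower().split())
    if not terms_a and not terms_b:
        return 1.0  # two empty texts are treated as identical
    return len(terms_a & terms_b) / len(terms_a | terms_b)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))       # 3
    print(jaccard("the cat sat", "the cat ran"))  # 0.5
```

The first function compares strings character by character, so it captures lexical similarity between words. The second ignores character order within words and compares texts by the terms they share. Neither captures semantic similarity, which is where the Corpus-Based and Knowledge-Based measures come in.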

Journal: International Journal of Computer Applications
Publication Date: Apr 18, 2013
Citations: 488

