Efficient Hybrid Semantic Text Similarity using Wordnet and a Corpus

Issa Atoum, Ahmed Otoom

doi:10.14569/ijacsa.2016.070917

Abstract

Text similarity plays an important role in natural language processing tasks such as question answering and text summarization. At present, state-of-the-art text similarity algorithms rely on inefficient word pairings and/or knowledge derived from large corpora such as Wikipedia. This article evaluates previous word similarity measures on benchmark datasets and then uses a hybrid word similarity measure in a novel text similarity measure (TSM). The proposed TSM is based on information content and WordNet semantic relations. It accounts for exact word matches, the lengths of both sentences in a pair, and the maximum similarity between a word and the compared text. Compared with other well-known measures, TSM surpasses or is comparable with the best algorithms in the literature.

Highlights

Text similarity is a field of research whereby two terms or expressions are assigned a score based on the likeness of their meaning.
We showed that the semantic similarity measures [66]
We found that the performance of TSM was mainly due to the proposed text similarity measure and the borrowed word similarity measure.

Summary

Introduction

Text similarity is a field of research whereby two terms or expressions are assigned a score based on the likeness of their meaning. There are three predominant approaches to computing text similarity: corpus-based distributional semantic models (DSMs), knowledge-based models, and hybrid methods. DSMs assume that the meaning of a word can be inferred from its usage (i.e., its distribution in text). They rest on the following hypothesis: linguistic items with similar distributions have similar meanings [10]. The vector-based representation is most often built from large text collections [5]. In this category, latent Dirichlet allocation (LDA) assumes that each document is a mixture of topics, where each topic probabilistically generates various words [6], [11]–[13]. Latent semantic analysis (LSA) is based on the assumption that words with similar meanings tend to occur in similar texts [6], [9], [14], [15].
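
The distributional hypothesis can be illustrated with a minimal Python sketch. It is not the method proposed in this paper, and it is neither LDA nor LSA: it only builds raw co-occurrence count vectors and compares them with cosine similarity. The function names, the window size, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            # Context = up to `window` tokens on each side, excluding the word itself.
            context = tokens[lo:i] + tokens[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus; real DSMs are built from large text collections.
toy_corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "stocks fell on monday",
]

vectors = cooccurrence_vectors(toy_corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_stocks = cosine(vectors["cat"], vectors["stocks"])
print(f"cat vs dog:    {sim_cat_dog:.3f}")
print(f"cat vs stocks: {sim_cat_stocks:.3f}")
```

In this toy corpus, "cat" and "dog" appear in similar contexts and therefore score higher than "cat" and "stocks", which share no context words. Models such as LSA go further by reducing the dimensionality of this kind of count matrix.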

Methods

Results

Conclusion