Abstract

Determining the extent to which two text snippets are semantically equivalent is a well-researched topic in the areas of natural language processing, information retrieval and text summarization. The sentence-to-sentence similarity scoring is extensively used in both generic and query-based summarization of documents as a significance or a similarity indicator. Nevertheless, most of these applications utilize the concept of semantic similarity measure only as a tool, without paying importance to the inherent properties of such tools that ultimately restrict the scope and technical soundness of the underlined applications. This paper aims to contribute to fill in this gap. It investigates three popular WordNet hierarchical semantic similarity measures, namely path-length, Wu and Palmer and Leacock and Chodorow, from both algebraical and intuitive properties, highlighting their inherent limitations and theoretical constraints. We have especially examined properties related to range and scope of the semantic similarity score, incremental monotonicity evolution, monotonicity with respect to hyponymy/hypernymy relationship as well as a set of interactive properties. Extension from word semantic similarity to sentence similarity has also been investigated using a pairwise canonical extension. Properties of the underlined sentence-to-sentence similarity are examined and scrutinized. Next, to overcome inherent limitations of WordNet semantic similarity in terms of accounting for various Part-of-Speech word categories, a WordNet “All word-To-Noun conversion” that makes use of Categorial Variation Database (CatVar) is put forward and evaluated using a publicly available dataset with a comparison with some state-of-the-art methods. The finding demonstrates the feasibility of the proposal and opens up new opportunities in information retrieval and natural language processing tasks.

Highlights

  • Measures of semantic similarities have been primarily developed for quantifying the extent of resemblance between two words or two concepts using pre-existing resources that encode word-to-word or concept-to-concept relationships as in WordNet lexical database [1, 2].Accurate comparison between text snippets for the similarity determination is a fundamental prerequisite in the areas of natural language processing, information retrieval, text summarization, document clustering, question answering, automatic essay scoring and others [3, 4]

  • – WN corresponds to the case where the sentence similarity is calculated using the canonical extension (26) with Wu and Palmer word-to-word semantic similarity measure

  • – WNwC corresponds to the case where the semantic similarity is calculated using the canonical extension (26) with Wu and Palmer word-to-word semantic similarity measure after performing the “all-to-noun” conversation

Read more

Summary

Introduction

Accurate comparison between text snippets for the similarity determination is a fundamental prerequisite in the areas of natural language processing, information retrieval, text summarization, document clustering, question answering, automatic essay scoring and others [3, 4]. The quantification of the similarity between candidate sentences can allow us to promote a good summary coverage and prevent redundancy in automatic text summarization [5]. In this respect, the similarity values of sentence pairs are sometimes used as part of the statistical features of the text summarization system. Query-based summarization crucially relies on similarity scores for summary extraction [7]. Among other text similarity applications, one shall mention machine translation [13], text classification [3, 4, 14], database where similarity is used for schema matching [15], and bioinformatics [16]

Objectives
Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.