Abstract

As an essential component of many Natural Language Processing applications, semantic similarity measurement has been studied for decades. Recent research indicates that the Subject-Action-Object (SAO) structure of sentences is better suited to describing technological information, and that SAO-based similarity measures outperform classical text-based ones. The typical approach in the literature to computing the similarity between two SAO structures relies on term matching, which produces the similarity score by the Sørensen-Dice index, i.e., the proportion of matching terms among all terms in the two structures. However, we observe that the entities in SAO structures usually contain only a small number of terms, which gives the currently acknowledged methods a high recurrence rate and poor accuracy. To address this issue, we extend the Sørensen-Dice index and present a new unified framework for the SAO similarity measure that offers higher discrimination. The effectiveness of our measure is evaluated on patent data sets in the Nano-Fertilizer field. The results show that our measure significantly improves accuracy over the currently acknowledged ones. The proposed measure is flexible and robust, and can easily be used for patent similarity measurement. In addition, the extended Sørensen-Dice index is of independent interest and has potential applications to other similarity measures.

Highlights

  • Semantic similarity analysis is an indispensable module for applications in natural language processing (NLP) and related areas [1], such as text mining [2], information retrieval [3], machine learning [4], [5], and patent analysis [6], [7]

  • Based on the extended Sørensen-Dice index, we present a modular, unified framework for the SAO similarity measure, which gives higher discrimination

  • The results show that our extended Sørensen-Dice index can dramatically reduce the recurrence rate, and our proposed SAO similarity measure can significantly improve the accuracy and F-measure compared with the acknowledged one

Summary

INTRODUCTION

Semantic similarity analysis is an indispensable module for applications in natural language processing (NLP) and related areas [1], such as text mining [2], information retrieval [3], machine learning [4], [5], and patent analysis [6], [7]. [...] are proposed [24]–[26] and are widely used in patent analysis and technological evolution analysis [27]. As shown by Wang et al. [7], the performance of this commonly used method is far from desirable because of its relatively high recurrence rate and poor discrimination. This situation arises because the number of terms (words or phrases) in the SAO structures is small. We remark that, in this context, the sets are instantiated by the entities in an SAO structure, and the elements are instantiated by terms or words.

(Source: X. Li et al., Generic SAO Similarity Measure via Extended Sørensen-Dice Index)
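
To make the small-set problem concrete, the following minimal sketch computes the classical Sørensen-Dice index, 2|A ∩ B| / (|A| + |B|), over sets of terms. It is an illustration only, not the authors' implementation: the function name `dice_index`, the example entities, and the convention for two empty sets are all assumptions introduced here. It also reads "recurrence" as distinct entity pairs receiving identical scores, which is our interpretation of the term.

```python
def dice_index(terms_a, terms_b):
    """Classical Sørensen-Dice index between two collections of terms."""
    set_a, set_b = set(terms_a), set(terms_b)
    if not set_a and not set_b:
        # Convention assumed for this sketch: two empty entities are identical.
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


# Hypothetical two-term entities (not taken from the paper's data set).
entity_pairs = [
    (["nano", "fertilizer"], ["nano", "particle"]),
    (["slow", "release"], ["release", "agent"]),
    (["zinc", "oxide"], ["iron", "oxide"]),
]

scores = [dice_index(a, b) for a, b in entity_pairs]
print(scores)
```

With two-term entities the index can only take the values 0, 0.5, or 1, so the three unrelated pairs above all receive the same score. Exact term matching on such small sets therefore has little room to discriminate between pairs.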

ACKNOWLEDGED SØRENSEN-DICE INDEX FOR SAO STRUCTURES
OUR EXTENDED SØRENSEN-DICE INDEX FOR SAO STRUCTURES
A UNIFIED FRAMEWORK FOR SAO SIMILARITY MEASURE
TERM-VS-TERM SIMILARITY
ENTITY-VS-ENTITY SIMILARITY
EVALUATION AND RESULTS
THE SEMANTIC SIMILARITY MEASURES BETWEEN THE SAO STRUCTURES
CONCLUSION AND DISCUSSION
