Abstract

Measuring the semantic relatedness of words with complementary Wikipedia-based and WordNet-based methods takes two forms, combined and integrative, both of which aim to increase the semantic space between related words. However, each form has its own issues regarding its components and the strategy used to combine or integrate corpus-based and knowledge-based methods. In the integrative strategy, a large corpus, such as Wikipedia, is used to extract a set of related words for a particular concept as a basis for searching the WordNet space. The drawback of this strategy is its use of a fixed scaling parameter, which fits only the particular dataset for which it yields scores close to the human scores. Other corpus-based methods use an experimentally determined cut-off threshold to reduce the semantic space and narrow the search to a more accurate semantic space. Such methods take into account only the frequency of bigrams and ignore the frequency of individual terms. Knowledge-based methods that use gloss overlap have a similar limitation: they lose many valuable relatedness features that would support a more accurate measurement. This paper therefore proposes a new Information Content Glossary Relatedness (ICGR) approach in two steps. First, an Extended-PMI based on a cut-off density threshold is used to extract a Robust Relatedness Vector set (RVS) from a large Wikipedia dataset. Second, a Semantic Structural Information (SSI) method uses the RVS as a fulcrum to identify the most related gloss in WordNet and to select the top 5 glosses related to each RVS. The results show that the proposed approach outperforms state-of-the-art methods: the Extended-PMI achieved a Spearman’s correlation of 0.89 with the human scores, and the ICGR approach achieved a Spearman’s correlation of 0.8 with the human scores.
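
For reference, the standard pointwise mutual information of a word pair is PMI(x, y) = log2(p(x, y) / (p(x) p(y))), which uses the frequencies of the individual terms as well as the frequency of the bigram. The sketch below computes it from adjacent-word counts and drops pairs seen fewer than a fixed number of times. It is not the paper's Extended-PMI, whose density threshold is not specified in this abstract; the function name `pmi_scores`, the parameter `min_bigram_count`, and the toy sentence are assumptions made for illustration only.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_bigram_count=2):
    """Standard PMI for adjacent word pairs (illustrative, not Extended-PMI).

    Pairs that occur fewer than `min_bigram_count` times are discarded,
    which plays the role of a simple frequency cut-off.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_bigram_count:
            continue  # below the cut-off: treat as noise
        p_xy = count / n_bigrams
        p_x = unigrams[w1] / n_unigrams
        p_y = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy input (hypothetical); a real run would use tokenized Wikipedia text.
tokens = "new york is a big city and new york is busy".split()
scores = pmi_scores(tokens, min_bigram_count=2)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

Only the pairs that reach the cut-off survive, so raising `min_bigram_count` shrinks the search space at the cost of discarding rare but possibly related pairs.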

Highlights

  • The similarity of words expressed in a natural language is a challenge in NLP in several domains

  • The full Information Content Glossary Relatedness (ICGR) approach combines the Extended-PMI with the Semantic Structural Information (SSI) method to identify the most related gloss for each vector in the SRVs; the semantic space of each word in the short text is then expanded with words from the Most Related Gloss (MRG)

  • A dataset of 178.6 MB of articles was selected from the Wikipedia dumps on November 20, 2017. Its bigrams generated a large word-pair space, which the Extended-PMI used as a search space to construct the highest Semantic Relatedness Vector (SRV) set. For example, a high bigram space will be ideal for a high SRV, which includes low density data

Summary

Introduction

The similarity of words expressed in a natural language is a challenge in NLP in several domains. It is often decomposed into a comparison of the semantic relations between concepts, depending on knowledge-based corpora such as Wikipedia and WordNet. Knowledge-based methods are bound by the content of the terminological resource, while the context might contain additional content not covered by them (Jimeno-Yepes and Aronson, 2012). Such methods derive word-to-word semantic relatedness by searching the taxonomy with four kinds of approaches, namely, path and depth-based, content-based, gloss-based and combined approaches (Aouicha et al., 2016), where similarities can be detected through several semantic features, such as those measured by detecting homonyms, synonyms, meronyms, holonyms, hypernyms, hyponyms and antonyms (Alzahrani, 2016; De Luca and Nürnberger, 2006).
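
To make the gloss-based family concrete, the sketch below counts the content words shared by two dictionary definitions, which is the basic idea behind gloss overlap. It is a minimal illustration and not the method of any cited work; the function name `gloss_overlap`, the small stopword list, and the two example glosses are all hypothetical and are not taken from WordNet.

```python
# Tiny stopword list, assumed here for illustration only.
STOPWORDS = frozenset({"a", "an", "the", "of", "or", "in", "and", "to", "that", "for"})


def gloss_overlap(gloss_a, gloss_b, stopwords=STOPWORDS):
    """Return the number of distinct non-stopword tokens shared by two glosses."""
    words_a = {w for w in gloss_a.lower().split() if w not in stopwords}
    words_b = {w for w in gloss_b.lower().split() if w not in stopwords}
    return len(words_a & words_b)


# Hypothetical glosses written for this example.
gloss_bank = "a financial institution that accepts deposits"
gloss_deposit = "money placed in a financial institution"
overlap = gloss_overlap(gloss_bank, gloss_deposit)
print(overlap)
```

Because only exact shared tokens count, two glosses that describe related concepts with different words score zero, which is one way such measures can miss relatedness features.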
