Abstract

Semantic similarity measures play an important role in the extraction of semantic relations and are widely used in Natural Language Processing (NLP) and Information Retrieval (IR). The work proposed here uses web-based metrics to compute the semantic similarity between words or terms and compares them with the state of the art. For a computer to judge semantic similarity, it must understand the semantics of the words; being a syntactic machine, however, a computer cannot grasp semantics directly, so the semantics must be represented as syntax. Various methods have been proposed to compute the semantic similarity between words. Some use precompiled databases such as WordNet and the Brown Corpus, while others rely on web search engines. The approach presented here differs from both: it makes use of snippets returned by Wikipedia or another encyclopedia such as the Encyclopaedia Britannica. The snippets are preprocessed by removing stop words and stemming; suffix removal follows the algorithm of M. F. Porter. Luhn's idea is then used to extract significant words from the preprocessed snippets. The similarity measures proposed here are based on five association measures from information retrieval, namely the simple matching, Dice, Jaccard, Overlap, and Cosine coefficients. Performance is evaluated on the Miller and Charles benchmark dataset, where the methods achieve a correlation of 0.80, higher than some existing methods.
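As a concrete reference for these five measures, the following is a minimal sketch of the classical set-association coefficients applied to two keyword sets; the word sets below are hypothetical, not data from the paper (all coefficients except simple matching lie in [0, 1], and non-empty sets are assumed):

    import math

    def association_measures(a: set, b: set) -> dict:
        # Five classical IR association coefficients between two
        # significant-word sets (non-empty sets assumed).
        inter = len(a & b)
        return {
            "simple_matching": inter,                      # |A ∩ B|
            "dice": 2 * inter / (len(a) + len(b)),         # 2|A ∩ B| / (|A| + |B|)
            "jaccard": inter / len(a | b),                 # |A ∩ B| / |A ∪ B|
            "overlap": inter / min(len(a), len(b)),        # |A ∩ B| / min(|A|, |B|)
            "cosine": inter / math.sqrt(len(a) * len(b)),  # |A ∩ B| / sqrt(|A|·|B|)
        }

    # Hypothetical significant-word sets for the pair ("car", "automobile")
    print(association_measures(
        {"vehicle", "engine", "wheel", "road", "transport"},
        {"vehicle", "engine", "motor", "transport"},
    ))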

Highlights

  • Semantic similarity is a central concept of great importance in fields such as artificial intelligence, natural language processing, cognitive science, and psychology

  • Word semantic similarity approaches or metrics can be categorized as: (i) precompiled-database metrics, i.e., metrics consulting only human-built knowledge resources such as ontologies; (ii) co-occurrence-based metrics using the WWW, i.e., metrics that assume the semantic similarity between words or terms can be expressed by an association ratio that is a function of their co-occurrence; and (iii) context-based metrics using the WWW, i.e., metrics that are fully text-based and utilize the context or proximity of words or terms to compute semantic similarity

  • This paper presents five different semantic similarity methods

Summary

INTRODUCTION

Semantic similarity is a central concept of great importance in fields such as artificial intelligence, natural language processing, cognitive science, and psychology. Because a computer is a syntactic machine, the semantics associated with words or terms must be represented as syntax, and various approaches to this have been proposed. Several precompiled-database methods in the literature use resources such as WordNet for semantic similarity computation. Danushka Bollegala [6] proposed similarity measures that use the page counts returned by a search engine for a given word pair; page-count-based metrics compute association ratios between words from their co-occurrence frequency in documents.
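As an illustration of this page-count style of metric (related work, not the method proposed in this paper), one measure defined in that line of work is WebJaccard; a minimal sketch, with hypothetical page counts for the queries P, Q, and the conjunction "P AND Q", and an illustrative noise threshold c:

    def web_jaccard(h_p: int, h_q: int, h_pq: int, c: int = 5) -> float:
        # Page-count-based association ratio:
        #   H(P AND Q) / (H(P) + H(Q) - H(P AND Q)),
        # set to 0 when the co-occurrence count is below the noise threshold c.
        if h_pq <= c:
            return 0.0
        return h_pq / (h_p + h_q - h_pq)

    # Illustrative page counts, not real search-engine results
    print(web_jaccard(h_p=1_000_000, h_q=400_000, h_pq=150_000))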

The methods proposed here instead capture the semantics associated with a word from the snippets returned by Wikipedia or the Encyclopaedia Britannica for the given word pair. A syntactic representation of those semantics is obtained through the following three steps, and the similarity between words is then decided from the resulting sets of keywords (a sketch of the pipeline follows the list):
Snippet Extraction
Snippet Preprocessing
Similarity Measures
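A minimal sketch of the first two steps, assuming the snippet text has already been fetched; the stop-word list, the simplified suffix stripper standing in for Porter's full algorithm, and the Luhn frequency cutoffs low and high are all illustrative placeholders:

    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "and", "or", "is", "are", "in", "to", "for", "it"}

    def simple_stem(word: str) -> str:
        # Simplified stand-in for Porter's suffix-stripping algorithm:
        # a few illustrative rules only, not the full rule set.
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def significant_words(snippet: str, low: int = 2, high: int = 50) -> set:
        # Luhn's idea: words whose frequency falls between a lower and an
        # upper cutoff (neither too rare nor too common) are significant.
        tokens = re.findall(r"[a-z]+", snippet.lower())
        stems = [simple_stem(t) for t in tokens if t not in STOPWORDS]
        counts = Counter(stems)
        return {w for w, n in counts.items() if low <= n <= high}

The keyword sets produced this way for the two words of a pair would then be compared in the third step using the association coefficients sketched under the abstract.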
Method
CONCLUSION