A Graph Modeling of Semantic Similarity between Words

Marco Álvarez, Seungjin Lim

doi:10.1109/icsc.2007.2

Abstract

Measuring the semantic similarity between pairs of words is considered a fundamental operation in data mining and information retrieval. Nevertheless, developing a computational method whose results come close to human perception remains difficult, partly because similarity is subjective. In this paper, we present a novel algorithm for scoring the semantic similarity between words (SSA). Given two input words w1 and w2, SSA uses their corresponding concepts, relationships, and descriptive glosses available in WordNet to build a rooted weighted graph Gsim. The output score is calculated by exploring the concepts present in Gsim and selecting the minimal distance between any two concepts c1 and c2 of w1 and w2, respectively. The distance combines: 1) the depth of the nearest common ancestor of c1 and c2 in Gsim, 2) the intersection of the descriptive glosses of c1 and c2, and 3) the shortest distance between c1 and c2 in Gsim. SSA achieves a correlation of 0.913 with the human ratings reported by Miller and Charles (1991) for a dataset of 28 pairs of nouns. Furthermore, on the full dataset of 65 pairs presented by Rubenstein and Goodenough (1965), the correlation between SSA results and the known human ratings is 0.903, which is higher than that of all other algorithms reported for the same dataset. These high correlations with human ratings suggest that SSA could be useful in solving several data mining and information retrieval problems.
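The three ingredients of the distance can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy taxonomy `PARENT`, the toy glosses `GLOSS`, the unweighted edges, and the formula in `toy_distance` are all assumptions made for illustration, since the abstract does not state how Gsim is weighted or how the three terms are combined.

```python
# Toy rooted taxonomy (child -> parent): a hypothetical stand-in for Gsim.
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "fruit": "entity",
    "apple": "fruit",
}

# Toy glosses written for this example (not taken from WordNet).
GLOSS = {
    "car": "a motor vehicle with four wheels",
    "bicycle": "a vehicle with two wheels moved by pedals",
    "apple": "a round fruit with red or green skin",
}

STOPWORDS = {"a", "with", "or", "by"}


def ancestors(c):
    """Return the path from concept c up to the root, c first."""
    path = []
    while c is not None:
        path.append(c)
        c = PARENT[c]
    return path


def depth(c):
    """Number of edges between c and the root."""
    return len(ancestors(c)) - 1


def nearest_common_ancestor(c1, c2):
    """Deepest concept that lies above (or equals) both c1 and c2."""
    seen = set(ancestors(c1))
    for a in ancestors(c2):
        if a in seen:
            return a
    return None


def path_length(c1, c2):
    """Shortest path in the tree, assuming every edge has weight 1."""
    nca = nearest_common_ancestor(c1, c2)
    return depth(c1) + depth(c2) - 2 * depth(nca)


def gloss_overlap(c1, c2):
    """Number of non-stopword tokens shared by the two glosses."""
    t1 = set(GLOSS[c1].split()) - STOPWORDS
    t2 = set(GLOSS[c2].split()) - STOPWORDS
    return len(t1 & t2)


def toy_distance(c1, c2):
    """Placeholder combination (an assumption, not the SSA formula):
    a longer path increases the distance, while a deeper common
    ancestor and a larger gloss overlap decrease it."""
    nca_depth = depth(nearest_common_ancestor(c1, c2))
    return path_length(c1, c2) / (1 + nca_depth + gloss_overlap(c1, c2))


def word_distance(concepts1, concepts2):
    """Minimal distance over all pairs of concepts of the two words."""
    return min(toy_distance(a, b) for a in concepts1 for b in concepts2)
```

With this toy data, `toy_distance("car", "bicycle")` is smaller than `toy_distance("car", "apple")`, and `word_distance` picks the closest pair of concepts, mirroring the "minimal distance between any two concepts" step described in the abstract.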

Full Text