Abstract

This paper presents a novel family of ontology-based similarity measures grounded in Information Content (IC) theory, a detailed review of the state of the art, a large experimental survey of ontology-based similarity measures on WordNet, and a new comparison between intrinsic and corpus-based IC models. Our experiments rely on our own implementation of a large set of similarity measures and of intrinsic and corpus-based IC models, which are evaluated on two known datasets and three different WordNet versions. The new measures are the weighted Jiang–Conrath distance (wJ&Cdist) and similarity (wJ&Csim), the cosine-normalized Jiang–Conrath similarity (cosJ&Csim), and the cosine-normalized weighted Jiang–Conrath similarity (coswJ&Csim). Two of our similarity measures outperform the state-of-the-art measures on the RG65 dataset, and one of them ranks third overall across all the datasets and evaluated WordNet versions. The cosine-normalized similarity measures are non-linear normalizations of the classic Jiang–Conrath (J&C) distance and the new wJ&C distance, respectively. The wJ&C distance, in turn, is a generalization of the classic J&C distance, defined as the length of the shortest path between concepts within an IC-based weighted graph. Our measures rest on two notions not previously considered: (1) a generalization of the classic J&C distance to any type of taxonomy, based on an IC-based weighted graph derived from the conditional probabilities between child and parent concepts, and (2) a non-linear normalization function that converts ontology-based semantic distances into similarity functions. Finally, the corpus-based IC models based on the Resnik method obtain results that rival those of the state-of-the-art intrinsic IC models when they are used with some previously unexplored WordNet-based frequency files. This finding allows us to reconsider some previous conclusions about the superiority of intrinsic IC models over corpus-based ones.
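The shortest-path idea behind the wJ&C distance can be illustrated with a minimal sketch. It is not the authors' implementation: the toy taxonomy, the probability values, and all function names are hypothetical, and the edge weight -log p(child | parent) is an assumed reading of "IC-based weighted graph derived from the conditional probabilities". The cosine normalization is not sketched because the abstract does not give its formula.

```python
import heapq
import math

# Hypothetical toy taxonomy: child -> [(parent, p(child | parent))].
# The probability values are invented for illustration only.
EDGES = {
    "animal": [("entity", 0.5)],
    "plant": [("entity", 0.5)],
    "dog": [("animal", 0.4)],
    "cat": [("animal", 0.6)],
}


def build_graph(edges):
    """Build an undirected graph whose edge weights are -log p(child | parent).

    Assumption: this weight is one plausible reading of the IC-based
    weighting described in the abstract.
    """
    graph = {}
    for child, parents in edges.items():
        for parent, prob in parents:
            weight = -math.log(prob)
            graph.setdefault(child, []).append((parent, weight))
            graph.setdefault(parent, []).append((child, weight))
    return graph


def shortest_path_length(graph, source, target):
    """Dijkstra's algorithm: length of the shortest weighted path."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale heap entry
        for neighbour, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbour, math.inf):
                best[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return math.inf


def classic_jc_distance(ic, c1, c2, lcs):
    """Classic J&C distance: IC(c1) + IC(c2) - 2 * IC(LCS(c1, c2))."""
    return ic[c1] + ic[c2] - 2.0 * ic[lcs]


GRAPH = build_graph(EDGES)

# Marginal probabilities implied by the toy conditional probabilities,
# with p(entity) = 1 at the root, and IC(c) = -log p(c).
PROB = {"entity": 1.0, "animal": 0.5, "plant": 0.5, "dog": 0.2, "cat": 0.3}
IC = {concept: -math.log(p) for concept, p in PROB.items()}

weighted = shortest_path_length(GRAPH, "dog", "cat")
classic = classic_jc_distance(IC, "dog", "cat", "animal")
print(f"weighted path length: {weighted:.4f}, classic J&C: {classic:.4f}")
```

On this tree-shaped toy taxonomy the two values coincide, because the weights along the path through the lowest common subsumer sum to the IC differences. In a taxonomy with multiple inheritance the shortest path may follow a different route, which is where the two definitions could differ.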
