This paper introduces a new family of intrinsic and corpus-based Information Content (IC) models for ontology-based similarity measures grounded in IC theory, together with a detailed review of the state of the art, an experimental survey of IC models and IC-based similarity measures on WordNet, and a comparison between intrinsic and corpus-based IC models. The family comprises five intrinsic IC models, called CondProbHypo, CondProbUniform, CondProbLeaves, CondProbLogistic, and CondProbCosine, and one corpus-based IC model, called CondProbCorpus. The proposed IC models rely on two previously unconsidered notions: (1) the preservation of the probabilistic structure of the taxonomy associated with the conditional probabilities between child and parent concepts, and (2) the explicit consideration of a cognitive similarity notion in the definition of the IC model. The family also defines a general method for deriving new intrinsic IC models by exploring alternative intrinsic estimates of the conditional probabilities between child and parent concepts. Our work is inspired by a previously unexplored relationship between the Jiang–Conrath distance and the shortest path on an IC-based weighted graph derived from the conditional probabilities between concepts, as well as by cognitive evidence about the perceived distance between concepts. The new IC models obtain results comparable to the state of the art and satisfy a set of well-founded structural axioms. In addition, we show that most intrinsic IC models and IC-based similarity measures exhibit no statistically significant difference with respect to a baseline corpus-based IC model and the Jiang–Conrath similarity, with the exception of the Sánchez et al. (2012) IC model and the cosJ&C similarity measure (the latter recently introduced by the authors), both of which outperform the others overall.
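The relationship between the Jiang–Conrath distance and a shortest path on an IC-weighted graph can be illustrated with a minimal sketch. It is not the paper's implementation: the toy taxonomy, its conditional probabilities, and all function names below are hypothetical, and the taxonomy is assumed to be a tree in which each edge is weighted by -log P(child | parent) and p(c) is the product of the conditional probabilities on the path from the root. Under those assumptions, IC(a) + IC(b) - 2 IC(LCS(a, b)) coincides with the weighted path length between a and b.

```python
import heapq
import math

# Hypothetical toy tree taxonomy: child -> (parent, P(child | parent)).
# These values are illustrative assumptions, not taken from WordNet.
COND_PROB = {
    "animal": ("entity", 0.5),
    "plant": ("entity", 0.5),
    "dog": ("animal", 0.25),
    "cat": ("animal", 0.75),
}
ROOT = "entity"


def ancestors(concept):
    """Return the path from `concept` up to the root, both inclusive."""
    path = [concept]
    while path[-1] != ROOT:
        path.append(COND_PROB[path[-1]][0])
    return path


def ic(concept):
    """IC(c) = -log p(c), where p(c) is the product of the conditional
    probabilities along the path from the root down to c (IC(root) = 0)."""
    return sum(-math.log(COND_PROB[c][1]) for c in ancestors(concept)[:-1])


def jc_distance(a, b):
    """Jiang-Conrath distance: IC(a) + IC(b) - 2 * IC(LCS(a, b))."""
    ancestors_of_b = set(ancestors(b))
    # In a tree, the first ancestor of a that also subsumes b is the LCS.
    lcs = next(c for c in ancestors(a) if c in ancestors_of_b)
    return ic(a) + ic(b) - 2.0 * ic(lcs)


def shortest_path_length(a, b):
    """Dijkstra on the undirected graph whose edges are weighted by
    -log P(child | parent)."""
    graph = {}
    for child, (parent, prob) in COND_PROB.items():
        weight = -math.log(prob)
        graph.setdefault(child, []).append((parent, weight))
        graph.setdefault(parent, []).append((child, weight))
    dist = {a: 0.0}
    heap = [(0.0, a)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == b:
            return d
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for neighbour, weight in graph[node]:
            candidate = d + weight
            if candidate < dist.get(neighbour, math.inf):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return math.inf


if __name__ == "__main__":
    for pair in [("dog", "cat"), ("dog", "plant")]:
        print(pair, jc_distance(*pair), shortest_path_length(*pair))
```

In a tree the path between two concepts is unique, so the agreement here follows directly from the telescoping sum of edge weights; on a taxonomy with multiple inheritance the two quantities need not coincide without further assumptions.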