Combining statistical similarity measures for automatic induction of semantic classes

A Pangos,A Potamianos,E Fosler-Lussier,E Iosif

doi:10.1109/asru.2005.1566510

Abstract

This paper proposes an unsupervised semantic class induction algorithm based on the principle that similarity of context implies similarity of meaning. Two semantic similarity metrics, both variations of the vector product distance, are used to measure the semantic distance between words and to automatically generate semantic classes. The first metric computes wide-context similarity between words using a bag-of-words model, while the second computes narrow-context similarity using a bigram language model. A hybrid metric, defined as the linear combination of the wide- and narrow-context metrics, is also proposed and evaluated. Words are clustered into semantic classes with an iterative clustering algorithm. The semantic metrics are evaluated on two corpora: a semantically heterogeneous Web news domain (HR-Net) and an application-specific travel reservation corpus (ATIS). For the hybrid metric, semantic class member precision of 85% is achieved at 17% recall on the HR-Net task and at 55% recall on the ATIS task.
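As a rough illustration of how a hybrid metric of this kind can be formed, the sketch below linearly combines a wide-context (bag-of-words) similarity with a narrow-context (bigram neighbor) similarity. It is a minimal sketch under stated assumptions, not the authors' implementation: plain cosine similarity over raw counts stands in for the paper's vector product variants, and the function names, the sentence-level context window, and the weight `lam` are all hypothetical choices made for this example.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (assumed stand-in metric)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def wide_context(sentences, word):
    """Bag-of-words context: counts of all other words in sentences containing `word`."""
    ctx = Counter()
    for s in sentences:
        if word in s:
            ctx.update(w for w in s if w != word)
    return ctx


def narrow_context(sentences, word):
    """Bigram context: counts of the immediate left and right neighbors of `word`."""
    ctx = Counter()
    for s in sentences:
        padded = ["<s>"] + s + ["</s>"]  # sentence boundary markers
        for i in range(1, len(padded) - 1):
            if padded[i] == word:
                ctx[("L", padded[i - 1])] += 1
                ctx[("R", padded[i + 1])] += 1
    return ctx


def hybrid_similarity(sentences, w1, w2, lam=0.5):
    """Linear combination of wide- and narrow-context similarity; `lam` is hypothetical."""
    wide = cosine(wide_context(sentences, w1), wide_context(sentences, w2))
    narrow = cosine(narrow_context(sentences, w1), narrow_context(sentences, w2))
    return lam * wide + (1 - lam) * narrow
```

With tokenized sentences such as `[["fly", "to", "boston", "today"], ["fly", "to", "denver", "today"]]`, calling `hybrid_similarity(sentences, "boston", "denver")` scores the two city names as highly similar because they share both their sentence-level and their immediate neighbors. Word pairs scoring above some threshold could then be passed to a clustering step.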

Full Text