A comparative study of data-dependent approaches without learning in measuring similarities of data objects

Sunil Aryal, Gholamreza Haffari, Kai Ming Ting, Takashi Washio

doi:10.1007/s10618-019-00660-0

Abstract

Conventional general-purpose distance-based similarity measures, such as Minkowski distance (also known as the $$\ell_p$$-norm with $$p>0$$), are data-independent and sensitive to the units or scales of measurement. Existing general-purpose data-dependent measures, such as rank difference, Lin’s probabilistic measure and $$m_p$$-dissimilarity ($$p>0$$), are not sensitive to units or scales of measurement. Although they have been shown to be more effective than the traditional distance measures, their characteristics and relative performance have not been investigated. In this paper, we study the characteristics and relationships of different general-purpose data-dependent measures. We generalise $$m_p$$-dissimilarity to $$p\ge 0$$ by introducing $$m_0$$-dissimilarity and show that it is a generic data-dependent measure with data-dependent self-similarity, of which rank difference and Lin’s measure are special cases with data-independent self-similarity. We evaluate the effectiveness of a wide range of general-purpose data-dependent and data-independent measures in content-based information retrieval and kNN classification tasks. Our findings show that the fully data-dependent measure of $$m_p$$-dissimilarity is a more effective alternative to other data-dependent and commonly used distance-based similarity measures, as its task-specific performance is more consistent across a wide range of datasets.
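The contrast between data-independent and data-dependent measures can be illustrated with a minimal sketch. The abstract does not give the definition of $$m_p$$-dissimilarity, so the function `mass_dissimilarity` below is an assumption: it is a simplified, hypothetical mass-based measure that, in each attribute, counts the fraction of data points lying between the two objects and then aggregates these fractions with a power mean. The toy dataset and the rescaling factor are likewise illustrative and not taken from the paper.

```python
def minkowski(x, y, p=2.0):
    """Minkowski (l_p) distance; depends only on x and y, not on the data."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def mass_dissimilarity(data, x, y, p=2.0):
    """Hypothetical mass-based dissimilarity (an assumption, not the paper's
    exact definition): per attribute, the fraction of data points that fall
    in the closed interval spanned by x and y, aggregated by a power mean."""
    n, d = len(data), len(x)
    total = 0.0
    for i in range(d):
        lo, hi = min(x[i], y[i]), max(x[i], y[i])
        count = sum(1 for row in data if lo <= row[i] <= hi)
        total += (count / n) ** p
    return (total / d) ** (1.0 / p)


# Toy dataset (illustrative values only): two attributes, four objects.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [10.0, 400.0]]

# Change the unit of the first attribute (for example, km to m).
scaled = [[row[0] * 1000.0, row[1]] for row in data]

d_raw = minkowski(data[0], data[1])
d_scaled = minkowski(scaled[0], scaled[1])

m_raw = mass_dissimilarity(data, data[0], data[1])
m_scaled = mass_dissimilarity(scaled, scaled[0], scaled[1])

print("Minkowski:", d_raw, "->", d_scaled)   # changes with the unit
print("Mass-based:", m_raw, "->", m_scaled)  # unchanged by the unit
```

Because the mass-based sketch depends only on the ordering of values within each attribute, any positive rescaling of an attribute leaves it unchanged, whereas the Minkowski distance changes with the unit.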

Full Text