ID for data with multiple clusters

Ismail Ari, A T Cemgil, L Akarun

doi:10.1109/siu.2013.6531308

Abstract

Interpolative decomposition (ID) is a matrix factorization that represents a data matrix via a subset of its own columns. The selected columns are expected to hold the salient features of the data. A very common ID approach in the literature is based on importance sampling, where a statistical leverage score is computed for each column and K columns are randomly selected according to these scores. These randomized methods aim for a better low-rank approximation of the matrix by seeking the columns that best express its range. This makes ID a good alternative to Singular Value Decomposition (SVD), since it favors sparsity and the bases correspond to real data points. However, the columns leading to the best low-rank approximation are usually not the most representative ones if the underlying data is composed of several clusters, which is very common in real life. In this paper, we introduce an alternative ID approach based on clustering. We employ K-medoids as an ID method for better interpretability and representativeness. We apply ID to handwritten digit recognition and compare the proposed approach with the state-of-the-art method in the literature. We show its superiority in terms of representativeness of the data. We demonstrate that most of the data can be discarded without compromising the accuracy.
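
The two column-selection strategies contrasted in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `rank` parameter, the use of Euclidean distance, and the simple alternating medoid update are all assumptions made for illustration. Data points are assumed to be the columns of `X`.

```python
import numpy as np

def leverage_scores(X, rank):
    """Normalized leverage score of each column of X (assumed definition:
    squared norm of the column's coordinates in the top-`rank` right
    singular subspace)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    scores = np.sum(Vt[:rank, :] ** 2, axis=0)
    return scores / scores.sum()

def sample_columns(X, k, rank, rng):
    """Randomly pick k distinct column indices with probability
    proportional to the leverage scores."""
    p = leverage_scores(X, rank)
    return rng.choice(X.shape[1], size=k, replace=False, p=p)

def k_medoids_columns(X, k, rng, n_iter=100):
    """Pick k column indices as medoids using a simple alternating scheme
    (assign each column to its nearest medoid, then re-pick each medoid)."""
    n = X.shape[1]
    # Pairwise Euclidean distances between columns.
    sq = np.sum(X ** 2, axis=0)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0))
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # New medoid: the member with the smallest total distance
                # to the other members of its cluster.
                within = D[np.ix_(members, members)].sum(axis=0)
                new[c] = members[np.argmin(within)]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids

def interpolative_coefficients(X, idx):
    """Least-squares coefficients Z such that X is approximated by X[:, idx] @ Z."""
    Z, *_ = np.linalg.lstsq(X[:, idx], X, rcond=None)
    return Z
```

Either index set can be passed to `interpolative_coefficients` to form the approximation `X[:, idx] @ Z`. The sketch only contrasts how the columns are chosen; it makes no claim about which choice performs better on a given data set.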

Full Text