Label Propagated Nonnegative Matrix Factorization for Clustering

Long Lan, Tongliang Liu, Chuanfu Xu, Xiang Zhang, Zhigang Luo

doi:10.1109/tkde.2020.2982387

Abstract

Semi-supervised learning (SSL), which uses plentiful unlabeled examples to boost the performance of learning from a limited number of labeled examples, is a powerful learning paradigm with wide real-world applications such as information retrieval and document clustering. Label propagation (LP) is a popular SSL method that propagates labels through the dataset along high-density areas defined by unlabeled examples, but it is sensitive to bridge examples. Semi-supervised K-Means uses labeled examples to initialize clustering centers that separate different examples; however, it fails under class imbalance, that is, when the number of examples per class varies significantly. This paper proposes a novel label propagated nonnegative matrix factorization method (LPNMF) to handle cleanly labeled but biased data, and its extension LPNMF-E to handle noisily labeled data, both based on the framework of NMF. LPNMF decomposes the whole dataset into the product of a basis matrix and a coefficient matrix. To propagate labels to unlabeled examples, LPNMF regards the class indicators of labeled examples as their coefficients and iteratively updates both the basis matrix and the coefficients of unlabeled examples. LPNMF combines the merits of semi-supervised K-Means and label propagation to offset their respective shortcomings. Specifically, on the one hand, LPNMF learns representative clustering centers based on the distribution of the dataset, similar to semi-supervised K-Means, and is thus robust to bridge examples. On the other hand, LPNMF propagates labels according to the affinity between examples, similar to label propagation, and thus alleviates the bias problem. Moreover, we introduce an extension, LPNMF-E, to handle the noisy-label case. LPNMF-E relaxes the constraint on labeled examples. Because each labeled example also receives label information from the global distribution of the whole dataset and from the local manifold of its neighbors, LPNMF-E outputs reliable class indicators even if a portion of the examples is incorrectly labeled. Theoretical analyses of the generalization ability of the proposed models are also provided. Experimental results on both cleanly and noisily labeled datasets confirm the effectiveness of LPNMF and LPNMF-E compared with both LP and representative semi-supervised K-Means algorithms.
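
The core mechanism described in the abstract (factorize the data into a basis matrix and a coefficient matrix, fix the coefficients of labeled examples to their class indicators, and iteratively update the basis and the coefficients of unlabeled examples) can be illustrated with a minimal Python sketch. This is not the authors' objective or update rule: it uses the standard multiplicative updates for Frobenius-norm NMF as a stand-in, and it omits both the affinity-based term and the LPNMF-E relaxation for noisy labels. The function name `clamped_nmf`, the convention of marking unlabeled examples with `-1`, and the initialization of the basis from labeled class means are all assumptions made for illustration.

```python
import numpy as np

def clamped_nmf(X, labels, n_classes, n_iter=200, eps=1e-9):
    """Simplified sketch: NMF X ~ W @ H with labeled columns of H clamped.

    X        : (d, n) nonnegative data matrix, one example per column.
    labels   : length-n integer array, class id for labeled examples and
               -1 for unlabeled ones (hypothetical convention).
    Assumes every class has at least one labeled example.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    labeled = labels >= 0
    n = X.shape[1]

    # Assumption: initialize each basis column as the mean of the labeled
    # examples of that class (not taken from the paper).
    W = np.stack(
        [X[:, labels == k].mean(axis=1) for k in range(n_classes)], axis=1
    ) + eps

    # Coefficients: one-hot class indicators for labeled examples,
    # uniform values for unlabeled examples.
    H = np.full((n_classes, n), 1.0 / n_classes)
    H[:, labeled] = 0.0
    H[labels[labeled], np.flatnonzero(labeled)] = 1.0

    for _ in range(n_iter):
        # Multiplicative update of the coefficients; only the unlabeled
        # columns are overwritten, so labeled indicators stay fixed.
        H_new = H * (W.T @ X) / (W.T @ W @ H + eps)
        H[:, ~labeled] = H_new[:, ~labeled]
        # Multiplicative update of the basis matrix.
        W *= (X @ H.T) / (W @ H @ H.T + eps)

    # Predicted class = largest coefficient in each column.
    return W, H, H.argmax(axis=0)

# Toy usage: two well-separated nonnegative clusters, one label per class.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.2], [0.2, 5.0]])  # columns are class centers
true = np.repeat([0, 1], 20)
X_demo = centers[:, true] + 0.1 * rng.random((2, 40))
y_demo = np.full(40, -1)
y_demo[0], y_demo[20] = 0, 1
W_demo, H_demo, pred_demo = clamped_nmf(X_demo, y_demo, n_classes=2)
```

In this sketch the labeled columns of `H` never change, so the basis is pulled toward the labeled examples while the unlabeled coefficients adapt to it; reading off the largest coefficient per column gives a cluster assignment.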

Full Text