Local homogeneous consistent safe semi-supervised clustering

Haitao Gan, Yingle Fan, Zhizeng Luo, Qizhong Zhang

doi:10.1016/j.eswa.2017.12.046

Abstract

Semi-supervised clustering generally assumes that prior knowledge helps to improve clustering performance. However, prior knowledge may degrade clustering performance if it contains wrong information, such as wrong labels. Hence, it is meaningful to design a safe semi-supervised clustering method which never performs worse than the corresponding unsupervised and semi-supervised clustering methods. In this paper, we develop local homogeneous consistent safe semi-supervised clustering, where class labels are given as the prior knowledge. To the best of our knowledge, this is the first time safe semi-supervised clustering has been studied. The basic idea is that the predictions of a labeled sample and its nearest homogeneous unlabeled samples should be similar when the labeled sample is risky. In our algorithm, we first build a local graph to model the relationship between each labeled sample and its nearest homogeneous unlabeled samples, using the results obtained by unsupervised clustering. A graph-based regularization term is then constructed to pull the predictions of the labeled samples toward those of their local homogeneous neighbors, which is expected to reduce the risk of the labeled samples. Meanwhile, our algorithm positively exploits the labeled samples by restricting the corresponding outputs to be the given class labels when the labeled samples may be helpful. In this sense, the predictions of the labeled samples in our algorithm are a tradeoff between the given class labels and the predictions of the local homogeneous neighbors. To verify the effectiveness of our algorithm, we conduct a series of experiments on several UCI datasets. The results show that our algorithm outperforms the corresponding unsupervised and semi-supervised clustering methods even when the proportion of wrongly labeled samples reaches 30%. The proposed algorithm will therefore not only enrich the theoretical knowledge in the machine learning field, but also significantly improve the practicability of semi-supervised clustering in expert and intelligent systems.
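
The tradeoff described in the abstract can be illustrated with a minimal sketch. It is not the authors' objective function or implementation, only a toy instance of the idea under stated assumptions: the function names, the parameters `k`, `lam_label` and `lam_graph`, the uniform neighbor weights, and the squared-error form of the two terms are all hypothetical, and cluster indices from the unsupervised step are assumed to be already aligned with class labels.

```python
import numpy as np


def local_homogeneous_neighbors(X, cluster_ids, labeled_idx, k=3):
    """For each labeled sample, return up to k nearest unlabeled samples
    that the unsupervised clustering placed in the same cluster.

    Assumption: "homogeneous" is read here as "same unsupervised cluster".
    """
    labeled_idx = np.asarray(labeled_idx)
    unlabeled = np.setdiff1d(np.arange(len(X)), labeled_idx)
    neighbors = {}
    for i in labeled_idx:
        # Unlabeled samples sharing the labeled sample's cluster.
        same = unlabeled[cluster_ids[unlabeled] == cluster_ids[i]]
        # Euclidean distance from the labeled sample to each candidate.
        dist = np.linalg.norm(X[same] - X[i], axis=1)
        neighbors[int(i)] = same[np.argsort(dist)[:k]]
    return neighbors


def blended_prediction(y_onehot, U_neighbors, lam_label=1.0, lam_graph=1.0):
    """Closed-form minimizer over u of the hypothetical cost
        lam_label * ||u - y||^2 + lam_graph * mean_j ||u - u_j||^2,
    i.e. a weighted average of the given label and the mean neighbor
    prediction. With no neighbors, the given label is returned unchanged.
    """
    if len(U_neighbors) == 0:
        return y_onehot
    neighbor_mean = U_neighbors.mean(axis=0)
    return (lam_label * y_onehot + lam_graph * neighbor_mean) / (lam_label + lam_graph)
```

In this sketch, a larger `lam_graph` pulls the prediction of a possibly mislabeled sample toward its neighbors, while a larger `lam_label` keeps it close to the given class label.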

Full Text