Abstract
Constrained spectral clustering (CSC) can greatly improve clustering accuracy by incorporating constraint information into spectral clustering, and it has therefore received wide academic attention. In this paper, we propose a fast CSC algorithm that encodes landmark-based graph construction into a new CSC model and applies random sampling to reduce the data size after spectral embedding. Compared with the original model, the new algorithm yields asymptotically similar results as its model size increases; compared with the most efficient CSC algorithm known, the new algorithm runs faster and is suitable for a wider range of data sets. We also propose a scalable semisupervised cluster ensemble algorithm that combines our fast CSC algorithm with dimensionality reduction by random projection in the process of spectral ensemble clustering. Through theoretical analysis and empirical results, we demonstrate that the new cluster ensemble algorithm has advantages in both efficiency and effectiveness. Furthermore, the approximate preservation of clustering accuracy under random projection, proved for the consensus clustering stage, also holds for weighted k-means clustering, and thus gives a theoretical guarantee for this special kind of k-means clustering in which each point has its own weight.
Highlights
With the arrival of the big data era, data has become an important asset.
Our contributions can be divided into three parts: first, a fast constrained spectral clustering (CSC) algorithm that is suitable for a wide range of data sets; second, an analysis of the effect of random projection on spectral ensemble clustering; third, a scalable semisupervised cluster ensemble algorithm.
To handle large-scale data sets, we propose a fast CSC algorithm.
Summary
With the arrival of the big data era, data has become an important asset. Analysing large-scale data efficiently is becoming a major challenge [1, 2]. Our contributions can be divided into three parts: first, a fast CSC algorithm that is suitable for a wide range of data sets; second, an analysis of the effect of random projection on spectral ensemble clustering; third, a scalable semisupervised cluster ensemble algorithm. (i) We propose a fast CSC algorithm whose space and time complexities are linear in the size of the data set: we compress the original model proposed by Cucuringu et al. [17] by encoding landmark-based graph construction into it, and we improve efficiency further via random sampling in the k-means clustering step.
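The two efficiency ideas named in (i) can be illustrated with a minimal sketch. This is not the authors' algorithm: it omits the constraint information of the CSC model entirely and only shows, under stated assumptions, how a landmark-based graph keeps the spectral embedding linear in the number of points, and how k-means can be run on a random sample of the embedded points. The function names, the uniform landmark selection, the Gaussian bandwidth heuristic, and all parameter names are hypothetical choices made for illustration.

```python
import numpy as np


def landmark_embedding(X, n_landmarks, n_neighbors, k, rng):
    """Spectral embedding from a landmark-based graph (unconstrained sketch)."""
    n = X.shape[0]
    # Assumption: landmarks are picked uniformly at random from the data.
    landmarks = X[rng.choice(n, size=n_landmarks, replace=False)]
    # Squared distances between every point and every landmark (n x p).
    d2 = (X ** 2).sum(axis=1)[:, None] + (landmarks ** 2).sum(axis=1)[None, :]
    d2 = np.maximum(d2 - 2.0 * X @ landmarks.T, 0.0)
    sigma2 = d2.mean() + 1e-12  # assumed bandwidth heuristic
    # Keep only the n_neighbors nearest landmarks of each point.
    nearest = np.argpartition(d2, n_neighbors - 1, axis=1)[:, :n_neighbors]
    rows = np.arange(n)[:, None]
    Z = np.zeros_like(d2)
    Z[rows, nearest] = np.exp(-d2[rows, nearest] / (2.0 * sigma2))
    Z /= Z.sum(axis=1, keepdims=True)
    # Scale columns so that Z_hat @ Z_hat.T is a normalised affinity matrix.
    Z_hat = Z / np.sqrt(np.maximum(Z.sum(axis=0), 1e-12))
    # SVD of an n x p matrix costs O(n p^2), i.e. linear in n for fixed p.
    U, _, _ = np.linalg.svd(Z_hat, full_matrices=False)
    return U[:, :k]


def nearest_center(E, centers):
    """Index of the closest center for every row of E."""
    d2 = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def sampled_kmeans(E, k, sample_size, rng, n_iter=50):
    """Run Lloyd's k-means on a random sample, then label all points."""
    sample = E[rng.choice(E.shape[0], size=sample_size, replace=False)]
    # Assumption: plain random initialisation from the sample.
    centers = sample[rng.choice(sample_size, size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_center(sample, centers)
        for j in range(k):
            members = sample[labels == j]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    return nearest_center(E, centers)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 0.1, (100, 2)),
                   rng.normal(5.0, 0.1, (100, 2))])
    E = landmark_embedding(X, n_landmarks=30, n_neighbors=3, k=2, rng=rng)
    labels = sampled_kmeans(E, k=2, sample_size=50, rng=rng)
    print(np.bincount(labels, minlength=2))
```

In this sketch the full n x n affinity matrix is never formed; only the n x p landmark representation is stored, which is what keeps memory and time linear in n when p is fixed. The random initialisation is deliberately simple and can leave a cluster empty; a production version would likely use a more careful seeding.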