Abstract

Amid the rapid development of machine learning, emerging applications raise the challenge of using domain knowledge to improve clustering accuracy, since clustering suffers from a compromised accuracy rate despite the advantage of fast processing. In this paper, we model domain knowledge (i.e., background knowledge or side information) for such applications as must-link and cannot-link sets, so that it can be combined with k-means for better accuracy. We first propose an algorithm for constrained k-means that considers only must-links. The key idea is to treat a set of data points constrained by must-links as a single data point whose weight equals the sum of the weights of the constrained points. Then, to cluster data points under cannot-link constraints, we employ minimum-weight matching to assign the points to the existing clusters. Finally, we carried out numerical simulations to evaluate the proposed algorithms on the UCI datasets, demonstrating that our method outperforms previous algorithms for constrained k-means as well as traditional k-means in clustering accuracy, albeit with a slightly higher practical runtime.

Highlights

  • As one of the most renowned unsupervised machine learning methods, clustering has been widely used in many research areas, including data science and natural language processing.

  • We propose an algorithm for the constrained k-means clustering problem with must-link and cannot-link constraints.

  • We propose a framework to incorporate must-link and cannot-link constraints into the k-means++ algorithm; devise a method that merges each set of data points confined by must-links into a single weighted point and clusters cannot-linked points via a novel use of minimum-weight matching; and carry out experiments on the UCI datasets to evaluate the practical performance of the proposed algorithms, demonstrating that they outperform the previous algorithm by 65% regarding the accuracy rate.
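The cannot-link step described above, assigning a set of mutually cannot-linked points to distinct clusters via minimum-weight matching, can be sketched as a bipartite assignment problem solved with the Hungarian algorithm. This is an illustrative sketch, not the authors' implementation; the function name and cost choice (squared distance to each centroid) are assumptions.

```python
# Sketch: assign mutually cannot-linked points to distinct existing clusters
# by minimum-weight matching. cost[i, j] is the squared Euclidean distance
# from point i to centroid j; linear_sum_assignment finds the matching that
# minimizes the total cost, so no two points share a cluster.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_cannot_link(points, centroids):
    """Return a dict mapping each cannot-linked point index to a distinct
    cluster index, minimizing the total squared distance."""
    cost = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)  # min-weight perfect matching
    return dict(zip(rows.tolist(), cols.tolist()))

# Example: three mutually cannot-linked points and three cluster centroids.
points = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
centroids = np.array([[0.1, 9.8], [9.9, 0.2], [0.0, 0.1]])
print(assign_cannot_link(points, centroids))  # → {0: 2, 1: 1, 2: 0}
```

A greedy nearest-centroid assignment could send two cannot-linked points to the same cluster; the matching formulation avoids this by construction.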


Summary

Introduction

As one of the most renowned unsupervised machine learning methods, clustering has been widely used in many research areas, including data science and natural language processing. It is attracting numerous research interests from both academic and industrial communities. Many applications in machine learning and data mining involve processing large amounts of data, much of which is unlabeled, and manually labeling these data would in most cases consume a lot of time and resources. If two sample points x1 and x2 are under a must-link constraint, they must be clustered into the same cluster; in contrast, if they satisfy a cannot-link constraint, they cannot be clustered into the same cluster. The objective is to minimize the within-cluster sum of squares, \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2, where C_1, \dots, C_k are the clusters and \mu_i is the centroid of cluster C_i.
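The paper's key idea for must-links, collapsing each must-linked set into a single point weighted by its size, can be sketched as a weighted variant of a Lloyd iteration. This is a minimal illustration under assumed names and data layout, not the authors' code; a full algorithm would iterate the step to convergence and handle cannot-links separately.

```python
# Sketch: merge each must-link group into one weighted representative point,
# then run a weighted k-means (Lloyd) step over the merged points.
import numpy as np

def merge_must_link(X, groups):
    """Collapse each must-link group (list of row indices into X) into its
    mean, with a weight equal to the number of merged points."""
    reps = np.array([X[g].mean(axis=0) for g in groups])
    weights = np.array([len(g) for g in groups], dtype=float)
    return reps, weights

def weighted_kmeans_step(reps, weights, centroids):
    """One assignment + weighted centroid update over the merged points."""
    # Squared distance from every representative to every centroid.
    d2 = ((reps[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        mask = labels == j
        if mask.any():
            # Weighted mean, so a merged group of m points counts m times.
            new_centroids[j] = np.average(reps[mask], axis=0, weights=weights[mask])
    return labels, new_centroids
```

Because a group's representative carries the group's total weight, the centroid update is identical to running plain k-means on the original points with the must-link assignment enforced.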

Related Work
Our Results
Organization
Preliminaries and Problem Statement
Constrained k-Means Clustering Algorithm with Incidental Information
Experimental Evaluation
Evaluation Approaches
Experimental Dataset and Statistics Information
Comparison of Practical Performance
Comparison of Runtime
Conclusions