Abstract
Amid the tremendous development of machine learning technology, emerging applications increasingly demand that domain knowledge be used to improve the accuracy of clustering, since clustering, despite the advantage of fast processing, often suffers a compromised accuracy rate. In this paper, we model the domain knowledge (i.e., background knowledge or side information) arising in such applications as must-link and cannot-link sets, so that it can be incorporated into k-means for better accuracy. We first propose an algorithm for constrained k-means that considers only must-links. The key idea is to treat a set of data points constrained by must-links as a single data point whose weight equals the sum of the weights of the constrained points. Then, to cluster the data points involved in cannot-links, we employ minimum-weight matching to assign them to the existing clusters. Finally, we carried out numerical simulations on the UCI datasets to evaluate the proposed algorithms, demonstrating that our method outperforms both the previous algorithms for constrained k-means and traditional k-means in clustering accuracy rate, at the cost of a slightly longer practical runtime.
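The must-link step described above can be sketched in a few lines: each must-link group collapses into one representative point whose weight is the sum of the weights of its members. This is a minimal stdlib sketch under the assumption that the merged point is the group centroid and all original points have unit weight; the function name `merge_must_links` is illustrative, not from the paper.

```python
def merge_must_links(points, must_link_groups):
    """Collapse each must-link group into a single weighted point.

    points: list of coordinate tuples.
    must_link_groups: list of index lists, one list per must-link set.
    Returns (point, weight) pairs: the merged point is the centroid of its
    group and its weight is the group size; unconstrained points keep weight 1.
    """
    used = set()
    weighted = []
    for group in must_link_groups:
        members = [points[i] for i in group]
        # Centroid of the group, coordinate by coordinate.
        centroid = tuple(sum(c) / len(members) for c in zip(*members))
        weighted.append((centroid, len(members)))
        used.update(group)
    for i, p in enumerate(points):
        if i not in used:
            weighted.append((p, 1))
    return weighted

pts = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
print(merge_must_links(pts, [[0, 1]]))
# [((1.0, 0.0), 2), ((10.0, 10.0), 1)]
```

A weighted k-means can then run on the merged points unchanged, using the weights in the centroid updates, which is what makes the reduction attractive.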
Highlights
As one of the most renowned unsupervised machine learning methods, clustering has been widely used in many research areas, including data science and natural language processing
We propose an algorithm for the constrained k-means clustering problem with must-link and cannot-link constraints
Propose a framework to incorporate must-link and cannot-link constraints into the k-means++ algorithm
Devise a method that merges each set of data points confined by must-links into a single weighted point, and clusters the cannot-linked points by novelly employing minimum-weight matching
Carry out experiments on the UCI datasets to evaluate the practical performance of the proposed algorithms, demonstrating that they outperform the previous algorithm at a rate of 65% in terms of accuracy rate
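The matching step in the highlights can be illustrated concretely: a set of mutually cannot-linked points must land in pairwise-distinct clusters, so assigning them to the existing cluster centers is a minimum-weight bipartite matching between points and centers. Below is a hedged sketch that finds the matching by brute force over permutations (adequate for small k, not the paper's actual procedure); names like `assign_cannot_link` are illustrative assumptions.

```python
from itertools import permutations

def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_cannot_link(points, centers):
    """Assign each cannot-linked point to a *distinct* center, minimizing
    total squared distance. Returns a tuple of center indices, one per point.
    Brute-force minimum-weight matching: fine for small numbers of clusters.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(centers)), len(points)):
        cost = sum(sq_dist(p, centers[c]) for p, c in zip(points, perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best

centers = [(0.0, 0.0), (5.0, 5.0)]
cl_points = [(4.0, 4.0), (1.0, 1.0)]
# A greedy nearest-center rule would put both points in one cluster and
# violate the cannot-link; the matching forces distinct clusters.
print(assign_cannot_link(cl_points, centers))  # (1, 0)
```

The point of using a matching rather than greedy nearest-center assignment is exactly the case shown: greedy assignment can send two cannot-linked points to the same cluster, whereas the matching respects the constraint while keeping total cost minimal.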
Summary
As one of the most renowned unsupervised machine learning methods, clustering has been widely used in many research areas, including data science and natural language processing, and it is attracting numerous research interests from both academic and industrial communities. Many applications in machine learning and data mining involve processing large amounts of data, much of which may be unlabeled, and manually labeling these data would in most cases consume a lot of time and resources. If two sample points x1 and x2 are under a must-link constraint, they must be clustered into the same cluster; in contrast, if they satisfy a cannot-link constraint, they cannot be clustered into the same cluster. The within-cluster sum of squares is minimized as in the following:

\[ \min \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2, \]

where \(C_i\) is the i-th cluster and \(\mu_i\) is its centroid.
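The within-cluster sum of squares objective can be computed directly from a clustering, which is useful as a sanity check when comparing algorithms. A minimal stdlib sketch, not the authors' implementation:

```python
def wcss(clusters):
    """Within-cluster sum of squares.

    clusters: list of clusters, each a list of coordinate tuples.
    Sums the squared Euclidean distance of every point to its own
    cluster's centroid.
    """
    total = 0.0
    for cluster in clusters:
        # Centroid of this cluster, coordinate by coordinate.
        mu = tuple(sum(c) / len(cluster) for c in zip(*cluster))
        total += sum(
            sum((x - m) ** 2 for x, m in zip(p, mu)) for p in cluster
        )
    return total

# Two clusters: {(0,0), (2,0)} with centroid (1,0), and the singleton {(5,5)}.
print(wcss([[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]))  # 2.0
```

k-means and its constrained variants all score candidate clusterings against this same quantity; the constraints only restrict which clusterings are admissible.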