Abstract
Amid the tremendous development of machine learning technology, emerging applications increasingly demand that domain knowledge be used to improve the accuracy of clustering, since clustering, despite the advantage of fast processing, often suffers a compromised accuracy rate. In this paper, we model the domain knowledge (i.e., background knowledge or side information) arising in such applications as must-link and cannot-link sets, so that it can be incorporated into k-means for better accuracy. We first propose an algorithm for constrained k-means that considers only must-links. The key idea is to treat a set of data points constrained by must-links as a single data point whose weight equals the sum of the weights of the constrained points. Then, to cluster the data points involved in cannot-links, we employ minimum-weight matching to assign them to the existing clusters. Finally, we carried out numerical simulations on the UCI datasets to evaluate the proposed algorithms, demonstrating that our method outperforms both the previous algorithms for constrained k-means and traditional k-means in clustering accuracy rate, at the cost of a slightly longer practical runtime.
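The must-link step described above can be sketched in a few lines: each must-link group collapses into one representative point whose weight is the sum of the weights of its members. This is a minimal stdlib sketch under the assumption that the merged point is the group centroid and all original points have unit weight; the function name `merge_must_links` is illustrative, not from the paper.

```python
def merge_must_links(points, must_link_groups):
    """Collapse each must-link group into a single weighted point.

    points: list of coordinate tuples.
    must_link_groups: list of index lists, one list per must-link set.
    Returns (point, weight) pairs: the merged point is the centroid of its
    group and its weight is the group size; unconstrained points keep weight 1.
    """
    used = set()
    weighted = []
    for group in must_link_groups:
        members = [points[i] for i in group]
        # Centroid of the group, coordinate by coordinate.
        centroid = tuple(sum(c) / len(members) for c in zip(*members))
        weighted.append((centroid, len(members)))
        used.update(group)
    for i, p in enumerate(points):
        if i not in used:
            weighted.append((p, 1))
    return weighted

pts = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
print(merge_must_links(pts, [[0, 1]]))
# [((1.0, 0.0), 2), ((10.0, 10.0), 1)]
```

A weighted k-means can then run on the merged points unchanged, using the weights in the centroid updates, which is what makes the reduction attractive.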
Highlights
As one of the most renowned unsupervised machine learning methods, clustering has been widely used in many research areas, including data science and natural language processing
We propose an algorithm for the constrained k-means clustering problem with must-link and cannot-link constraints
Propose a framework to incorporate must-link and cannot-link constraints into the k-means++ algorithm
Devise a method that merges each set of data points confined by must-links into a single weighted point, and clusters the cannot-linked points by novelly employing minimum-weight matching
Carry out experiments on the UCI datasets to evaluate the practical performance of the proposed algorithms, demonstrating that they outperform the previous algorithm at a rate of 65% in terms of accuracy rate
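The matching step in the highlights can be illustrated concretely: a set of mutually cannot-linked points must land in pairwise-distinct clusters, so assigning them to the existing cluster centers is a minimum-weight bipartite matching between points and centers. Below is a hedged sketch that finds the matching by brute force over permutations (adequate for small k, not the paper's actual procedure); names like `assign_cannot_link` are illustrative assumptions.

```python
from itertools import permutations

def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_cannot_link(points, centers):
    """Assign each cannot-linked point to a *distinct* center, minimizing
    total squared distance. Returns a tuple of center indices, one per point.
    Brute-force minimum-weight matching: fine for small numbers of clusters.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(centers)), len(points)):
        cost = sum(sq_dist(p, centers[c]) for p, c in zip(points, perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best

centers = [(0.0, 0.0), (5.0, 5.0)]
cl_points = [(4.0, 4.0), (1.0, 1.0)]
# A greedy nearest-center rule would put both points in one cluster and
# violate the cannot-link; the matching forces distinct clusters.
print(assign_cannot_link(cl_points, centers))  # (1, 0)
```

The point of using a matching rather than greedy nearest-center assignment is exactly the case shown: greedy assignment can send two cannot-linked points to the same cluster, whereas the matching respects the constraint while keeping total cost minimal.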
Summary
As one of the most renowned unsupervised machine learning methods, clustering has been widely used in many research areas, including data science and natural language processing, and it is attracting numerous research interests from both academic and industrial communities. Many applications in machine learning and data mining involve processing large amounts of data, much of which may be unlabeled, and manually labeling these data would in most cases consume a lot of time and resources. If two sample points x1 and x2 are under a must-link constraint, they must be clustered into the same cluster; in contrast, if they satisfy a cannot-link constraint, they cannot be clustered into the same cluster. The within-cluster sum of squares is minimized as in the following:

\[ \min \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2, \]

where \(C_i\) is the i-th cluster and \(\mu_i\) is its centroid.
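The within-cluster sum of squares objective can be computed directly from a clustering, which is useful as a sanity check when comparing algorithms. A minimal stdlib sketch, not the authors' implementation:

```python
def wcss(clusters):
    """Within-cluster sum of squares.

    clusters: list of clusters, each a list of coordinate tuples.
    Sums the squared Euclidean distance of every point to its own
    cluster's centroid.
    """
    total = 0.0
    for cluster in clusters:
        # Centroid of this cluster, coordinate by coordinate.
        mu = tuple(sum(c) / len(cluster) for c in zip(*cluster))
        total += sum(
            sum((x - m) ** 2 for x, m in zip(p, mu)) for p in cluster
        )
    return total

# Two clusters: {(0,0), (2,0)} with centroid (1,0), and the singleton {(5,5)}.
print(wcss([[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]))  # 2.0
```

k-means and its constrained variants all score candidate clusterings against this same quantity; the constraints only restrict which clusterings are admissible.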