We study the semisupervised k-clustering problem in which information is available on whether pairs of objects belong to the same cluster or to different clusters. This information is available either with certainty or with a limited level of confidence. We introduce the pair-wise confidence constraints clustering (PCCC) algorithm, which iteratively assigns objects to clusters while accounting for the information provided on the pairs of objects. Our algorithm uses integer programming for the assignment of objects, which allows us to include relationships as hard constraints that are guaranteed to be satisfied or as soft constraints that can be violated subject to a penalty. This flexibility distinguishes our algorithm from the state of the art, in which all pair-wise constraints are considered hard or all are considered soft. We develop an enhanced multistart approach and a model-size reduction technique for the integer program that contribute to the effectiveness and efficiency of the algorithm. Unlike existing algorithms, our algorithm scales to instances with up to 60,000 objects, 100 clusters, and millions of cannot-link constraints (which are the most challenging constraints to incorporate). We compare the PCCC algorithm with state-of-the-art approaches in an extensive computational study. Even though the PCCC algorithm is more general in its applicability than the state-of-the-art approaches, it outperforms them on instances with all hard or all soft constraints in terms of both run time and various metrics of solution quality. The code of the PCCC algorithm is publicly available on GitHub.

History: Accepted by Ram Ramesh, Area Editor for Data Science and Machine Learning.

Funding: The research of D. S. Hochbaum was supported by the AI Institute NSF Award [Grant 2112533].
Supplemental Material: The software that supports the findings of this study is available within the paper and its Supplemental Information (https://pubsonline.informs.org/doi/suppl/10.1287/ijoc.2023.0419) as well as from the IJOC GitHub software repository (https://github.com/INFORMSJoC/2023.0419). The complete IJOC Software and Data Repository is available at https://informsjoc.github.io/.
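
To illustrate the idea summarized in the abstract (alternating between a constrained assignment step and a center update, with hard pairs that must be satisfied and soft pairs that may be violated at a penalty), the following is a minimal sketch and not the authors' implementation. It makes several assumptions: the integer program is replaced by exhaustive enumeration, which is only feasible for a handful of objects; the multistart approach and the model-size reduction are omitted; and the function names (`assign`, `update_centers`, `pccc_sketch`), the `soft` tuple format, and the numeric `penalty` standing in for a confidence level are all hypothetical.

```python
from itertools import product


def sqdist(p, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def assign(points, centers, must_link=(), cannot_link=(), soft=()):
    """Find the minimum-cost assignment of points to centers.

    Stand-in for the integer program: tries every labeling (assumption,
    only practical for tiny instances).
    must_link / cannot_link: hard pairs (i, j) that may never be violated.
    soft: tuples (i, j, kind, penalty), kind is "ML" or "CL" (hypothetical).
    """
    best_cost, best = float("inf"), None
    for labels in product(range(len(centers)), repeat=len(points)):
        # Hard constraints: discard any labeling that violates one.
        if any(labels[i] != labels[j] for i, j in must_link):
            continue
        if any(labels[i] == labels[j] for i, j in cannot_link):
            continue
        cost = sum(sqdist(p, centers[l]) for p, l in zip(points, labels))
        # Soft constraints: a violation is allowed but adds its penalty.
        for i, j, kind, penalty in soft:
            same = labels[i] == labels[j]
            if (kind == "ML" and not same) or (kind == "CL" and same):
                cost += penalty
        if cost < best_cost:
            best_cost, best = cost, labels
    return best, best_cost


def update_centers(points, labels, centers):
    """Move each center to the mean of its members; keep empty ones as is."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == k]
        if members:
            dim = len(members[0])
            mean = tuple(sum(m[d] for m in members) / len(members)
                         for d in range(dim))
            new_centers.append(mean)
        else:
            new_centers.append(old)
    return new_centers


def pccc_sketch(points, centers, must_link=(), cannot_link=(), soft=(),
                max_iter=20):
    """Alternate assignment and center update until labels stop changing."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        new_labels, _ = assign(points, centers, must_link, cannot_link, soft)
        if new_labels is None:
            raise ValueError("hard constraints cannot all be satisfied")
        if new_labels == labels:
            break
        labels = new_labels
        centers = update_centers(points, labels, centers)
    return labels, centers
```

For example, with the four one-dimensional points `[(0.0,), (1.0,), (9.0,), (10.0,)]` and initial centers `[(0.0,), (10.0,)]`, a hard cannot-link pair `(0, 1)` forces the first two points into different clusters, whereas a soft cannot-link `(0, 1, "CL", 0.5)` is violated because paying the small penalty is cheaper than separating two nearby points.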