Clustering is one of the essential tools for data mining because it reveals the natural structure of unlabeled data. Many clustering algorithms have been proposed in recent decades. However, few of them are designed to exploit prior knowledge that is available in many real applications, such as the sizes of clusters. In this paper, we propose a novel iterative clustering algorithm that can impose constraints on the sizes of clusters. Given an unordered set of cluster size constraints, the proposed method minimizes the mean squared error (MSE) while satisfying the size constraints. Each iteration of the proposed method consists of two steps: an assignment step and an update step. In the assignment step, the observations are assigned to clusters under the size constraints. The assignment task is modeled as an integer linear programming (ILP) problem. We prove that part of the constraint matrix of this ILP problem is totally unimodular. Therefore, the integrality constraints on most of the variables can be dropped, so that the problem becomes a mixed integer linear programming (MILP) problem, which is much easier to solve. In the update step, the cluster centroids are recomputed as the centers of the observations assigned to the corresponding clusters. Experiments on UCI data sets indicate that (1) imposing the size constraints as proposed can improve clustering performance; and (2) compared with state-of-the-art size-constrained clustering methods, the proposed method efficiently derives better clustering results.
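The two-step iteration described above can be illustrated with a minimal Python sketch. It is not the authors' ILP/MILP formulation. It makes a simplifying assumption: each size is tied to a specific cluster (an ordered list of sizes rather than an unordered set), which reduces the assignment step to a balanced assignment problem. That problem is solved here with SciPy's `linear_sum_assignment` by giving cluster k exactly `sizes[k]` "slots". The function name `size_constrained_kmeans` and its parameters are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def size_constrained_kmeans(X, sizes, n_iter=20, seed=0):
    """Illustrative sketch: k-means where cluster k holds exactly sizes[k] points.

    Assumption (not from the paper): sizes are ordered, i.e. fixed per cluster.
    """
    X = np.asarray(X, dtype=float)
    sizes = np.asarray(sizes, dtype=int)
    if np.any(sizes <= 0) or sizes.sum() != len(X):
        raise ValueError("sizes must be positive and sum to the number of observations")

    k = len(sizes)
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct, randomly chosen observations.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    # slot_owner[j] is the cluster that owns slot j; cluster c owns sizes[c] slots.
    slot_owner = np.repeat(np.arange(k), sizes)

    labels = None
    for _ in range(n_iter):
        # Assignment step: squared distances from every point to every centroid,
        # then one column per slot so each cluster receives exactly its size.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        rows, cols = linear_sum_assignment(d2[:, slot_owner])
        new_labels = np.empty(len(X), dtype=int)
        new_labels[rows] = slot_owner[cols]

        # Stop when the assignment no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Update step: each centroid becomes the mean of its assigned points.
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])

    return labels, centroids


if __name__ == "__main__":
    demo = np.random.default_rng(1).normal(size=(10, 2))
    demo_labels, demo_centroids = size_constrained_kmeans(demo, sizes=[4, 6])
    print(np.bincount(demo_labels))  # per-cluster counts
```

Replicating centroid columns yields an n-by-n cost matrix, so this sketch is only practical for small data sets. Handling an unordered set of sizes, as in the abstract, would additionally require choosing which cluster receives which size.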