Abstract
Co-clustering, that is, partitioning a numerical matrix into “homogeneous” submatrices, has many applications ranging from bioinformatics to election analysis. Many interesting variants of co-clustering are NP-hard. We focus on the basic variant of co-clustering where the homogeneity of a submatrix is defined in terms of minimizing the maximum distance between two entries. In this context, we identify several NP-hard as well as a number of relevant polynomial-time solvable special cases, thus charting the border of tractability for this challenging data clustering problem. For instance, we show polynomial-time solvability when the rows and columns are each partitioned into two subsets (meaning that one obtains four submatrices). When partitioning rows and columns into three subsets each, however, we encounter NP-hardness, even for input matrices containing only values from {0, 1, 2}.
Highlights
Co-clustering, also known as bi-clustering [1], performs a simultaneous clustering of the rows and columns of a data matrix
A parameterized problem, where each instance consists of the “classical” problem instance I and an integer ρ called the parameter, is fixed-parameter tractable (FPT) if there is a computable function f and an algorithm solving any instance in f(ρ) · |I|^{O(1)} time
We observed that Co-Clustering∞ is easy to solve for binary input matrices (Observation 1)
Summary
Co-clustering, also known as bi-clustering [1], performs a simultaneous clustering of the rows and columns of a data matrix. The problem is, given a numerical input matrix A, to partition the rows and columns of A into subsets minimizing a given cost function (measuring “homogeneity”). For a given subset I of rows and a subset J of columns, the corresponding cluster consists of all entries a_ij with i ∈ I and j ∈ J. The cost function usually defines homogeneity in terms of distances (measured in some norm) between the entries of each cluster. Note that the variant where clusters are allowed to “overlap”, meaning that some rows and columns are contained in multiple clusters, has also been studied [1]. We focus on the non-overlapping variant, which can be stated as follows
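As an illustration of the kind of cost function described above (not the formal problem statement), the following minimal Python sketch evaluates a given co-clustering under the maximum-distance notion of homogeneity mentioned in the abstract. It assumes that the distance between two entries is their absolute difference and that the overall cost is the largest such distance within any single cluster. The function name `cocluster_cost` and the list-of-index-lists representation of the partitions are hypothetical choices made for this example.

```python
from itertools import product


def cocluster_cost(matrix, row_blocks, col_blocks):
    """Return the largest (max - min) spread over all induced clusters.

    Assumption for this sketch: the distance between two entries is their
    absolute difference, and the cost of a co-clustering is the worst
    spread found in any cluster I x J.
    """
    worst = 0
    # Each pair (row block I, column block J) induces one cluster.
    for rows, cols in product(row_blocks, col_blocks):
        entries = [matrix[i][j] for i in rows for j in cols]
        if entries:  # skip clusters induced by an empty block
            worst = max(worst, max(entries) - min(entries))
    return worst


# A small matrix with values from {0, 1, 2}.
A = [
    [0, 1, 2],
    [0, 1, 2],
    [2, 2, 0],
]

# Two row blocks and two column blocks give four clusters.
coarse = cocluster_cost(A, [[0, 1], [2]], [[0], [1, 2]])

# Refining the columns into three blocks makes every cluster constant.
fine = cocluster_cost(A, [[0, 1], [2]], [[0], [1], [2]])
```

Here `coarse` evaluates to 2, because the cluster formed by row 2 and columns 1 and 2 contains both 2 and 0, while `fine` evaluates to 0. The sketch only scores a given pair of partitions; finding partitions of minimum cost is the optimization problem whose complexity the paper studies.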