Abstract

Co-clustering, that is, partitioning a numerical matrix into “homogeneous” submatrices, has many applications ranging from bioinformatics to election analysis. Many interesting variants of co-clustering are NP-hard. We focus on the basic variant of co-clustering where the homogeneity of a submatrix is defined in terms of minimizing the maximum distance between two entries. In this context, we identify several NP-hard special cases as well as a number of relevant polynomial-time solvable ones, thus charting the border of tractability for this challenging data clustering problem. For instance, we show polynomial-time solvability when the rows and columns are to be partitioned into two subsets each (meaning that one obtains four submatrices). When partitioning rows and columns into three subsets each, however, we encounter NP-hardness, even for input matrices containing only values from {0, 1, 2}.

Highlights

  • Co-clustering, also known as bi-clustering [1], performs a simultaneous clustering of the rows and columns of a data matrix.

  • A parameterized problem, where each instance consists of the “classical” problem instance I and an integer ρ called the parameter, is fixed-parameter tractable (FPT) if there is a computable function f and an algorithm solving any instance in f(ρ) · |I|^O(1) time.

  • We observed that Co-Clustering∞ is easy to solve for binary input matrices (Observation 1).

Summary

Introduction

Co-clustering, also known as bi-clustering [1], performs a simultaneous clustering of the rows and columns of a data matrix. The problem is, given a numerical input matrix A, to partition the rows and columns of A into subsets minimizing a given cost function (measuring “homogeneity”). For a given subset I of rows and a subset J of columns, the corresponding cluster consists of all entries aij with i ∈ I and j ∈ J. The cost function usually defines homogeneity in terms of distances (measured in some norm) between the entries of each cluster. Note that the variant where clusters are allowed to “overlap”, meaning that some rows and columns are contained in multiple clusters, has also been studied [1]. We focus on the non-overlapping variant, which can be stated as follows:

Co-Clustering_L
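
The cluster and cost notions from the Introduction can be made concrete with a small sketch. It assumes, following the abstract's description, that the cost of a co-clustering under the maximum norm is the largest difference between two entries of the same cluster; the formal definition in the full text may be normalized differently. The function name `max_norm_cost`, the example matrix, and the partitions are hypothetical and chosen only for illustration.

```python
from itertools import product


def max_norm_cost(matrix, row_blocks, col_blocks):
    """Return the largest spread (max entry minus min entry) over all clusters.

    Assumption: this spread-based cost follows the abstract's wording
    ("maximum distance between two entries"); it is not taken verbatim
    from the paper. Every block is assumed to be non-empty.
    """
    worst = 0
    # Each pair (row block I, column block J) induces one cluster.
    for rows, cols in product(row_blocks, col_blocks):
        entries = [matrix[i][j] for i in rows for j in cols]
        worst = max(worst, max(entries) - min(entries))
    return worst


# Hypothetical 3 x 4 input matrix with values from {0, 1, 2}.
A = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [1, 1, 0, 0],
]

# Two row blocks and two column blocks give four clusters.
good_cost = max_norm_cost(A, [[0, 1], [2]], [[0, 1], [2, 3]])
bad_cost = max_norm_cost(A, [[0], [1, 2]], [[0, 1], [2, 3]])
print(good_cost, bad_cost)
```

Grouping rows 0 and 1 together keeps every cluster nearly constant, whereas grouping rows 1 and 2 places the values 0 and 2 in the same cluster, so the second partition has the higher cost.
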
Related Work
Our Contributions
Formal Definitions and Preliminaries
Problem Definition
Parameterized Algorithmics
Intractability Results
Constant Number of Clusters
Constant Number of Rows
Clustering into Consecutive Clusters
Tractability Results
Reduction to CNF-SAT Solving
Polynomial-Time Solvability
Fixed-Parameter Tractability
Conclusions