Abstract

A common approach to clustering data is to view data objects as points in a metric space, and then to optimize a natural distance-based objective such as the k-median, k-means, or min-sum score. For applications such as clustering proteins by function or clustering images by subject, the implicit hope in taking this approach is that the optimal solution for the chosen objective will closely match the desired “target” clustering (e.g., a correct clustering of proteins by function or of images by who is in them). However, most distance-based objectives, including those mentioned here, are NP-hard to optimize. So, assuming P ≠ NP, this assumption by itself is not sufficient to achieve low-error clusterings via polynomial-time algorithms. In this article, we show that we can bypass this barrier if we slightly extend this assumption to ask that, for some small constant c, not only the optimal solution but also all c-approximations to the optimal solution differ from the target on at most some ϵ fraction of points; we call this (c,ϵ)-approximation-stability. We show that under this condition, it is possible to efficiently obtain low-error clusterings even if the property holds only for values of c for which the objective is known to be NP-hard to approximate. Specifically, for any constant c > 1, (c,ϵ)-approximation-stability of the k-median or k-means objective can be used to efficiently produce a clustering of error O(ϵ) with respect to the target clustering, as can stability of the min-sum objective if the target clusters are sufficiently large. Thus, we can perform nearly as well in terms of agreement with the target clustering as if we could approximate these objectives to this NP-hard value.
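To make the definition of (c,ϵ)-approximation-stability concrete, the following is a minimal sketch that checks the condition by brute force on a tiny instance of the k-median objective. It is only an illustration of the definition, not the efficient algorithm the article describes: it enumerates every clustering, so it runs in exponential time. The function names, the choice of points on a line with absolute difference as the metric, and the restriction of centers to the data points are all assumptions made for this example.

```python
from itertools import permutations, product


def kmedian_cost(points, labels, k):
    """k-median cost of a clustering; each center is the best data point for its cluster."""
    total = 0.0
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            return float("inf")  # require exactly k non-empty clusters
        # Assumption: the metric is |p - c| on the real line.
        total += min(sum(abs(p - c) for p in members) for c in points)
    return total


def error(labels, target, k):
    """Fraction of points misclassified under the best matching of cluster labels."""
    best = len(labels)
    for perm in permutations(range(k)):
        mismatches = sum(perm[lab] != t for lab, t in zip(labels, target))
        best = min(best, mismatches)
    return best / len(labels)


def is_approximation_stable(points, target, k, c, eps):
    """True if every clustering of cost <= c * OPT is within error eps of the target."""
    labelings = list(product(range(k), repeat=len(points)))
    costs = [kmedian_cost(points, lab, k) for lab in labelings]
    opt = min(costs)
    return all(
        error(lab, target, k) <= eps
        for lab, cost in zip(labelings, costs)
        if cost <= c * opt
    )


# Two well-separated pairs on a line; the target groups each pair together.
points = [0, 1, 10, 11]
target = [0, 0, 1, 1]

stable_c2 = is_approximation_stable(points, target, k=2, c=2.0, eps=0.0)
stable_c6_tight = is_approximation_stable(points, target, k=2, c=6.0, eps=0.1)
stable_c6_loose = is_approximation_stable(points, target, k=2, c=6.0, eps=0.25)
```

In this toy instance the optimal clustering has cost 2 and the next-best clusterings have cost 10, so for c = 2 only the optimum qualifies and the instance is stable with ϵ = 0. For c = 6, clusterings that move one point to the wrong cluster also qualify, so stability holds only once ϵ is at least 1/4.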
