Abstract

This work studies clustering algorithms that operate with ordinal or comparison-based queries (operations), a situation that arises in many active-learning applications where “dissimilarities” between data points are evaluated by humans. Typically, exact answers are costly (or difficult to obtain in large amounts), while possibly erroneous answers have low cost. Motivated by these considerations, we study algorithms with non-trivial trade-offs between the number of exact (high-cost) operations and noisy (low-cost) operations, with provable performance guarantees. Specifically, we study a class of polynomial-time graph-based clustering algorithms (termed Single-Linkage) that are widely used in practice and guarantee exact solutions for stable instances of several clustering problems (these problems are NP-hard in the worst case). We provide several variants of these algorithms using ordinal operations and, in particular, non-trivial trade-offs between the number of high-cost and low-cost operations used. Our algorithms still guarantee exact solutions for stable instances of k-medoids clustering, and they use a rather small number of high-cost operations without increasing the number of low-cost operations too much.

Highlights

  • Clustering is a fundamental and widely studied problem in machine learning and in computational complexity as well

  • We address these questions by (i) introducing a formal model and (ii) by considering a class of clustering problems/algorithms in this model

  • This work focuses on the so-called k-medoids clustering problem, where the center of each cluster must be a point of the cluster [16,17,18,19]

Summary

Introduction

Clustering is a fundamental and widely studied problem in machine learning and in computational complexity (see [4] for a nice introduction). Roughly speaking, the Single-Linkage algorithm first computes a (minimum) spanning tree over the pairwise distances or dissimilarities of the data, and then removes a suitable subset of edges to obtain the optimal k-clustering. The setting considered here combines three requirements: using only ordinal information, dealing with noisy data, and allowing expensive operations to remove errors. This situation arises, for example, in semi-active learning approaches where the pairwise dissimilarity between objects (data points) is evaluated by humans via simple comparison queries (see, e.g., [5,6,7] and references therein). Such evaluations are inherently subject to errors. The central question is: what trade-offs between expensive and inexpensive (noisy) operations still allow for finding optimal solutions?
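The spanning-tree idea described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes an exact (error-free) comparison oracle, and it removes the k-1 heaviest tree edges (the textbook single-linkage rule) rather than the "suitable subset" that the paper selects for k-medoids. The names `make_exact_oracle`, `single_linkage`, and `demo_points` are hypothetical and chosen only for illustration. The point of the sketch is that the clustering routine never reads a numeric dissimilarity; it only asks which of two pairs is closer.

```python
from functools import cmp_to_key
from itertools import combinations


def make_exact_oracle(points):
    """Hypothetical exact oracle: compares two index pairs by dissimilarity.

    Only the sign of the answer is exposed, so the caller receives
    purely ordinal information. The call counter is for illustration.
    """
    stats = {"calls": 0}

    def compare(pair_a, pair_b):
        stats["calls"] += 1
        da = abs(points[pair_a[0]] - points[pair_a[1]])
        db = abs(points[pair_b[0]] - points[pair_b[1]])
        return (da > db) - (da < db)

    return compare, stats


def single_linkage(points, k, compare):
    """Textbook single-linkage using only ordinal comparisons of pairs."""
    n = len(points)
    # Order all pairs from most similar to least similar via the oracle.
    pairs = sorted(combinations(range(n), 2), key=cmp_to_key(compare))

    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Kruskal's algorithm, stopped when k components remain. This equals
    # building the full minimum spanning tree and then deleting its
    # k-1 heaviest edges (a simplification of the paper's edge selection).
    components = n
    for i, j in pairs:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1

    clusters = {}
    for idx in range(n):
        clusters.setdefault(find(idx), []).append(points[idx])
    return sorted(sorted(c) for c in clusters.values())


demo_points = [0.0, 1.0, 1.5, 10.0, 11.0, 30.0]
compare, stats = make_exact_oracle(demo_points)
demo_clusters = single_linkage(demo_points, 3, compare)
print(demo_clusters)
print("comparisons used:", stats["calls"])
```

In the setting of the paper, some of these comparisons would be answered by a cheap but possibly erroneous oracle, and the counter would be split into high-cost and low-cost calls; the sketch leaves that part out.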

Our Contribution
Our Model
Algorithms and Bounds for k-Medoids
Techniques and Relation to Prior Work
Model and Preliminary Definitions
Stable Instances
Comparisons and Errors
Performance Evaluation
Two Algorithmic Tools
Clustering in Stable Instances
Warm-Up
Matroid Approximations Fail
Phase 1
Exact Centroids and Exact Solutions
Approximate Centroids and Approximate Solutions
Dynamic Programming
Basic Notation and Adaptations
The Actual Algorithm
Same Cluster Queries
Small-Radius and Same-Cluster Queries
An Algorithm Using Few SCQs
The Algorithm
Extensions of Theorem 5
Query Model
Error Model
Open Questions