Abstract
This work studies clustering algorithms that operate with ordinal or comparison-based queries (operations), a situation that arises in many active-learning applications where “dissimilarities” between data points are evaluated by humans. Typically, exact answers are costly (or difficult to obtain in large amounts), while possibly erroneous answers have low cost. Motivated by these considerations, we study algorithms with non-trivial trade-offs between the number of exact (high-cost) operations and noisy (low-cost) operations, together with provable performance guarantees. Specifically, we study a class of polynomial-time graph-based clustering algorithms (termed Single-Linkage) which are widely used in practice and which guarantee exact solutions for stable instances of several clustering problems (these problems are NP-hard in the worst case). We provide several variants of these algorithms using ordinal operations and, in particular, non-trivial trade-offs between the number of high-cost and low-cost operations used. Our algorithms still guarantee exact solutions for stable instances of k-medoids clustering, and they use a rather small number of high-cost operations without increasing the number of low-cost operations too much.
Highlights
Clustering is a fundamental and widely studied problem in machine learning and in computational complexity
We address these questions by (i) introducing a formal model and (ii) considering a class of clustering problems/algorithms in this model
This work focuses on the so-called k-medoids clustering problem, where the center of each cluster must be a point of the cluster [16,17,18,19]
Summary
Clustering is a fundamental and widely studied problem in machine learning and in computational complexity (see [4] for a nice introduction). Roughly speaking, the Single-Linkage algorithm first computes a (minimum) spanning tree over the pairwise distances or dissimilarities of the data, and then removes a suitable subset of edges to obtain the optimal k-clustering. Our setting combines three aspects: using only ordinal information; dealing with noisy data; and allowing expensive operations to remove errors. This situation arises, for example, in semi-active learning approaches where the pairwise dissimilarity between objects (data points) is evaluated by humans via simple comparison queries (see, e.g., [5,6,7] and references therein); such evaluations are inherently subject to errors. What trade-offs between expensive and non-expensive (noisy) operations still allow for finding optimal solutions?
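The MST-based clustering step described above can be sketched in a few lines. The snippet below is a minimal illustration (not the paper's ordinal or noise-tolerant variants): it runs Kruskal's algorithm with union-find on a toy dissimilarity matrix and stops before performing the k-1 most expensive merges, so that k connected components (clusters) remain. The matrix `D` is a made-up example.

```python
def single_linkage(dissim, k):
    """Return a cluster label in 0..k-1 for each of the n points."""
    n = len(dissim)
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Sort all pairs by dissimilarity, as Kruskal's algorithm does.
    edges = sorted((dissim[i][j], i, j)
                   for i in range(n) for j in range(i + 1, n))

    merges = 0
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            merges += 1
            if merges == n - k:  # stop before the k-1 largest MST edges
                break

    # Relabel the surviving components as 0..k-1.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

# Toy instance: two well-separated groups, {0, 1, 2} and {3, 4}.
D = [
    [0, 1, 2, 9, 9],
    [1, 0, 1, 9, 9],
    [2, 1, 0, 9, 9],
    [9, 9, 9, 0, 1],
    [9, 9, 9, 1, 0],
]
labels = single_linkage(D, k=2)  # → [0, 0, 0, 1, 1]
```

Note that only the *order* of the pairwise dissimilarities matters here, not their exact values, which is what makes the ordinal-query setting natural for this class of algorithms.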