Abstract

Clustering is a fundamental tool for analyzing large data sets. A rich body of work has been devoted to designing data-stream algorithms for the relevant optimization problems such as k-center, k-median, and k-means. Such algorithms need to be both time and space efficient. In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on n nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, $O(n \cdot \mathrm{polylog}\, n)$-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the “quality” of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. Unfortunately, the standard LP and SDP formulations are not obviously solvable in $O(n \cdot \mathrm{polylog}\, n)$ space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling.
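To make the objective concrete, the following minimal sketch accumulates a stream of edge-weight updates and then measures the "quality" of a given node-partition as its total disagreement weight: positive-weight edges that are cut plus negative-weight edges that stay inside a cluster. The function names `apply_updates` and `disagreement` and the `(u, v, delta)` update format are assumptions made for illustration. The sketch stores every edge explicitly, so it is not the space-efficient linear-sketch data structure the paper develops; it only illustrates the quantity being measured.

```python
from collections import defaultdict


def apply_updates(updates):
    """Accumulate a dynamic stream of (u, v, delta) edge-weight updates.

    Hypothetical update format: each update adds delta (possibly negative)
    to the weight of the undirected edge {u, v}.
    """
    weights = defaultdict(float)
    for u, v, delta in updates:
        key = (min(u, v), max(u, v))  # canonical form of an undirected edge
        weights[key] += delta
    # Edges whose updates cancel to zero weight are dropped.
    return {edge: w for edge, w in weights.items() if w != 0}


def disagreement(weights, cluster_of):
    """Total weight of edges that disagree with the partition cluster_of."""
    cost = 0.0
    for (u, v), w in weights.items():
        same_cluster = cluster_of[u] == cluster_of[v]
        if w > 0 and not same_cluster:
            cost += w       # positive edge split across clusters
        elif w < 0 and same_cluster:
            cost += -w      # negative edge kept inside one cluster
    return cost
```

A partition with lower disagreement respects more of the edge labels; a streaming algorithm has to estimate this quantity without keeping the full `weights` dictionary in memory.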

Highlights

  • Correlation Clustering is a widely studied model of clustering within graphs where edges are marked as positive or negative

  • The correlation clustering problem was initially proposed for complete unweighted graphs

  • The correlation clustering problem was first formulated as an optimization problem by Bansal, Blum and Chawla [12]


Summary

Introduction

Correlation clustering is a widely studied model of clustering within graphs where edges are marked as positive or negative. The correlation clustering problem was initially proposed for complete unweighted graphs. The motivation for this formulation is an intuitive one: there are many. Non-adaptive sampling algorithms for correlation clustering can be implemented in the data stream model, as applied by Ailon and Karnin [8], to construct additive approximations. Chierichetti, Dalvi and Kumar [20] presented the first multiplicative approximation data stream algorithm: a polynomial-time $(3 + \epsilon)$-approximation. Using space roughly proportional to the number of nodes can be shown to be necessary for solving many natural graph problems including, it will turn out, correlation clustering. For a recent survey of semi-streaming algorithms and graph sketching, see [43].

Computational Model
Techniques and Results
Basic Data Structures and Applications
First Data Structure
Second Data Structure
Third Data Structure
A Dual Primal Approach
From Rounding Algorithms to Oracles
Streaming Multicut Problem
Multipass Algorithms
Lower Bounds