Correlation Clustering in Data Streams

Kook Jin Ahn, Graham Cormode, Andrew McGregor, Anthony Wirth, Sudipto Guha

doi:10.1007/s00453-021-00816-9

Abstract

Clustering is a fundamental tool for analyzing large data sets. A rich body of work has been devoted to designing data-stream algorithms for the relevant optimization problems such as k-center, k-median, and k-means. Such algorithms need to be both time and space efficient. In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on n nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, $O(n \cdot \mathrm{polylog}\, n)$-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the “quality” of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. Unfortunately, the standard LP and SDP formulations are not obviously solvable in $O(n \cdot \mathrm{polylog}\, n)$-space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling.
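To make the objective concrete, the following minimal Python sketch applies a stream of edge-weight updates and then scores a given node-partition by its total disagreement weight: positive edges that are cut plus negative edges that are kept inside a cluster. It is an illustration only and not the paper's method: it stores every edge explicitly, so it does not achieve the sketch-based space bound described above. The names `apply_updates`, `disagreement`, and the example stream are assumptions introduced here.

```python
from collections import defaultdict


def apply_updates(stream):
    """Accumulate (u, v, delta) updates into final edge weights.

    Assumption: each stream item adds `delta` to the weight of the
    undirected edge {u, v}; deletions are negative deltas.
    """
    weights = defaultdict(float)
    for u, v, delta in stream:
        key = (min(u, v), max(u, v))  # canonical form for an undirected edge
        weights[key] += delta
    # Edges whose updates cancel out are no longer present in the graph.
    return {edge: w for edge, w in weights.items() if w != 0}


def disagreement(weights, cluster_of):
    """Total |weight| of edges that the partition classifies 'wrongly'."""
    cost = 0.0
    for (u, v), w in weights.items():
        same_cluster = cluster_of[u] == cluster_of[v]
        # A positive edge disagrees when cut; a negative edge when uncut.
        if (w > 0 and not same_cluster) or (w < 0 and same_cluster):
            cost += abs(w)
    return cost


# Hypothetical stream: edge (2, 3) is inserted and later deleted.
stream = [(0, 1, +1), (1, 2, +1), (0, 2, -1), (2, 3, +1), (2, 3, -1), (0, 3, -1)]
weights = apply_updates(stream)
partition = {0: "A", 1: "A", 2: "A", 3: "B"}
cost = disagreement(weights, partition)
print(cost)
```

In this example the only disagreement is the negative edge (0, 2), whose end-points share cluster "A".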

Highlights

Correlation Clustering is a widely studied model of clustering within graphs where edges are marked as positive or negative
The correlation clustering problem was initially proposed for complete unweighted graphs
The correlation clustering problem was first formulated as an optimization problem by Bansal, Blum and Chawla [12]

Summary

Introduction

Correlation Clustering is a widely studied model of clustering within graphs where edges are marked as positive or negative. The correlation clustering problem was initially proposed for complete unweighted graphs. The motivation for this formulation is an intuitive one: there are many. Non-adaptive sampling algorithms for correlation clustering can be implemented in the data stream model, as applied by Ailon and Karnin [8], to construct additive approximations. Chierichetti, Dalvi and Kumar [20] presented the first multiplicative approximation data stream algorithm: a polynomial-time $(3 + \epsilon)$-approximation. Space roughly proportional to the number of nodes can be shown to be necessary for solving many natural graph problems including, as it turns out, correlation clustering. For a recent survey of semi-streaming algorithms and graph sketching, see [43].

Computational Model

Techniques and Results

Basic Data Structures and Applications

First Data Structure

Second Data Structure

Third Data Structure

A Dual Primal Approach

From Rounding Algorithms to Oracles

Streaming Multicut Problem

Multipass Algorithms

Lower Bounds

Journal: Algorithmica	Publication Date: Mar 13, 2021
Citations: 31	License type: open-access
