Abstract

Continuing technological advances in many areas pose a challenge for researchers in computer science, and in particular in the area of algorithms and theory. Even though the performance of computers and their central processing units increases at a fast rate, the gap between processing speed and data volume widens constantly, because the data that surrounds us multiplies at an even faster pace. One example of this phenomenon is the Large Hadron Collider at CERN, which generates more than half a gigabyte of data every second. In such settings, even algorithms with linear running time are too slow if they need random access to the data. Data stream algorithms are algorithms that need only one pass over the data to (approximately) solve a problem. Their memory usage is usually polynomial in the logarithm of the input size. Ideally, a data stream algorithm can process the data directly while it is created.

In my thesis, I consider k-means clustering. Given n points in the d-dimensional Euclidean space R^d, the k-means problem is to compute k centers which minimize the sum of the squared distances of all points to their closest center. The centers can be chosen arbitrarily from R^d. For a given solution, i.e., a set of k centers, we call the sum of the squared distances the k-means cost of this solution. The k-means problem has been studied for sixty years and frequently arises in machine learning, often as a subproblem.

In the context of data streams, a popular technique to solve the k-means problem is the computation of coresets. A coreset for a point set P is a (usually much smaller) point set S which has approximately the same cost as P for every possible solution. More precisely, for the k-means problem, a (1 + ε)-coreset for an ε ∈ (0, 1) is a set S whose cost for any set of k centers C deviates from the cost of P with the same centers C by at most an ε-fraction. A coreset construction is often first designed as a polynomial-time algorithm with random access to the data. The algorithm is then converted into a data stream algorithm by a technique known as Merge-and-Reduce, which usually increases the memory usage by a factor polynomial in log n.

In joint work with Hendrik Fichtenberger, Marc Bury (né Gille), Chris Schwiegelshohn and Christian Sohler, I developed a data stream algorithm for the k-means problem which does not use Merge-and-Reduce. It processes the input points one by one and directly inserts them into an appropriate data structure. We build on a data structure used in BIRCH (Zhang, Ramakrishnan, Livny, 1997), an algorithm which is very popular in practical applications. By analyzing and improving this data structure, we developed an algorithm that computes a (1 + ε)-coreset in the data stream model using pointwise updates. Our algorithm is named BICO, a combination of BIRCH and the term coreset. The memory usage of BICO is bounded by O(k · log n · ε^−(d+1)) if the dimension d of the input points is a constant. We implemented a slightly modified version of BICO and combined it with an algorithm for the k-means problem which is known for its good results in practical applications. In an experimental study, we verified that the combined implementation computes solutions of high quality while being much faster than other implementations that compute solutions of comparable quality. Our work was published at the European Symposium on Algorithms (ESA) 2013.
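To make the k-means cost and the coreset guarantee concrete, the following Python sketch computes the cost of a center set and checks the (1 + ε)-condition on a finite collection of candidate center sets. It is an illustration only, not code from the thesis: the function names are my own, and a true coreset must satisfy the bound for every possible set of k centers, not just a tested sample.

import numpy as np

def kmeans_cost(points, centers):
    # Sum of squared Euclidean distances from each point to its closest center.
    # points: (n, d) array, centers: (k, d) array.
    sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.min(axis=1).sum()

def looks_like_coreset(P, S, candidate_center_sets, eps):
    # Check |cost(S, C) - cost(P, C)| <= eps * cost(P, C) for the given center sets C.
    # Testing finitely many C can only refute, never prove, the coreset property.
    for C in candidate_center_sets:
        cost_P = kmeans_cost(P, C)
        cost_S = kmeans_cost(S, C)
        if abs(cost_S - cost_P) > eps * cost_P:
            return False
    return True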
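The Merge-and-Reduce technique mentioned above turns an offline coreset construction into a streaming one by keeping coreset summaries of input chunks in a binary-counter-like hierarchy, so that only O(log n) summaries exist at any time. The sketch below is a generic illustration under the assumption that reduce() is some black-box offline coreset routine returning a plain list of points; it is not BICO, which avoids this detour.

def merge_and_reduce(stream, chunk_size, reduce):
    # buckets[level] holds at most one summary; two summaries of the same level
    # are merged and reduced into one summary of the next level, like carries
    # in a binary counter, so only O(log n) summaries are kept at any time.
    buckets = {}
    chunk = []
    for point in stream:
        chunk.append(point)
        if len(chunk) == chunk_size:
            summary, level = reduce(chunk), 0
            chunk = []
            while level in buckets:
                summary = reduce(buckets.pop(level) + summary)
                level += 1
            buckets[level] = summary
    # The remaining chunk together with all bucket summaries represents the stream.
    return chunk + [p for summary in buckets.values() for p in summary]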

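BIRCH, and building on it BICO, summarizes groups of points by compact statistics rather than storing the points themselves. The sketch below shows the classic BIRCH clustering feature (number of points, coordinate-wise sum, sum of squared norms) and the standard identities it supports; BICO's actual insertion and rebuilding rules on top of such summaries are more involved and are not reproduced here.

import numpy as np

class ClusteringFeature:
    # BIRCH-style summary of a point set: (count, linear sum, sum of squared norms).

    def __init__(self, d):
        self.n = 0                  # number of summarized points
        self.ls = np.zeros(d)       # coordinate-wise (linear) sum
        self.ss = 0.0               # sum of squared Euclidean norms

    def insert(self, p):
        p = np.asarray(p, dtype=float)
        self.n += 1
        self.ls += p
        self.ss += float(p @ p)

    def merge(self, other):
        # Two summaries are combined by componentwise addition.
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def cost_to_centroid(self):
        # sum_p ||p - centroid||^2 = SS - ||LS||^2 / n  (standard identity)
        return self.ss - float(self.ls @ self.ls) / self.n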