Abstract

Convex optimization is at the core of many of today's analysis tools for large datasets, and in particular of machine learning methods. In this thesis we study the general setting of optimizing (minimizing) a convex function over a compact convex domain.

In the first part of the thesis, we study a simple iterative approximation algorithm for this class of optimization problems, based on the classical method of Frank & Wolfe. The algorithm relies only on supporting hyperplanes to the function being optimized. In each iteration, we move slightly towards a point which (approximately) minimizes the linear function given by the supporting hyperplane at the current point, where the minimum is taken over the original optimization domain. In contrast to gradient-descent-type methods, this algorithm needs no projection steps to stay inside the optimization domain. Our framework generalizes the sparse greedy algorithm of Frank & Wolfe and its recent primal-dual analysis by Clarkson (and the low-rank SDP approach by Hazan) to arbitrary compact convex domains. Analogously, we give a convergence proof guaranteeing ε-small error (which in our context is the duality gap) after O(1/ε) iterations.

This method allows us to understand the sparsity of approximate solutions to any ℓ1-regularized convex optimization problem (and to optimization over the simplex) as a function of the approximation quality. We obtain matching upper and lower bounds of Θ(1/ε) for the sparsity. The same bounds apply to low-rank semidefinite optimization with bounded trace, showing that rank O(1/ε) is best possible there as well.

For some classes of geometric optimization problems, our algorithm has a simple geometric interpretation, also known as the coreset concept. We study linear classifiers such as support vector machines (SVMs) and perceptrons, as well as general distance computations between convex hulls (or polytopes). In this setting, the framework allows us to understand the sparsity of SVM solutions, that is, the number of support vectors, in terms of the required approximation quality.
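To make the iteration described above concrete, the following is a minimal sketch (not part of the thesis itself) of a generic Frank-Wolfe step in Python. The linear minimization oracle `linear_minimizer`, the step size 2/(t+2), and the example over the probability simplex are standard illustrative choices assumed here, not details taken from the abstract.

```python
import numpy as np

def frank_wolfe(grad, linear_minimizer, x0, num_iters=100, tol=1e-6):
    """Generic Frank-Wolfe (conditional gradient) sketch.

    grad(x)             -- gradient of the convex objective at x
                           (defines the supporting hyperplane at x)
    linear_minimizer(g) -- returns a point of the domain minimizing <g, s>,
                           i.e. minimizes the linear function over the
                           original compact convex domain
    x0                  -- feasible starting point
    """
    x = np.asarray(x0, dtype=float)
    for t in range(num_iters):
        g = grad(x)
        s = linear_minimizer(g)        # minimize the supporting hyperplane over the domain
        gap = g @ (x - s)              # duality gap, used as the error measure
        if gap <= tol:
            break
        step = 2.0 / (t + 2.0)         # standard step size; yields O(1/eps) iterations
        x = (1 - step) * x + step * s  # move slightly towards s; no projection needed
    return x

# Hypothetical usage: minimize ||Ax - b||^2 over the probability simplex,
# where the linear minimizer is simply the best vertex (hence sparse iterates).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
    grad = lambda x: 2 * A.T @ (A @ x - b)
    lmo = lambda g: np.eye(len(g))[np.argmin(g)]   # vertex of the simplex
    print(frank_wolfe(grad, lmo, x0=np.ones(5) / 5))
```

After T iterations the iterate is a convex combination of at most T+1 points returned by the oracle, which is exactly the sparsity (or low-rank) property quantified by the Θ(1/ε) bounds above.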
