Abstract

In this paper, we consider the distributed version of Support Vector Machine (SVM) under the coordinator model, where all input data (i.e., points in [Formula: see text] space) of SVM are arbitrarily distributed among [Formula: see text] nodes in a network with a coordinator that can communicate with all nodes. We investigate two variants of this problem, with and without outliers. For distributed SVM without outliers, we prove a lower bound on the communication complexity and give a distributed [Formula: see text]-approximation algorithm that matches this lower bound, where [Formula: see text] is a user-specified small constant. For distributed SVM with outliers, we present a [Formula: see text]-approximation algorithm that explicitly removes the influence of outliers. Our algorithm is based on a deterministic distributed top-[Formula: see text] selection algorithm with communication complexity [Formula: see text] in the coordinator model.
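To make the coordinator model concrete, the following is a minimal, hypothetical sketch of a naive distributed top-t selection in that model: every node sends its t largest local values to the coordinator, which merges the candidates. This is only an illustrative baseline whose communication grows with the number of nodes times t; it is not the deterministic algorithm with the communication bound claimed in the paper, and the function names and sample data are made up for illustration.

```python
import heapq

def node_local_top(values, t):
    """Run at each node: return the t largest local values (the node's message)."""
    return heapq.nlargest(t, values)

def coordinator_top(all_node_messages, t):
    """Run at the coordinator: merge the local candidate lists and keep the global top t."""
    merged = [v for msg in all_node_messages for v in msg]
    return heapq.nlargest(t, merged)

# Example: 3 nodes, global top-4 (every global top value is in some node's local top-4)
nodes = [[9.2, 1.0, 5.5], [7.1, 8.8, 0.3], [6.6, 2.2, 9.9]]
messages = [node_local_top(v, 4) for v in nodes]   # "sent" to the coordinator
print(coordinator_top(messages, 4))                # [9.9, 9.2, 8.8, 7.1]
```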

Highlights

  • Training a Support Vector Machine (SVM) [5] in a distributed setting is a commonly encountered problem in the big data era. This could be because the data set is too large to be stored at a centralized site, or because the data set is collected in a distributed environment.

  • A significant amount of effort has been devoted to this problem ([9, 3, 12, 14, 4, 16, 8, 18, 6]), and a number of distributed SVM algorithms with different strengths have been developed.

  • In this paper we present a distributed SVM algorithm, based on the classical Gilbert algorithm [11], that is theoretically guaranteed to achieve the lowest possible communication cost while returning a near-optimal solution (a minimal sketch of the Gilbert iteration appears right after these highlights).
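The Gilbert algorithm referenced above solves the polytope-distance problem: find the point of the convex hull of a point set that is closest to the origin, which is the formulation the paper reduces SVM training to (see the section "Equivalence between SVM and Polytope Distance"). Below is a minimal NumPy sketch of the classical centralized iteration, not the paper's distributed version; the tolerance and iteration cap are illustrative.

```python
import numpy as np

def gilbert_polytope_distance(P, eps=1e-3, max_iter=1000):
    """Approximate the point of conv(P) closest to the origin (classical Gilbert iteration).

    P: (n, d) array of input points. Returns the current iterate x in conv(P).
    """
    # Start from the input point closest to the origin.
    x = P[np.argmin(np.einsum('ij,ij->i', P, P))].astype(float)
    for _ in range(max_iter):
        # Greedy step: the input point with the smallest inner product with x.
        p = P[np.argmin(P @ x)]
        # Duality-gap style stopping rule: <x, x> - <x, p> small relative to |x|^2.
        gap = x @ x - x @ p
        if gap <= eps * (x @ x):
            break
        # Move to the point of the segment [x, p] closest to the origin.
        d = p - x
        t = np.clip(-(x @ d) / (d @ d), 0.0, 1.0)
        x = x + t * d
    return x
```

Each iteration only needs the input point minimizing an inner product with the current iterate, which is the property that incremental and distributed variants typically exploit.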


Summary

Introduction

Training a Support Vector Machine (SVM) [5] in a distributed setting is a commonly encountered problem in the big data era. One extensively studied family of algorithms in recent years is that of incremental construction algorithms [9, 3]. Such algorithms often perform well in practice, have other nice features related to robustness and decentralization, and typically focus on enhancing the ability to handle extremely large data sets; however, they generally have no theoretical guarantee on communication complexity, and some of them have no quality guarantee on their solutions either. Another family consists of distributed stochastic gradient descent algorithms [20]; the main issue with such algorithms is that their running time (or number of iterations) is mostly sub-optimal, and they likewise do not have a guarantee on communication cost. We performed experiments on benchmark datasets to evaluate the performance of the algorithms (in the full version of the paper).
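For context on the stochastic-gradient family mentioned above, here is a generic, minimal sketch of one coordinator-style (sub)gradient round for a linear SVM with hinge loss: the coordinator broadcasts the current model, every node returns the subgradient computed on its local data, and the coordinator averages the messages and updates. This is not the specific algorithms of [20]; true SGD variants would sample mini-batches instead of the full local batch, and the learning rate, decay, and regularization parameter below are arbitrary illustrative choices.

```python
import numpy as np

def local_subgradient(w, X, y, lam):
    """Run at one node: subgradient of the regularized hinge loss on local data (X, y)."""
    margins = y * (X @ w)
    viol = margins < 1.0                                  # margin-violating points
    g = -(y[viol, None] * X[viol]).sum(axis=0) / len(y)   # hinge-loss part
    return g + lam * w                                    # L2 regularization

def coordinator_descent(node_data, d, lam=1e-3, lr=0.1, rounds=100):
    """Coordinator loop: broadcast w, collect one local subgradient per node, average, update."""
    w = np.zeros(d)
    for _ in range(rounds):
        grads = [local_subgradient(w, X, y, lam) for X, y in node_data]  # one message per node
        w -= lr * np.mean(grads, axis=0)
        lr *= 0.99                                        # simple decay (illustrative)
    return w
```

Every round costs one d-dimensional message per node, which illustrates why the total communication of such iterative methods depends directly on the number of rounds.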

Equivalence between SVM and Polytope Distance
Gilbert Algorithm
Communication Complexity of Distributed SVM
Robust Distributed SVM
RGD Tree
Extending RGD Tree to Distributed Settings
Extension to Two-Class SVM