Abstract

We consider a novel distributed learning problem: A server receives potentially unlimited data from clients in a sequential manner, but only a small initial fraction of these data are labeled. Because communication bandwidth is expensive, each client is limited to sending the server only a small (high-priority) fraction of the unlabeled data it generates, and the server is limited in the amount of prioritization hints it sends back to the client. The goal is for the server to learn a good model of all the client data from the labeled and unlabeled data it receives. This setting is frequently encountered in real-world applications and has the characteristics of online, semi-supervised, and active learning. However, previous approaches are not designed for the client-server setting and offer no way to reduce communication costs. We present a novel framework for solving this learning problem in an effective and communication-efficient manner. On the server side, our solution combines two diverse learners working collaboratively, yet in distinct roles, on the partially labeled data stream. A compact, online graph-based semi-supervised learner is used to predict labels for the unlabeled data arriving from the clients. Samples from this model are used as ongoing training data for a linear classifier. On the client side, our solution prioritizes data based on an active-learning metric that favors instances that are close to the classifier's decision hyperplane and yet far from each other. To reduce communication, the server sends the classifier's weight vector to the client only periodically. Experimental results on real-world data sets show that this particular combination of techniques outperforms other approaches, and in particular, often outperforms (communication-expensive) approaches that send all the data to the server.
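
To make the client-side prioritization concrete, the following is a minimal sketch of one way such a margin-plus-diversity selection could be scored, assuming the client holds the classifier's weight vector w and bias b received from the server; the function name, the greedy scoring rule, and the diversity_weight parameter are illustrative assumptions, not the paper's exact metric.

```python
import numpy as np

def prioritize(X, w, b, k, diversity_weight=1.0):
    """Greedy client-side selection sketch: prefer unlabeled points near the
    decision hyperplane (uncertain) that are also far from points already
    chosen (diverse). The scoring and weighting here are assumptions for
    illustration only."""
    # Distance of each candidate point to the hyperplane w.x + b = 0
    margins = np.abs(X @ w + b) / (np.linalg.norm(w) + 1e-12)

    selected = []
    for _ in range(min(k, len(X))):
        scores = -margins.copy()  # closer to the hyperplane -> higher score
        if selected:
            # Penalize closeness to already-selected points to encourage diversity
            dist_to_selected = np.min(
                np.linalg.norm(X[:, None, :] - X[selected][None, :, :], axis=2),
                axis=1,
            )
            scores += diversity_weight * dist_to_selected
        scores[selected] = -np.inf  # never re-select a point
        selected.append(int(np.argmax(scores)))
    return selected
```

In this sketch, X is the client's buffer of unlabeled instances, k is the per-round budget of instances the client may send, and w, b are the most recent (periodically transmitted) classifier parameters from the server.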
