Abstract

A distributed machine learning platform needs to recruit many heterogeneous worker nodes to perform computation in parallel. As a result, overall performance may be degraded by straggling workers. By introducing redundancy into the computation, coded machine learning can effectively improve runtime performance, because the final computation result can be recovered from the first $k$ (out of $n$ in total) workers to finish. While existing studies focus on designing efficient coding schemes, the design of proper incentives to encourage worker participation remains under-explored. This paper studies the platform's optimal incentive mechanism for motivating proper worker participation in coded machine learning under multi-dimensional incomplete information about heterogeneous workers' computation performance and costs. A key contribution of this work is to summarize workers' multi-dimensional heterogeneity as a one-dimensional metric, which guides the platform's efficient selection of workers under incomplete information with linear computational complexity. Although the exact overall runtime is intractable, we characterize the platform's (asymptotically) optimal load assignment to heterogeneous workers in coded machine learning. When the platform has incomplete information about workers' costs, it is optimal to assign loads based only on workers' computation performance; when the platform further lacks information about workers' computation performance, it is optimal to make the loads both cost-dependent and performance-dependent.
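The "first $k$ out of $n$" recovery rule means the job finishes at the $k$-th order statistic of the workers' finishing times rather than at the slowest worker's time. The following is a minimal illustrative sketch, not the paper's model: it assumes (hypothetically) that each worker's finishing time is exponentially distributed with a worker-specific rate to mimic heterogeneity, and it ignores the larger per-worker load that redundant coding requires. The helper names `kth_finish_time` and `simulate` are hypothetical.

```python
import random


def kth_finish_time(times, k):
    """Return the time at which the k-th fastest worker finishes.

    With an (n, k) code, the result is recoverable once any k of the
    n workers have finished, i.e. at the k-th order statistic.
    """
    if not 1 <= k <= len(times):
        raise ValueError("k must satisfy 1 <= k <= n")
    return sorted(times)[k - 1]


def simulate(rates, k, trials=10000, seed=0):
    """Monte Carlo estimate of mean completion time.

    Assumption (illustrative only): worker i's finishing time is
    Exponential(rates[i]). Returns (mean time waiting for the first k,
    mean time waiting for all n workers).
    """
    rng = random.Random(seed)
    first_k_total = 0.0
    all_n_total = 0.0
    for _ in range(trials):
        times = [rng.expovariate(r) for r in rates]
        first_k_total += kth_finish_time(times, k)
        all_n_total += max(times)  # an uncoded job waits for the slowest worker
    return first_k_total / trials, all_n_total / trials


if __name__ == "__main__":
    # Five workers; the third one (time 9.5) is a straggler.
    times = [3.0, 1.2, 9.5, 2.1, 1.8]
    print(kth_finish_time(times, k=3))  # first 3 of 5 finish at 2.1
    print(max(times))                   # waiting for all 5 takes 9.5
    print(simulate(rates=[1.0, 2.0, 0.5, 1.5, 1.0], k=3, trials=2000))
```

Because the $k$-th smallest time never exceeds the maximum, waiting for the first $k$ workers is always at least as fast per trial; the actual trade-off in coded machine learning comes from the extra load each worker carries, which this sketch deliberately leaves out.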
