Abstract
The recent success of Deep Learning (DL) in a broad range of AI services has led to a surging amount of DL workloads in production clusters. To support DL jobs at scale, the parameter server (PS) architecture is the most popular approach for distributing the computation across a compute cluster. Concurrent DL jobs, each consisting of PS tasks and worker tasks, are typically launched on available compute nodes by a cluster resource manager to ensure high cluster resource utilization. Because a PS must distribute model updates to every remote worker, its communication has a very large fan-out. We observe that network contention among colocated PSes can cause stragglers among workers, resulting in application performance degradation and resource under-utilization. To mitigate the straggler effect, we propose TensorLights, which introduces traffic prioritization at host NICs to manage traffic contention among PSes. We evaluate TensorLights experimentally and show that it effectively mitigates stragglers, improves the average completion time of DL applications by up to 31%, and increases resource utilization. TensorLights is highly practical as it provides these benefits without requiring changes to the DL software stack.
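The contention effect described above can be illustrated with a toy model. The following sketch is not the paper's implementation: it simply models two colocated PS flows sharing one host NIC, comparing fair (processor-sharing) bandwidth allocation against strict prioritization, the kind of host-level traffic prioritization TensorLights proposes. All function names, flow sizes, and the bandwidth figure are illustrative assumptions.

```python
# Toy model (illustrative assumption, not TensorLights code): two colocated
# PS push flows contend for one host NIC. Under fair sharing, both flows
# slow each other down, so both jobs' workers straggle; under strict
# priority, the higher-priority flow finishes as if it had the NIC alone.

def fair_share(sizes, bandwidth):
    """Processor-sharing completion times for concurrent flows.

    Each active flow gets an equal fraction of the NIC bandwidth;
    the share grows as flows finish.
    """
    remaining = dict(enumerate(sizes))
    t, done = 0.0, {}
    while remaining:
        rate = bandwidth / len(remaining)      # equal share per active flow
        _, smallest = min(remaining.items(), key=lambda kv: kv[1])
        dt = smallest / rate                   # time until next flow finishes
        t += dt
        for j in list(remaining):
            remaining[j] -= rate * dt
            if remaining[j] <= 1e-9:           # flow j completed
                done[j] = t
                del remaining[j]
    return [done[i] for i in range(len(sizes))]

def strict_priority(sizes, bandwidth):
    """Higher-priority flows (lower index) get the full NIC first."""
    t, out = 0.0, []
    for s in sizes:
        t += s / bandwidth
        out.append(t)
    return out

sizes = [100.0, 100.0]   # MB of model updates pushed by each colocated PS
bw = 10.0                # MB/s of host NIC bandwidth

print(fair_share(sizes, bw))       # [20.0, 20.0] -> both jobs straggle
print(strict_priority(sizes, bw))  # [10.0, 20.0] -> job 1 unblocked at 10 s
```

In this toy setting, prioritization leaves the last flow's completion time unchanged (20 s) but lets the first job's workers proceed at 10 s instead of 20 s, so the average completion time drops; this mirrors the intuition behind the reported improvement, though the paper's actual gains come from real cluster experiments.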