Abstract
In recent years, the pace of innovation in the field of machine learning (ML) has accelerated, and researchers in SysML have created algorithms and systems that parallelize ML training over multiple devices or computational nodes. As ML models become more structurally complex, many systems have struggled to provide consistent performance across a variety of models. In particular, the amount of knowledge and time required to map an appropriate distribution strategy to a model is usually underestimated when scaling ML up. Applying parallel training systems to complex models adds nontrivial development overhead on top of model prototyping, and often results in lower-than-expected performance. This thesis identifies and addresses research challenges in both the usability and the performance of parallel ML techniques and system implementations.

The first part of this thesis presents a simple design principle, adaptive parallelism, which applies suitable parallelization techniques to model building blocks (e.g., layers) according to their specific ML properties. Following this principle, we derive a series of optimizations and implementations, each targeting a different aspect of ML parallelization. We examine them and show that they boost the efficiency or scalability of ML training on clusters by 2-10x in their applicable scenarios.

Generalizing this methodology, the second part of this thesis formulates ML parallelization as an end-to-end optimization problem and seeks to solve it automatically for two broad paradigms of ML parallelization tasks: single-node dynamic batching and distributed ML parallelisms. We present principled representations to express these two classes of ML parallelisms, along with composable system architectures, Cavs and AutoDist, respectively. They enable rapid composition of parallelization strategies for unseen models, improve parallelization performance, and simplify parallel ML programming.

On top of them, the third part of this thesis presents an automatic parallelization framework, AutoSync, which optimizes synchronization strategies in data-parallel distributed training. AutoSync achieves high performance "out of the box": it navigates the space spanned by the proposed representation and automatically identifies synchronization strategies that report 1.2-1.6x speedups over existing hand-optimized systems, lowering the technical barrier of distributed ML and helping make it accessible to a larger community of users. Collectively, the techniques and systems developed in this thesis lead to a proof of concept and a prototype implementation of an end-to-end compiler system for large-scale ML training in distributed environments.