Abstract
Deep learning-based recommender models (DLRMs) have become an essential component of many modern recommender systems. Several companies are now building large compute clusters reserved for DLRM training, driving new interest in cost- and time-saving optimizations. The systems challenges faced in this setting are unique; while typical deep learning (DL) training jobs are dominated by model execution times, the most important factor in DLRM training performance is often online data ingestion. In this paper, we study real-world DLRM data processing pipelines taken from our compute cluster at Netflix to observe the performance impacts of online ingestion and to identify shortfalls in existing data pipeline optimizers. Our studies lead us to design a new solution for data pipeline optimization, InTuneX. InTuneX is designed for production-scale, multi-node recommender data pipelines, and it unifies and tackles the challenges of both intra- and inter-node pipeline optimization. We achieve this with a multi-agent reinforcement learning (RL) design that simultaneously optimizes node assignments at the cluster level and CPU assignments within nodes. Our experiments show that InTuneX can build optimized data pipeline configurations within minutes. We apply InTuneX to our cluster and find that it increases single-node data ingestion throughput by as much as 2.29X versus state-of-the-art optimizers, while improving the cost-efficiency of multi-node pipelines by 15-25%.