Deep learning-based recommender models (DLRMs) have become an essential component of many modern recommender systems. Several companies are now building large compute clusters reserved for DLRM training, driving new interest in cost- and time-saving optimizations. The systems challenges faced in this setting are unique; while typical deep learning (DL) training jobs are dominated by model execution times, the most important factor in DLRM training performance is often online data ingestion. In this paper, we study real-world DLRM data processing pipelines taken from our compute cluster at Netflix to observe the performance impact of online ingestion and to identify shortfalls in existing data pipeline optimizers. Our studies lead us to design a new solution for data pipeline optimization, InTuneX. InTuneX is designed for production-scale, multi-node recommender data pipelines. It unifies and tackles the challenges of both intra- and inter-node pipeline optimization. We achieve this with a multi-agent reinforcement learning (RL) design that simultaneously optimizes node assignments at the cluster level and CPU assignments within nodes. Our experiments show that InTuneX can build optimized data pipeline configurations within minutes. We apply InTuneX to our cluster and find that it increases single-node data ingestion throughput by as much as 2.29X versus state-of-the-art optimizers, while improving the cost-efficiency of multi-node pipelines by 15-25%.
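To make the two-level design described above more concrete, the following is a minimal, self-contained Python sketch: one agent picks how many data-loading nodes to use (the inter-node decision), a second agent picks how a node's CPU cores are split across pipeline stages (the intra-node decision), and both learn from a shared, cost-adjusted throughput reward. Everything here is an illustrative assumption rather than InTuneX's actual algorithm or interface: the stage names, the toy throughput and cost models, and the simple epsilon-greedy bandit update standing in for the paper's multi-agent RL method.

```python
import random

random.seed(0)

STAGES = ["read", "decode", "shuffle", "batch"]          # example pipeline stages (assumption)
RATE = {"read": 40.0, "decode": 15.0, "shuffle": 30.0, "batch": 50.0}  # toy samples/sec per core
NODE_CHOICES = [1, 2, 4, 8]                              # candidate node counts
CPU_BUDGET = 16                                          # cores available per node


def simulated_throughput(nodes, alloc):
    """Toy stand-in for measured ingestion throughput: the slowest stage bottlenecks each node."""
    per_node = min(alloc[s] * RATE[s] for s in STAGES)
    return nodes * per_node


def cost_penalty(nodes):
    """Toy cost model: each node adds a fixed cost."""
    return 50.0 * nodes


def cpu_splits(budget, stages):
    """Enumerate candidate CPU splits: each candidate gives one stage a double share of cores."""
    splits = []
    for heavy in stages:
        weights = {s: (2 if s == heavy else 1) for s in stages}
        total = sum(weights.values())
        alloc = {s: max(1, budget * w // total) for s, w in weights.items()}
        splits.append(tuple(sorted(alloc.items())))  # hashable action
    return splits


class EpsilonGreedyAgent:
    """Keeps a running value estimate per discrete action; explores with probability epsilon."""

    def __init__(self, actions, epsilon=0.2):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.values = {a: 0.0 for a in self.actions}
        self.counts = {a: 0 for a in self.actions}

    def act(self):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[a])

    def update(self, action, reward):
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


cluster_agent = EpsilonGreedyAgent(NODE_CHOICES)                  # inter-node decision
node_agent = EpsilonGreedyAgent(cpu_splits(CPU_BUDGET, STAGES))   # intra-node decision

for step in range(500):
    nodes = cluster_agent.act()
    split = node_agent.act()
    # Both agents share one reward: cost-adjusted ingestion throughput.
    reward = simulated_throughput(nodes, dict(split)) - cost_penalty(nodes)
    cluster_agent.update(nodes, reward)
    node_agent.update(split, reward)

print("chosen node count:", max(cluster_agent.values, key=cluster_agent.values.get))
print("chosen CPU split:", dict(max(node_agent.values, key=node_agent.values.get)))
```

The shared reward is what ties the two levels together in this sketch: the cluster-level choice only pays off when the intra-node CPU split keeps per-node throughput high, which mirrors the abstract's claim that inter- and intra-node optimization must be handled jointly.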