Abstract

The practice of collecting big performance data has changed how infrastructure providers model and manage their systems over the past decade. There has been a methodological shift from domain-knowledge-based white-box models, e.g., queueing [1] and simulation [2], to black-box data-driven models, e.g., machine learning. This game change in resource management, from workload characterization [3] and dependability prediction [4,5] to sprinting policies [6], can be seen at major IT infrastructure providers such as IBM and Google. While applying higher-order deep neural networks shows promise in predicting performance [4,5], the scalability of such an approach is often limited. A plethora of prior work focuses on deriving complex and highly accurate models, such as deep neural networks, while overlooking the constraints of computational efficiency and scalability; their applicability to resource management problems in production systems is thus hindered. A crucial aspect of deriving accurate and scalable predictive performance models lies in jointly leveraging domain expertise, white-box models, and black-box models. Examples include scalable ticket management services at IBM [4] and job failure prediction at Google [5]. Model-driven computation sprinting [6] dynamically scales the frequency and the allocation of computing cores based on grey-box models that outperform deep neural networks. The aforementioned case studies strongly argue for the importance of combining domain-driven and data-driven models. At the same time, various acceleration techniques have been developed to reduce the computational overhead of (deep) machine learning models in small-scale and isolated testbeds. Managing the performance of clusters that are dominated by machine learning workloads remains challenging and calls for novel solutions. SlimML [9] accelerates ML model training by processing only the critical data set at a slight cost in accuracy, whereas Dias [7] simultaneously explores data dropping and frequency sprinting for ML clusters that support training workloads of multiple priorities. These studies point out the complexity of managing the accuracy-efficiency tradeoff of ML jobs in a cluster-like environment, where jobs interfere with each other by sharing the underlying resources and common data sets.
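To make the grey-box idea concrete, the following is a minimal illustrative sketch, not taken from the paper or the cited systems: a white-box M/M/1 queueing estimate of latency is combined with a black-box correction fitted to (hypothetical, synthetically generated) measurements. All function names, features, and numbers are assumptions made for illustration only.

```python
# Illustrative sketch only: grey-box latency prediction as
# white-box queueing estimate + learned black-box residual.
import numpy as np

def whitebox_latency(arrival_rate, service_rate):
    """Domain-knowledge estimate: mean response time of an M/M/1 queue."""
    assert arrival_rate < service_rate, "queue must be stable"
    return 1.0 / (service_rate - arrival_rate)

# Hypothetical training data: (arrival_rate, service_rate) -> measured latency.
rng = np.random.default_rng(0)
lam = rng.uniform(1.0, 8.0, size=200)          # arrival rates
mu = lam + rng.uniform(2.0, 6.0, size=200)     # service rates (keeps queue stable)
base = np.array([whitebox_latency(l, m) for l, m in zip(lam, mu)])
measured = base * (1.0 + 0.3 * (lam / mu))     # unmodelled interference effect
measured += rng.normal(0.0, 0.01, size=200)    # measurement noise

# Black-box part: fit a linear residual model on simple features (least squares).
X = np.column_stack([lam, mu, lam / mu, np.ones_like(lam)])
coef, *_ = np.linalg.lstsq(X, measured - base, rcond=None)

def greybox_latency(arrival_rate, service_rate):
    """White-box estimate plus learned correction."""
    feats = np.array([arrival_rate, service_rate,
                      arrival_rate / service_rate, 1.0])
    return whitebox_latency(arrival_rate, service_rate) + feats @ coef

print(greybox_latency(4.0, 8.0))
```

The white-box term keeps the model interpretable and cheap to evaluate, while the small learned correction absorbs effects the analytical model misses, which is the kind of accuracy-versus-efficiency balance the abstract argues for.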
