Most large-scale ML implementations scale to large amounts of data by utilizing multiple servers or virtual machines (VMs) that iteratively compute model updates on local data and periodically synchronize them. Due to the complexity of managing the resulting computing infrastructure, many companies run their ML jobs on external cloud providers' servers. However, cloud resources can be expensive, particularly for large ML jobs with long runtimes. A particularly popular method to limit the costs of training ML jobs is to utilize preemptible cloud instances. These may be interrupted at the cloud provider's discretion, but they are significantly (up to 90%) cheaper than conventional on-demand instances. Most studies of these ML methods, however, assume the availability of large datasets at training time. In practice, training data may arrive at irregular intervals and models may be trained online as new data samples arrive, e.g., when monitoring data from IoT sensors. While some software frameworks like Apache Kafka can feed online data arrivals to ML algorithms, they provide little insight into the resulting costs of ML training. We extend prior work on provisioning preemptible instances, which assumes readily available pools of data, to run online ML on incoming datastreams; this presents new challenges because data arrivals must be handled carefully. We design, analyze, and optimize DOLL, which to the best of our knowledge is the first system that provides provable performance guarantees for Distributed OnLine Learning over preemptible instances.

Research Challenges and Our Contributions: When pools of data are readily available, the bottleneck to distributed ML training often lies in the time required for each VM to compute its model updates. In our scenario, however, the arrival rate of incoming data may also bottleneck data processing. An intuitive strategy would then be for each VM to process each data point as it arrives.
However, since arrivals at different VMs may not be coordinated, synchronizing the model parameters at each VM between data arrivals may introduce additional delays, while asynchronous SGD methods can lead to slow convergence [1]. DOLL uses a batching-and-grouping process to limit the synchronization delay; this process naturally realizes traditional mini-batch SGD and thus provides provable model convergence guarantees.

Handling online data arrivals becomes particularly challenging when we use preemptible instances to compute model updates. Existing methods utilizing preemptible instances for ML jobs largely focus on mitigating training interruptions [2] and their effects on model convergence [3]. When training on datastreams, we face the additional challenge that interruptions pause the data arrival process, which impedes the rate at which we can compute model updates and thus slows model convergence. Thus, one should ensure that preemptions do not happen "too often," e.g., by computing some updates on on-demand instances. Our work is the first to optimize the number of preemptible VMs used while demonstrating that ML convergence guarantees can still be met.
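The abstract describes DOLL's batching-and-grouping step only at a high level. The following minimal Python sketch (all names and parameters are our own illustration, not DOLL's actual API or algorithm) shows the underlying idea: each worker buffers asynchronously arriving samples, and an update is applied only once every worker holds a full local batch, which recovers a synchronized mini-batch SGD step.

```python
import random

def grad(w, batch):
    # Average gradient of 0.5 * (w*x - y)^2 over a local batch.
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def run(num_workers=4, batch_size=8, arrivals=5000, lr=0.5, seed=0):
    """Illustrative simulation (not DOLL itself): asynchronous arrivals
    buffered into fixed-size local batches, synchronized mini-batch SGD."""
    rng = random.Random(seed)
    w = 0.0                                  # 1-D linear model y = w * x
    buffers = [[] for _ in range(num_workers)]
    for _ in range(arrivals):
        # Arrivals are uncoordinated: each tick, a random worker receives
        # one sample drawn from the target function y = 3x.
        k = rng.randrange(num_workers)
        x = rng.uniform(-1.0, 1.0)
        buffers[k].append((x, 3.0 * x))
        # Synchronization barrier: update only when every worker has a
        # full batch, so the step is a traditional mini-batch SGD step
        # averaged across all workers' local gradients.
        if all(len(b) >= batch_size for b in buffers):
            g = sum(grad(w, b[:batch_size]) for b in buffers) / num_workers
            w -= lr * g
            buffers = [b[batch_size:] for b in buffers]  # keep leftovers
    return w
```

Because updates wait on the slowest worker's buffer, the barrier bounds how stale any worker's parameters can be, at the cost of idling faster workers; this is the synchronization-delay trade-off the batching process is designed to limit.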