Spur: Mitigating Slow Instances in Large-Scale Streaming Pipelines

Ke Wang, Avrilia Floratou, Ashvin Agrawal, Daniel Musgrave

doi:10.1145/3318464.3386142

Abstract

Bing's monetization pipeline is one of the largest and most critical streaming workloads deployed in Microsoft's internal data lake. The pipeline runs 24/7 at a scale of 3500 YARN containers and is required to meet a Service Level Objective (SLO) of low tail latency. In this paper, we highlight some of the unique challenges imposed by this large scale of operation: other concurrent workloads sharing the cluster may cause random performance deterioration; unavailability of external dependencies may cause temporary stalls in the pipeline; and resource scarcity in the underlying resource manager may cause arbitrarily long delays or rejection of container allocation requests. Weathering these challenges requires specially tailored dynamic control policies that react to these issues as they arise. We focus on the problem of reducing tail latency, i.e., the 99th percentile (p99), by detecting and mitigating slow instances through speculative replication. We show that widely used approaches do not satisfactorily solve this issue at our scale. A conservative approach is hesitant to acquire additional resources, reacts too slowly to changes in the environment, and therefore achieves little improvement in p99 latency. On the other hand, an aggressive approach overwhelms the underlying resource manager with unnecessary resource requests and paradoxically worsens the p99 latency. Our proposed approach, Spur, is designed for this challenging environment. It combines aggressive detection of slow instances with smart pruning of false positives to achieve a far better trade-off between these conflicting objectives. Using only 0.5% additional resources (similar to the conservative approach), we demonstrate a 10% to 38% improvement in tail latency compared to both the conservative and the aggressive approaches.

Full Text