Data flow graphs are a popular program representation in machine learning, big data analytics, signal processing, and, increasingly, networking, where graph nodes correspond to processing primitives and graph edges describe the flow of data between them. To improve CPU cache locality and exploit data-level parallelism, nodes usually process data in batches. Batchy is a scheduler for data-flow-graph-based packet processing engines that uses controlled queuing to reconstruct fragmented batches inside a data flow graph while meeting strict Service-Level Objectives (SLOs). Earlier work showed that Batchy yields up to 10x performance improvement in real-life use cases by maximally exploiting batch-processing gains. Batchy, however, is fundamentally restricted to single-threaded execution. In this paper, we generalize Batchy to parallel execution on multiple CPU cores. We extend the analytical model to the parallel setting and present a primal decomposition framework in which each core runs an unmodified Batchy controller to schedule batch processing on a subset of the data flow graph, orchestrated by a master controller that distributes the delay SLOs across the cores using subgradient search. Evaluations on a real software switch provide experimental evidence that our decomposition framework produces a 2.5x performance improvement while accurately satisfying delay SLOs that are otherwise infeasible with single-core Batchy.
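To make the master-level coordination concrete, below is a minimal Python sketch of primal decomposition with projected subgradient search over per-core delay budgets. Everything in it is an illustrative assumption rather than the paper's implementation: `solve_subproblem` is a hypothetical stand-in for a per-core Batchy controller (modeled here by a simple cost that falls as the delay budget loosens), and `COST`, `project`, and the step-size schedule are made up for the example.

```python
# Sketch of primal decomposition: a master splits an end-to-end delay
# SLO into per-core budgets and refines the split by subgradient search.
# The per-core cost model below is an illustrative assumption, NOT the
# paper's model or a real Batchy controller.

DELAY_SLO = 100.0             # end-to-end delay budget D to split across cores
D_MIN = 1.0                   # smallest per-core budget we allow
COST = [4.0, 1.0, 9.0]        # hypothetical per-core "difficulty" parameters

def solve_subproblem(k, budget):
    """Stand-in for core k's Batchy controller: given a delay budget,
    return (achieved cost, subgradient of the cost w.r.t. the budget).
    Cost falls as 1/budget: a looser budget permits larger batches and
    hence a lower per-packet processing cost."""
    return COST[k] / budget, -COST[k] / budget**2

def project(d, total, lo):
    """Project budgets back onto {d_k >= lo, sum d_k = total}."""
    excess = [max(x - lo, 0.0) for x in d]
    s = sum(excess)
    room = total - lo * len(d)
    if s == 0.0:                       # degenerate case: split room evenly
        return [lo + room / len(d)] * len(d)
    return [lo + e * room / s for e in excess]

def master(num_cores, iters=300, step=2000.0):
    d = [DELAY_SLO / num_cores] * num_cores   # start from an even split
    for t in range(1, iters + 1):
        costs, grads = zip(*(solve_subproblem(k, d[k])
                             for k in range(num_cores)))
        # Diminishing-step subgradient descent on the per-core budgets,
        # followed by projection to keep the split feasible.
        d = [d[k] - (step / t) * grads[k] for k in range(num_cores)]
        d = project(d, DELAY_SLO, D_MIN)
    return d, sum(costs)

budgets, total_cost = master(len(COST))
print("per-core delay budgets:", [round(b, 2) for b in budgets])
print("total cost:", round(total_cost, 4))
```

The diminishing step size (`step / t`) is the standard choice for subgradient methods, which need it to converge since the objective need not be differentiable; the projection keeps the per-core budgets summing to the end-to-end SLO so the coupling constraint always holds.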