Conducting federated learning across distributed sites with In-Band Network Telemetry (INT) based data collection faces critical challenges, including control decisions made at different frequencies, convergence of the models being trained, and resource provisioning that is coupled over time. To study this problem, we formulate a non-linear mixed-integer program to optimize the long-term INT overhead, resource cost, and federated learning cost. We then design polynomial-time online algorithms that solve this problem on the fly using only observable inputs, featuring laziness-aware resource adaptation, online-learning-based INT flow selection and model aggregation control, and expectation-preserving randomized dependent rounding. We rigorously prove a parameterized-constant competitive ratio for our approach against the offline optimum, and show that the time-averaged constraint violation vanishes in the long run. With extensive trace-driven evaluations, we confirm the superiority of our approach over alternative approaches in reducing total cost, with the real-time cost reduced by 34% on average, and the efficacy of our trained models for solving real machine learning problems.