Abstract

With DNN turning into the backbone of AI cloud services and propelling the emergence of INFerence-as-a-Service (INFaaS), DNN-specific accelerators have become the indispensable components of cloud inference systems. Due to the conservative “one-task-at-a-time” working mode and deadline blindness of those accelerators, implementing multi-tenancy that aims to improve the cost-effectiveness and meet SLA requirements is intractable. Recent studies including the temporal and spatial approaches, employ manifold scheduling mechanisms and sophisticated architecture innovations to address the challenge. However, these researches either still neglect the deadline awareness or render inevitable and expensive hardware overheads such as switches and storage. In this paper, we present <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">Cooperative and Deadline-aware Multi-Systolic-Array scheduling</i> (CD-MSA), a low-cost solution for the cloud inference that utilizes the real time mechanism and task-level parallelism to enable efficient multi-tenancy. Based on our preemptive multi-systolic-array accelerator architecture supporting the simultaneous task co-location, we first construct a fine-grained DNN execution model to lay the groundwork for the lightweight preemption. Second, we design a cooperative, deadline- and laxity-aware scheduler in conjunction with an efficient schedulability test method for better QoS guarantee without introducing additional hardware cost. Finally, to further promote the overall throughput, we propose <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">dynamic task fusion</i> , a software approach that fuses different tasks into the logically “multi-threading” tasks at runtime. We compare CD-MSA with several state-of-the-art researches across three multi-DNN workloads. The evaluation results show CD-MSA improves the latency-bounded throughput, SLA satisfaction rate and weighted system throughput by up to 62%, 63% and 27%, respectively.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call