Abstract
With DNNs becoming the backbone of AI cloud services and propelling the emergence of INFerence-as-a-Service (INFaaS), DNN-specific accelerators have become indispensable components of cloud inference systems. Because of the conservative “one-task-at-a-time” working mode and deadline blindness of these accelerators, implementing multi-tenancy that improves cost-effectiveness while meeting SLA requirements is intractable. Recent studies, including temporal and spatial approaches, employ manifold scheduling mechanisms and sophisticated architectural innovations to address this challenge. However, they either still neglect deadline awareness or incur unavoidable and expensive hardware overheads such as switches and storage. In this paper, we present Cooperative and Deadline-aware Multi-Systolic-Array scheduling (CD-MSA), a low-cost solution for cloud inference that exploits real-time mechanisms and task-level parallelism to enable efficient multi-tenancy. Building on our preemptive multi-systolic-array accelerator architecture, which supports simultaneous task co-location, we first construct a fine-grained DNN execution model that lays the groundwork for lightweight preemption. Second, we design a cooperative, deadline- and laxity-aware scheduler together with an efficient schedulability test for a stronger QoS guarantee without introducing additional hardware cost. Finally, to further raise overall throughput, we propose dynamic task fusion, a software approach that fuses different tasks into logically “multi-threaded” tasks at runtime. We compare CD-MSA with several state-of-the-art designs across three multi-DNN workloads.
The evaluation results show that CD-MSA improves latency-bounded throughput, SLA satisfaction rate, and weighted system throughput by up to 62%, 63%, and 27%, respectively.
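The abstract does not spell out the scheduler's exact selection policy, but a deadline- and laxity-aware scheduler is commonly built around least-laxity-first ordering: a task's laxity is its time to deadline minus its remaining work, and the runnable task with the smallest laxity is dispatched next. The sketch below illustrates that generic idea only; the task names, fields, and helper functions are hypothetical and not taken from CD-MSA.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    deadline: float   # absolute deadline (same clock as `now`)
    remaining: float  # estimated remaining execution time

def laxity(task: Task, now: float) -> float:
    """Laxity = slack before the deadline after finishing remaining work.
    Smaller laxity means the task is more urgent; negative laxity means
    the deadline can no longer be met."""
    return (task.deadline - now) - task.remaining

def pick_next(ready: list[Task], now: float) -> Task:
    """Least-laxity-first: dispatch the ready task with the least slack."""
    return min(ready, key=lambda t: laxity(t, now))

# Hypothetical inference requests with per-request SLA deadlines.
ready = [
    Task("resnet", deadline=10.0, remaining=4.0),  # laxity 6
    Task("bert",   deadline=7.0,  remaining=5.0),  # laxity 2
    Task("vgg",    deadline=20.0, remaining=3.0),  # laxity 17
]
print(pick_next(ready, now=0.0).name)  # -> bert
```

A simple schedulability check in this setting is that every ready task keeps non-negative laxity; once a task's laxity goes negative, its deadline is already unreachable.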
Published in: IEEE Transactions on Parallel and Distributed Systems