Abstract

Due to the broad success of deep learning, many CPU-centric artificial intelligence computing systems employ specialized devices such as GPUs, FPGAs, and ASICs, which can be collectively called Deep Learning Processing Units (DLPUs), to process computation-intensive deep learning tasks. The separation between scalar control operations mapped onto CPUs and vector computation operations mapped onto DLPUs causes frequent and costly interactions between CPUs and DLPUs, leading to the Interaction Wall. Moreover, increasing algorithm complexity and DLPU computation speed will further aggravate the interaction wall substantially. To break the interaction wall, we propose a novel DLPU-centric deep learning computing system consisting of an exception-oriented programming (EOP) model and the architectural support of the CPULESS DLPU. The EOP model processes the scalar control operations of a deep learning task as exception handlers, to avoid as far as possible stalling the crucial and dominant vector computation operations. Together with the CPULESS DLPU, which integrates a scalar processing unit (SPU) for scalar control operations and a parallel processing unit (PPU) for vector computation operations into a fused pipeline, the proposed DLPU-centric system can cost-effectively leverage the EOP model to execute the two kinds of operations simultaneously without disturbing each other. Compared with a state-of-the-art commodity CPU-centric system with a discrete V100 GPU connected via a PCIe bus, experimental results show that our DLPU-centric system achieves 10.30× better performance and 92.99 percent energy savings. Moreover, compared with a CPU-centric version of the DLPU system where the SPU serves as the host with an integrated PPU, the proposed DLPU-centric system still achieves 15.60 percent better performance from avoided interactions.
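
The intuition behind the interaction wall can be illustrated with a toy timing model. The sketch below is not the evaluation methodology of this work; the function names, the per-step cost parameters, and the assumption that scalar control fully overlaps with vector computation are all hypothetical simplifications chosen for illustration.

```python
# Toy analytic model of the "interaction wall" (illustrative assumption only;
# all parameter names and values are hypothetical, not taken from the paper).

def cpu_centric_time(n_steps, t_vector, t_scalar, t_interact):
    """CPU-centric: each step serializes the vector kernel, a CPU<->DLPU
    round trip, and the scalar control work on the host."""
    return n_steps * (t_vector + t_interact + t_scalar)


def dlpu_centric_time(n_steps, t_vector, t_scalar):
    """DLPU-centric (idealized): scalar control runs like a handler alongside
    the vector pipeline, so a step costs whichever of the two is longer."""
    return n_steps * max(t_vector, t_scalar)


def speedup(n_steps, t_vector, t_scalar, t_interact):
    """Ratio of CPU-centric time to DLPU-centric time under this toy model."""
    baseline = cpu_centric_time(n_steps, t_vector, t_scalar, t_interact)
    return baseline / dlpu_centric_time(n_steps, t_vector, t_scalar)


if __name__ == "__main__":
    # Arbitrary time units; only the ratios matter in this sketch.
    print(speedup(n_steps=100, t_vector=10.0, t_scalar=2.0, t_interact=8.0))
    # Halving the vector time (a faster DLPU) raises the relative cost of
    # interactions, so the modeled gap between the two designs widens.
    print(speedup(n_steps=100, t_vector=5.0, t_scalar=2.0, t_interact=8.0))
```

Under these assumptions, the fixed interaction cost accounts for a growing share of each step as vector computation gets faster, which is the qualitative trend described above. A real system would also have to account for partial overlap, handler latency, and memory traffic, none of which this sketch models.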

