Deep Learning Inferencing with High-performance Hardware Accelerators

Luke Kljucaric,Alan D George

doi:10.1145/3594221

Luke Kljucaric, Alan D George

Open Access

PDF Available

https://doi.org/10.1145/3594221

Copy DOI

Export

Save

Cite

Abstract
Full-Text PDF
Similar Papers

Abstract

Listen

As computer architectures continue to integrate application-specific hardware, it is critical to understand the relative performance of devices for maximum app acceleration. The goal of benchmarking suites, such as MLPerf for analyzing machine learning (ML) hardware performance, is to standardize a fair comparison of different hardware architectures. However, there are many apps that are not well represented by these standards that require different workloads, such as ML models and datasets, to achieve similar goals. Additionally, many apps, like real-time video processing, are focused on latency of computations rather than strictly on throughput. This research analyzes multiple compute architectures that feature ML-specific hardware on a case study of handwritten Chinese character recognition. Specifically, AlexNet and a custom version of GoogLeNet are benchmarked in terms of their streaming latency and maximum throughput for optical character recognition. Considering that these models are composed of fundamental neural network operations yet architecturally different from each other, these models can stress devices in different yet insightful ways that generalizations of the performance of other models can be drawn from. Many devices featuring ML-specific hardware and optimizations are analyzed including Intel and AMD CPUs, Xilinx and Intel FPGAs, NVIDIA GPUs, and Google TPUs. Overall, ML-oriented hardware added to the Intel Xeon CPUs helps to boost throughput by 3.7× and to reduce latency by up to 34.7×, which makes the latency of Intel Xeon CPUs competitive on more parallel models. The TPU devices were limited in terms of throughput due to large data transfer times and not competitive in terms of latency. The FPGA frameworks showcase the lowest latency on the Xilinx Alveo U200 FPGA achieving 0.48 ms on AlexNet using Mipsology Zebra and 0.39 ms on GoogLeNet using Vitis-AI. Through their custom acceleration datapaths coupled with high-performance SRAM, the FPGAs are able to keep critical model data closer to processing elements for lower latency. The massively parallel and high-memory GPU devices with Tensor Core accelerators achieve the best throughput. The NVIDIA Tesla A100 GPU showcases the highest throughput at 42,513 and 52,484 images/second for AlexNet and GoogLeNet, respectively. 1

Full Text