Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference.

Benjamin Hawks, Nhan Tran, Javier Duarte, Yaman Umuroglu, Nicholas J Fraser, Alessandro Pappalardo

doi:10.3389/frai.2021.676564

Abstract

Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning (removing insignificant synapses) and quantization (reducing the precision of the calculations). In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra-low-latency applications targeting high energy physics use cases. Techniques developed for this study have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term quantization-aware pruning, and the effect of techniques like regularization, batch normalization, and different pruning schemes on performance, computational complexity, and information content metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, in terms of computational efficiency, quantization-aware pruning typically performs as well as or better than other neural architecture search techniques like Bayesian optimization. Surprisingly, while networks with different training configurations can have similar performance for the benchmark application, the information content in the network can vary significantly, affecting its generalizability.
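To make the two operations named in the abstract concrete, the following is a minimal NumPy sketch of magnitude-based pruning followed by uniform weight quantization. It is an illustration only, not the authors' training code: the function names (`magnitude_prune_mask`, `fake_quantize`, `prune_and_quantize`), the symmetric per-tensor quantization scheme, and the choice of magnitude as the pruning criterion are all assumptions made for this example.

```python
import numpy as np


def magnitude_prune_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest-magnitude weights.

    Hypothetical helper: weights tied with the threshold are also dropped.
    """
    k = int(round(sparsity * weights.size))
    if k == 0:
        return np.ones_like(weights)
    # The k-th smallest absolute value serves as the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)


def fake_quantize(weights, bits):
    """Round weights onto a symmetric signed grid with `bits` bits (bits >= 2).

    The result is returned as floats, so it only simulates low-precision storage.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(weights))
    if max_abs == 0:
        return weights.copy()
    scale = max_abs / qmax
    return np.clip(np.round(weights / scale), -qmax, qmax) * scale


def prune_and_quantize(weights, sparsity, bits):
    """Apply the pruning mask first, then quantize the surviving weights."""
    mask = magnitude_prune_mask(weights, sparsity)
    return fake_quantize(weights * mask, bits), mask


# Example: prune half of a small weight matrix and quantize the rest to 4 bits.
w = np.array([[0.1, -0.8, 0.3, 1.0],
              [-0.05, 0.6, -0.2, 0.4]])
w_q, mask = prune_and_quantize(w, sparsity=0.5, bits=4)
```

In a quantization-aware training loop, a step like `fake_quantize` would typically be applied in the forward pass while gradients update full-precision weights, and the mask would be recomputed or tightened between training rounds. Those details are not given in the abstract, so they are left out of the sketch.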

Highlights

Efficient implementations of machine learning (ML) algorithms provide a number of advantages for data processing both on edge devices and at massive data centers.
In this study, we explored efficient neural network (NN) implementations by coupling pruning and quantization at training time.
The study demonstrates that, for our task, pruning and quantization-aware training (QAT) are complementary and can be used in concert.

Summary

Introduction

Efficient implementations of machine learning (ML) algorithms provide a number of advantages for data processing both on edge devices and at massive data centers. These include reducing the latency of neural network (NN) inference, increasing the throughput, and reducing power consumption or the use of other hardware resources like memory. During the ML algorithm design stage, the computational burden of NN inference can be reduced by eliminating nonessential calculations through a modified training procedure. We study efficient NN design for an ultra-low latency, resource-constrained particle physics application. The classification task is to identify radiation patterns that arise from different elementary particles at sub-microsecond latency. While our application domain emphasizes low latency, the generic techniques we develop are broadly applicable.

Journal: Frontiers in Artificial Intelligence
Publication Date: Jul 9, 2021
Citations: 23
License type: CC BY 4.0
