Abstract

Layer-wise magnitude-based pruning (LMP) is a popular method for Deep Neural Network (DNN) compression. By pruning connections in the network, it can reduce the latency of DNN inference, which promotes the application of DNNs to tasks with real-time operation requirements, such as self-driving vehicles and video detection and tracking. However, previous methods mainly use the compression rate as a proxy for latency, without explicitly accounting for latency when training the compressed network. This paper presents a new LMP method, namely Multi-objective Magnitude-based Latency-Aware Pruning (MMLAP). MMLAP accounts for latency directly and incorporates a novel multi-objective evolutionary algorithm to optimize both the accuracy of a DNN and its latency when designing compressed networks, i.e., when tuning the hyper-parameters of LMP. Empirical studies show that MMLAP is competitive with well-established LMP methods and demonstrate the value of multi-objective optimization in yielding compressed networks that are Pareto-optimal in terms of accuracy and latency.
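The paper's exact procedure is not reproduced here. As a rough illustration only, the following minimal Python sketch shows the two ingredients the abstract names: layer-wise magnitude-based pruning (zeroing the smallest-magnitude fraction of weights in each layer, where the per-layer rates are the LMP hyper-parameters a search like MMLAP's would tune) and a Pareto non-dominated filter over (error, latency) pairs, as a multi-objective optimizer would use to rank candidate compressed networks. All function names and the candidate representation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lmp_prune(weights, rates):
    """Layer-wise magnitude pruning (illustrative sketch, not the paper's code).

    weights: list of per-layer weight arrays.
    rates:   per-layer pruning rates in [0, 1] -- the LMP hyper-parameters
             that a multi-objective search such as MMLAP's would tune.
    Zeroes the smallest-magnitude fraction rates[i] of weights in layer i.
    """
    pruned = []
    for w, r in zip(weights, rates):
        k = int(r * w.size)
        mask = np.ones(w.size, dtype=bool)
        if k > 0:
            # Indices of the k smallest-magnitude weights in this layer.
            idx = np.argpartition(np.abs(w).ravel(), k - 1)[:k]
            mask[idx] = False
        pruned.append(w * mask.reshape(w.shape))
    return pruned

def pareto_front(points):
    """Keep the non-dominated (error, latency) pairs; lower is better in both."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)]

# Toy usage: each candidate compressed network is evaluated as
# (validation error, measured latency); only Pareto-optimal trade-offs remain.
candidates = [(0.08, 12.0), (0.10, 9.0), (0.09, 15.0), (0.12, 8.5)]
print(pareto_front(candidates))  # -> [(0.08, 12.0), (0.10, 9.0), (0.12, 8.5)]
```

Note that the latency values here stand in for direct hardware measurements; the abstract's point is precisely that such measured latency, rather than the compression rate, enters the objective.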
