Abstract

This paper presents a novel network compression framework, Kernel Quantization (KQ), which aims to efficiently convert any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version without significant performance loss. Unlike existing methods that struggle to reduce the bit-length of individual weights, KQ improves the compression ratio by taking the convolution kernel as the quantization unit. Inspired by the evolution from weight pruning to filter pruning, we propose to quantize at both the kernel and weight levels. Instead of representing each weight parameter with a low-bit index, we learn a kernel codebook and replace all kernels in the convolution layers with their corresponding low-bit indexes. Thus, KQ represents the weight tensors of the convolution layers with low-bit indexes and a kernel codebook of limited size, which enables a significant compression ratio. We then apply 6-bit parameter quantization to the kernel codebook to further reduce redundancy. Extensive experiments on the ImageNet classification task show that KQ needs, on average, 1.05 and 1.62 bits to represent each parameter in the convolution layers of VGG and ResNet18, respectively, and achieves a state-of-the-art compression ratio with little accuracy loss.
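
To make the kernel-level idea concrete, the sketch below clusters all 3×3 kernels of one convolution layer into a shared codebook and stores only a low-bit index per kernel. It is a minimal illustration assuming k-means clustering and a 256-entry codebook; the function names and sizes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of kernel-level quantization (illustrative, not the authors' code).
# Assumes a conv weight tensor of shape (out_ch, in_ch, 3, 3) and k-means clustering.
import numpy as np
from sklearn.cluster import KMeans

def quantize_kernels(weight, codebook_size=256, seed=0):
    """Cluster all kernels of one conv layer into a shared codebook.

    Returns (codebook, indexes) such that codebook[indexes], reshaped back,
    approximates the original weight tensor.
    """
    out_ch, in_ch, kh, kw = weight.shape
    kernels = weight.reshape(out_ch * in_ch, kh * kw)          # one row per kernel
    km = KMeans(n_clusters=codebook_size, n_init=4, random_state=seed).fit(kernels)
    codebook = km.cluster_centers_                              # (K, kh*kw) centroids
    indexes = km.labels_.astype(np.uint16)                      # low-bit index per kernel
    return codebook, indexes

def dequantize(codebook, indexes, shape):
    """Rebuild an approximate weight tensor from the codebook and indexes."""
    out_ch, in_ch, kh, kw = shape
    return codebook[indexes].reshape(out_ch, in_ch, kh, kw)

# Toy usage: a random "layer" with 128x64 kernels of size 3x3.
w = np.random.randn(128, 64, 3, 3).astype(np.float32)
cb, idx = quantize_kernels(w, codebook_size=256)
w_hat = dequantize(cb, idx, w.shape)
print("reconstruction MSE:", float(((w - w_hat) ** 2).mean()))
```

In this toy setting each 9-parameter kernel is replaced by a single 8-bit index, which is the intuition behind dropping below the 1-bit-per-weight barrier of conventional quantization.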

Highlights

  • In recent years, deep convolutional neural networks (CNNs) have achieved astonishing success in a wide range of computer vision tasks, such as image classification [1]–[3], semantic segmentation [4], action recognition [5], video restoration [6], [7], and computer-aided diagnosis [8], [9].

  • The promising results of CNNs are mainly attributed to their massive numbers of learnable parameters, which benefit from abundant annotated data and improvements in computing platforms.

  • The proposed method is applied to popular CNN architectures and achieves a significant compression ratio while retaining better accuracy than conventional network quantization methods.


Summary

INTRODUCTION

Deep convolutional neural networks (CNNs) have achieved astonishing success in a wide range of computer vision tasks, such as image classification [1]–[3], semantic segmentation [4], action recognition [5], video restoration [6], [7], and computer-aided diagnosis [8], [9]. Conventional network quantization still needs at least one bit to represent each parameter, which imposes a theoretical compression ratio limit of 32 times. We instead learn a kernel codebook and replace each kernel with a low-bit index, and then apply 6-bit quantization to the kernel codebook to further compress the model while preserving the variety of kernels. With these two steps, we can significantly compress the CNN beyond the theoretical limit of weight-level quantization and achieve accuracy comparable to the full-precision model. The proposed method is applied to popular CNN architectures and achieves a significant compression ratio (on average 1.05 and 1.62 bits per parameter in the convolution layers of VGG and ResNet, respectively) while retaining better accuracy than conventional network quantization methods.
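
A back-of-the-envelope calculation illustrates how the average bit count per parameter can approach or drop below one bit once kernels share a codebook: each kernel costs one low-bit index, and the 6-bit codebook is amortized over the whole layer. The layer size and 256-entry codebook below are illustrative assumptions, not figures from the paper.

```python
# Rough estimate of average bits per convolution parameter under kernel-level
# quantization; sizes below are illustrative assumptions.
import math

def avg_bits_per_param(num_kernels, kernel_size=9, codebook_size=256, codebook_bits=6):
    """Index bits per kernel plus 6-bit codebook storage, amortized per weight."""
    index_bits = num_kernels * math.ceil(math.log2(codebook_size))
    codebook_storage = codebook_size * kernel_size * codebook_bits
    total_params = num_kernels * kernel_size
    return (index_bits + codebook_storage) / total_params

# Example: a 512->512 conv layer with 3x3 kernels (512 * 512 kernels).
bits = avg_bits_per_param(num_kernels=512 * 512)
print(f"~{bits:.2f} bits per parameter "
      f"(vs. 32 for full precision, compression ratio ~{32 / bits:.0f}x)")
```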

RELATED WORKS
KERNEL-LEVEL QUANTIZATION
EXPERIMENTS
Findings
ABLATION STUDIES