Compact and Computationally Efficient Representation of Deep Neural Networks

Simon Wiedemann, Wojciech Samek, Klaus-Robert Müller

doi:10.1109/tnnls.2019.2910073

Abstract

At the core of any inference procedure in deep neural networks are dot product operations, which are the component that requires the highest computational resources. For instance, deep neural networks such as VGG-16 require up to 15 G operations to perform the dot products present in a single forward pass, which results in significant energy consumption and thus limits their use in resource-limited environments, e.g., on embedded devices or smartphones. One common approach to reducing the complexity of inference is to prune and quantize the weight matrices of the neural network. Usually, this results in matrices with low entropy, as measured relative to the empirical probability mass distribution of their elements. To exploit such matrices efficiently, one usually relies on, inter alia, sparse matrix representations. However, most of these common matrix storage formats make strong statistical assumptions about the distribution of the elements and therefore cannot efficiently represent the entire set of matrices that exhibit low-entropy statistics (and thus the entire set of compressed neural network weight matrices). In this paper, we address this issue and present new efficient representations for matrices with low-entropy statistics. Like sparse matrix data structures, these formats exploit the statistical properties of the data in order to reduce size and execution complexity. Moreover, we show that the proposed data structures can not only be regarded as a generalization of sparse formats but are also more energy and time efficient under practically relevant assumptions. Finally, we test the storage requirements and execution performance of the proposed formats on compressed neural networks and compare them to dense and sparse representations.
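To make "low entropy relative to the empirical probability mass distribution of the elements" concrete, the following is a minimal sketch that computes the Shannon entropy (in bits per element) of the distribution of values in a matrix. The function name `empirical_entropy` and the three example matrices are hypothetical illustrations, not taken from the paper. The point it shows: a sparse quantized matrix and a non-sparse quantized matrix can have the same low entropy, while an unquantized matrix has very high entropy.

```python
import numpy as np


def empirical_entropy(w: np.ndarray) -> float:
    """Shannon entropy (bits per element) of the empirical distribution of w's values."""
    _, counts = np.unique(w, return_counts=True)  # frequency of each distinct value
    p = counts / w.size                           # empirical probability mass function
    return float(-(p * np.log2(p)).sum())


rng = np.random.default_rng(0)

# Hypothetical sparse, quantized matrix: about 90% zeros, the rest in {-1, 1}.
sparse = rng.choice([0.0, -1.0, 1.0], size=(100, 100), p=[0.9, 0.05, 0.05])

# Hypothetical low-entropy matrix that is NOT sparse: about 90% of entries share
# the nonzero value 0.5, so a zero-based sparse format gains nothing here.
dense_low_entropy = rng.choice([0.5, -1.0, 1.0], size=(100, 100), p=[0.9, 0.05, 0.05])

# Unquantized float matrix: essentially every value is distinct.
full = rng.standard_normal((100, 100))

for name, w in [("sparse", sparse), ("low entropy, not sparse", dense_low_entropy), ("unquantized", full)]:
    density = np.count_nonzero(w) / w.size
    print(f"{name}: entropy = {empirical_entropy(w):.2f} bits/element, nonzero fraction = {density:.2f}")
```

Both quantized matrices come out at roughly 0.57 bits per element (the entropy of a 0.9/0.05/0.05 distribution), whereas the unquantized matrix is close to log2(10000) ≈ 13.3 bits. Only the first of the two low-entropy matrices is sparse.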
We experimentally show that we are able to attain up to ×42 compression ratios, ×5 speedups, and ×90 energy savings when we losslessly convert state-of-the-art networks, such as AlexNet, VGG-16, ResNet152, and DenseNet, into the new data structures and benchmark their respective dot products.
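One general way a representation can exploit low entropy beyond sparsity is to factor out repeated values: within a row, sum the input entries that share the same weight value and multiply by that value only once. The sketch below illustrates that general idea under our own simplifying assumptions; it is not the data structure proposed in the paper, and the names `build_value_grouped_rows` and `grouped_matvec` are hypothetical.

```python
from collections import defaultdict

import numpy as np


def build_value_grouped_rows(w):
    """For each row, map every distinct nonzero value to the column indices where it occurs."""
    rows = []
    for row in w:
        groups = defaultdict(list)
        for j, v in enumerate(row):
            if v == 0:  # zeros contribute nothing, exactly as in a sparse format
                continue
            groups[v].append(j)
        rows.append({v: np.array(idx, dtype=np.intp) for v, idx in groups.items()})
    return rows


def grouped_matvec(rows, x):
    """Compute y = W @ x with one multiplication per distinct value per row."""
    y = np.zeros(len(rows))
    mults = 0
    for i, groups in enumerate(rows):
        acc = 0.0
        for v, idx in groups.items():
            acc += v * x[idx].sum()  # add the shared inputs first, then multiply once
            mults += 1
        y[i] = acc
    return y, mults


rng = np.random.default_rng(0)
w = rng.choice([0.0, 0.5, -1.0, 1.0], size=(64, 256), p=[0.4, 0.5, 0.05, 0.05])
x = rng.standard_normal(256)

y, mults = grouped_matvec(build_value_grouped_rows(w), x)
assert np.allclose(y, w @ x)
print(f"multiplications: grouped = {mults}, nonzeros = {np.count_nonzero(w)}, dense = {w.size}")
```

With at most three distinct nonzero values per row, the grouped product needs at most 3 × 64 = 192 multiplications instead of one per nonzero entry. The number of additions still grows with the number of nonzeros, and a real format also has to pay for storing the index lists, so this only illustrates where savings can come from.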

Highlights

The dot product operation between matrices constitutes one of the core operations in almost any field in science.
Deep neural networks rely heavily on dot product operations during inference [4]; e.g., networks such as VGG-16 require up to 16 dot product operations (one per weight layer), which results in 15 G operations for a single forward pass.
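The 15 G figure can be checked with a short back-of-the-envelope count. The sketch below assumes the standard VGG-16 configuration (224 × 224 RGB input, 3 × 3 convolutions with stride 1 and padding 1, three fully connected layers) and counts one multiply-accumulate as one operation; biases, activations, and pooling are ignored. The layer table is our own transcription of that standard architecture, not data from the paper.

```python
# (in_channels, out_channels, spatial size) of the 13 convolutional layers of VGG-16,
# assuming a 224x224 input and 3x3 kernels with padding 1 (conv keeps the spatial size).
CONV_LAYERS = [
    (3, 64, 224), (64, 64, 224),
    (64, 128, 112), (128, 128, 112),
    (128, 256, 56), (256, 256, 56), (256, 256, 56),
    (256, 512, 28), (512, 512, 28), (512, 512, 28),
    (512, 512, 14), (512, 512, 14), (512, 512, 14),
]
# (inputs, outputs) of the 3 fully connected layers; the first sees the 7x7x512 feature map.
FC_LAYERS = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]


def conv_macs(c_in, c_out, hw, k=3):
    """Each output position and output channel computes one dot product of length k*k*c_in."""
    return hw * hw * c_out * (k * k * c_in)


def fc_macs(n_in, n_out):
    """A dense layer is a single matrix-vector dot product."""
    return n_in * n_out


total = sum(conv_macs(*layer) for layer in CONV_LAYERS) + sum(fc_macs(*layer) for layer in FC_LAYERS)
num_layers = len(CONV_LAYERS) + len(FC_LAYERS)
print(f"{num_layers} layers, {total / 1e9:.2f} G multiply-accumulates")
```

This gives 16 weight layers and 15,470,264,320 multiply-accumulates (about 15.5 G), almost all of them in the convolutional layers.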
We provide a detailed analysis of the storage requirements and algorithmic complexity of performing the dot product associated with the proposed data structures.
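As a rough illustration of the kind of storage accounting such an analysis involves, the comparison below uses the standard dense and compressed sparse row (CSR) layouts, not the paper's own formats. The symbols are introduced here for illustration: an m × n matrix with nnz nonzero entries, b_w bits per stored value, b_idx bits per column index, and b_ptr bits per row pointer.

```latex
\begin{aligned}
S_{\mathrm{dense}} &= m\,n\,b_w,\\
S_{\mathrm{CSR}} &= \mathrm{nnz}\,(b_w + b_{\mathrm{idx}}) + (m+1)\,b_{\mathrm{ptr}},\\
S_{\mathrm{CSR}} < S_{\mathrm{dense}}
  &\iff \mathrm{nnz} < \frac{m\,n\,b_w - (m+1)\,b_{\mathrm{ptr}}}{b_w + b_{\mathrm{idx}}}.
\end{aligned}
```

For example, with m = n = 1000, 8-bit quantized values, and 32-bit indices and row pointers, the threshold is (8,000,000 − 32,032)/40 ≈ 199,199, so CSR is only smaller than the dense layout when fewer than about 20% of the entries are nonzero.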

Summary

Introduction

The dot product operation between matrices constitutes one of the core operations in almost any field in science. Since its complexity depends on the data structure used to represent the elements of the matrices, a great amount of research has focused on designing data structures and respective algorithms that can perform efficient dot product operations [5]–[7]. One can design efficient representations of sparse matrices by leveraging the prior assumption that most of their elements are zero and storing only the nonzero entries of the matrix. Their storage requirements then scale with the number of nonzero values. However, sparsity can be too constrained an assumption for the representation of quantized neural networks.
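The sparse representation described above can be sketched with the compressed sparse row (CSR) layout, one common sparse format: a values array, a column-index array, and a row-pointer array. The helper names `to_csr`, `csr_matvec`, and `csr_entries` are hypothetical; in practice one would use a library such as `scipy.sparse`. The example contrasts a sparse matrix with a low-entropy matrix whose dominant value is nonzero, where the zero-based assumption no longer pays off.

```python
import numpy as np


def to_csr(w):
    """Convert a dense 2-D array to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in w:
        nz = np.flatnonzero(row)      # column indices of the nonzero entries
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))   # where the next row starts in values/col_idx
    return np.array(values, dtype=float), np.array(col_idx, dtype=np.intp), np.array(row_ptr, dtype=np.intp)


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = W @ x while touching only the stored nonzero entries."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        start, end = row_ptr[i], row_ptr[i + 1]
        y[i] = values[start:end] @ x[col_idx[start:end]]
    return y


def csr_entries(values, col_idx, row_ptr):
    """Number of stored array entries: values + column indices + row pointers."""
    return values.size + col_idx.size + row_ptr.size


rng = np.random.default_rng(1)
x = rng.standard_normal(200)
for name, support in [("sparse", [0.0, -1.0, 1.0]), ("low entropy, not sparse", [0.5, -1.0, 1.0])]:
    w = rng.choice(support, size=(100, 200), p=[0.9, 0.05, 0.05])
    csr = to_csr(w)
    assert np.allclose(csr_matvec(*csr, x), w @ x)
    print(f"{name}: dense entries = {w.size}, CSR entries = {csr_entries(*csr)}")
```

In the sparse case CSR stores roughly 2,000 values, 2,000 column indices, and 101 row pointers instead of 20,000 entries. In the second case every entry is nonzero, so CSR stores 40,101 entries, about twice the dense size, even though both matrices have the same low entropy. (Entry counts ignore that values and indices may use different bit widths.)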

Objectives

Methods

Findings

Conclusion