Abstract
Convolutional neural networks (CNNs) require significant computing power during inference. Smartphones, for example, may not run a facial recognition system or search algorithm smoothly due to the lack of resources and supporting hardware. Methods for reducing memory size and increasing execution speed have been explored, but choosing effective techniques for an application requires extensive knowledge of the network architecture. This paper proposes a general approach to preparing a compressed deep neural network processor for inference with minimal additions to existing microprocessor hardware. To show the benefits of the proposed approach, an example CNN for synthetic aperture radar target classification is modified and complementary custom processor instructions are designed. The modified CNN is examined to show the effects of the modifications, and the custom processor instructions are profiled to illustrate the potential performance increase they offer.
Highlights
Convolutional neural networks (CNNs) have become increasingly popular for image classification and a variety of other machine learning tasks
Although they are more efficient than other classifier types that can be trained with large datasets, CNNs are still computationally intensive applications
New processor instructions that compute the core CNN layers efficiently, alongside a Single Instruction Multiple Data (SIMD) Multiply and Accumulate (MAC) instruction for fully connected layers and 1 × 1 convolutions, can cover all basic layers of a modern CNN and enable fast, low-power inference
Summary
Convolutional neural networks (CNNs) have become increasingly popular for image classification and a variety of other machine learning tasks. In addition to computation requirements, memory access penalties significantly impact overall execution time and power consumption. Converting the weights and data in the middle layers of a CNN to smaller bit-width representations drastically reduces the number of memory accesses and increases execution speed in real systems. For the same chip area, multiple small fixed-point multipliers increase the computational throughput for convolution and fully connected layers. GPUs are the preferred method of training and running CNNs in research because they hide memory access penalties with high image throughput. New processor instructions to calculate these layers efficiently, alongside a SIMD MAC instruction for fully connected layers and 1 × 1 convolution, can cover all basic layers of a modern CNN and enable fast, low-power inference. The computational speed increase and gate counts of the custom instructions provide a basis for assessing the viability of the proposed approach in a real application-specific instruction set processor (ASIP).
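To make the role of a fixed-point MAC concrete, the following is a minimal scalar sketch in C of the computation that such a SIMD MAC instruction would accelerate for a fully connected layer or 1 × 1 convolution. It is not the paper's actual instruction set extension; the function name, operand widths, and loop structure are illustrative assumptions only.

```c
/* Illustrative sketch (assumed, not the paper's ISA extension):
 * a scalar reference for the 8-bit fixed-point multiply-accumulate
 * work that a SIMD MAC instruction could perform several lanes at
 * a time for a fully connected layer or 1x1 convolution. */
#include <stdint.h>
#include <stddef.h>

/* Accumulate n products of 8-bit weights and activations into a
 * 32-bit sum, i.e. one output neuron's dot product. */
int32_t fc_dot_q8(const int8_t *weights, const int8_t *activations, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)weights[i] * (int32_t)activations[i];
    }
    return acc;
}
```

A dedicated SIMD MAC instruction would, under these assumptions, fold several iterations of this loop into a single operation on packed 8-bit operands, which is where the reduced bit-width representation translates into fewer memory accesses and higher throughput.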