Abstract
Deep Neural Networks (DNNs) have advanced the state-of-the-art in a variety of machine learning tasks and are deployed in increasing numbers of products and services. However, the computational requirements of training and evaluating large-scale DNNs are growing at a much faster pace than the capabilities of the underlying hardware platforms that they are executed upon. To address this challenge, one promising approach is to exploit the error resilient nature of DNNs by skipping or approximating computations that have negligible impact on classification accuracy. Almost all prior efforts in this direction propose static DNN approximations by either pruning network connections, implementing computations at lower precision, or compressing weights. In this work, we propose <u>Dy</u>namic <u>V</u>ariable <u>E</u>ffort <u>Deep</u> Neural Networks (D y VED eep ) to reduce the computational requirements of DNNs during inference. Complementary to the aforementioned static approaches, DyVEDeep is a dynamic approach that exploits heterogeneity in the DNN inputs to improve their compute efficiency with comparable classification accuracy and without requiring any re-training. D y VED eep equips DNNs with dynamic effort mechanisms that identify computations critical to classifying a given input and focus computational effort only on the critical computations, while skipping or approximating the rest. We propose three dynamic effort mechanisms that operate at different levels of granularity viz. neuron, feature, and layer levels. We build D y VED eep versions of six popular image recognition benchmarks (CIFAR-10, AlexNet, OverFeat, VGG-16, SqueezeNet, and Deep-Compressed-AlexNet) within the Caffe deep-learning framework. We evaluate D y VED eep on two platforms—a high-performance server with a 2.7 GHz Intel Xeon E5-2680 processor and 128 GB memory, and a low-power Raspberry Pi board with an ARM Cortex A53 processor and 1 GB memory. Across all benchmarks, D y VED eep achieves 2.47×--5.15× reduction in the number of scalar operations, which translates to 1.94×--2.23× and 1.46×--3.46× performance improvement over well-optimized baselines on the Xeon server and the Raspberry Pi, respectively, with comparable classification accuracy.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.