Abstract

Convolutional Neural Networks (CNNs) are popular models that have been successfully applied to diverse domains such as vision, speech, and text. To reduce inference-time latency, it is common to employ hardware accelerators, which often require a model compression step. In contrast to most compression algorithms, which are agnostic to the underlying hardware acceleration strategy, this paper introduces a novel Symmetric $k$-means based compression algorithm that is specifically designed to support a new FPGA-based hardware acceleration scheme by reducing the number of inference-time multiply-accumulate (MAC) operations by up to 98%. First, a simple $k$-means based training approach is presented; then, as an extension, Symmetric $k$-means is proposed, which yields twice the reduction in MAC operations for the same bit-depth as the simple $k$-means approach. A comparative analysis is conducted on popular CNN architectures for tasks including classification, object detection, and end-to-end stereo matching on various datasets. For all tasks, model compression down to 3 bits is presented, while no loss in accuracy is observed for 5-bit quantization.
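The abstract does not spell out the algorithm, but the two ideas it names can be illustrated with a minimal, hypothetical sketch: simple $k$-means quantization replaces each weight with one of $2^b$ shared centroids, while a symmetric variant clusters only the weight magnitudes and reuses each centroid with both signs, so the same bit-depth needs only half as many distinct multiplier values. All function names and implementation details below are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def kmeans_1d(values, k, iters=50, seed=0):
    """Plain 1-D k-means; returns centroids and per-value assignments."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(values, size=k, replace=False)
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        # Recompute each centroid as the mean of its assigned values.
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = values[assign == j].mean()
    return centroids, assign

def quantize_simple(weights, bits):
    """Simple k-means quantization: 2**bits shared centroid values."""
    flat = weights.ravel()
    centroids, assign = kmeans_1d(flat, 2 ** bits)
    return centroids[assign].reshape(weights.shape)

def quantize_symmetric(weights, bits):
    """Symmetric variant (sketch): cluster magnitudes only and reuse each
    centroid with both signs, leaving 2**(bits-1) distinct magnitudes."""
    flat = weights.ravel()
    mags, assign = kmeans_1d(np.abs(flat), 2 ** (bits - 1))
    return (np.sign(flat) * mags[assign]).reshape(weights.shape)

w = np.random.default_rng(1).normal(size=(64, 64))
wq = quantize_symmetric(w, bits=3)
print(np.unique(np.abs(wq)).size)  # at most 4 distinct magnitudes
```

With shared centroids, an accelerator can sum the inputs paired with each centroid first and multiply once per centroid rather than once per weight; under the symmetric scheme, inputs mapped to $+c$ and $-c$ are added or subtracted into the same accumulator, which is one plausible reading of why it halves the multiplications for a given bit-depth.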
