Abstract

This paper describes our implementation of several neural networks built on a field programmable gate array (FPGA) and used to recognize a handwritten digit dataset, the Modified National Institute of Standards and Technology (MNIST) database. We also propose a novel hardware-friendly activation function, the dynamic Rectified Linear Unit (D-ReLU), which achieves higher performance than traditional activation functions at no cost to accuracy. We built a 2-layer online training multilayer perceptron (MLP) neural network on an FPGA with varying data widths. Reducing the data width from 8 to 4 bits lowers prediction accuracy by only 11% while decreasing FPGA area by 41%. Compared to networks that use the sigmoid function, our proposed D-ReLU function uses 24% - 41% less area with no loss in prediction accuracy. When the data width of the 3-layer networks is further reduced from 8 to 4 bits, prediction accuracy decreases by only 3% - 5% while area shrinks by 9% - 28%. Moreover, the FPGA solutions execute 29 times faster despite running at a 60 times lower clock rate. Thus, FPGA implementations of neural networks offer a high-performance, low-power alternative to traditional software methods, and our novel D-ReLU activation function offers additional performance and power savings.
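The data-width results above rest on representing weights and activations in narrow fixed-point formats. The C sketch below is a minimal illustration of quantizing a value to an 8-bit versus a 4-bit signed fixed-point code; it is not the paper's actual hardware design, and the particular Q-format split (number of fractional bits) is our assumption for the example.

```c
#include <stdint.h>
#include <stdio.h>

/* Quantize a real-valued weight/activation to a signed fixed-point
 * code with `bits` total bits and `frac` fractional bits, rounding to
 * nearest and saturating on overflow. The 8-bit and 4-bit formats in
 * main() are illustrative; the paper's exact format is not given here. */
static int8_t quantize(double x, int bits, int frac) {
    int32_t max = (1 << (bits - 1)) - 1;   /* e.g. +127 for 8 bits */
    int32_t min = -(1 << (bits - 1));      /* e.g. -128 for 8 bits */
    int32_t q = (int32_t)(x * (1 << frac) + (x >= 0 ? 0.5 : -0.5));
    if (q > max) q = max;
    if (q < min) q = min;
    return (int8_t)q;
}

int main(void) {
    double w = 0.73;
    /* 8-bit code with 6 fractional bits vs. 4-bit code with 2. */
    int8_t q8 = quantize(w, 8, 6);
    int8_t q4 = quantize(w, 4, 2);
    printf("w=%.2f  8-bit code=%d (%.4f)  4-bit code=%d (%.4f)\n",
           w, q8, q8 / 64.0, q4, q4 / 4.0);
    return 0;
}
```

Halving the data width roughly halves multiplier and storage cost on the FPGA, which is why the 4-bit designs save so much area at a modest accuracy cost.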

Highlights

  • Machine learning and deep learning algorithms and their applications are becoming increasingly prevalent

  • The 8-bit hardware design of the 2-layer online training multilayer perceptron (MLP) neural network achieves execution time (3.8 seconds) and prediction accuracy (89%) similar to those of the 32-bit software solution, even though the software runs at a clock speed 144 times greater than the hardware design's (3.6 GHz vs. 25 MHz)

  • The dynamic Rectified Linear Unit (D-ReLU) activation function proposed in this paper is more flexible and accurate than the traditional ReLU function (see the sketch after this list)
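This page does not reproduce the paper's exact D-ReLU formulation, so the C sketch below shows only one plausible reading of a "dynamic" ReLU: an activation whose demarcation point is a runtime parameter rather than the fixed zero of a standard ReLU. The threshold-update rule used here (the mean of the layer's pre-activations) is purely our assumption for illustration.

```c
#include <stdio.h>

/* Standard ReLU: fixed demarcation at zero. */
static double relu(double x) { return x > 0.0 ? x : 0.0; }

/* Hypothetical dynamic ReLU: the demarcation point `t` is a runtime
 * parameter instead of the constant 0. How `t` is chosen per layer is
 * an assumption in this sketch; the paper's actual rule may differ. */
static double d_relu(double x, double t) { return x > t ? x : 0.0; }

int main(void) {
    double pre_act[] = { -1.2, 0.3, 0.9, 2.1 };
    int n = sizeof pre_act / sizeof pre_act[0];

    /* Illustrative dynamic threshold: mean of the pre-activations. */
    double t = 0.0;
    for (int i = 0; i < n; i++) t += pre_act[i];
    t /= n;

    for (int i = 0; i < n; i++)
        printf("x=% .2f  relu=%.2f  d_relu(t=%.2f)=%.2f\n",
               pre_act[i], relu(pre_act[i]), t, d_relu(pre_act[i], t));
    return 0;
}
```

A comparison like this, built from a single extra subtract/compare, suggests why such a function maps cheaply to FPGA logic compared with a sigmoid, which needs a lookup table or piecewise approximation.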



Introduction

Machine learning and deep learning algorithms and their applications are becoming increasingly prevalent. While these algorithms enhance system intelligence, they have the disadvantage of being compute-intensive. Several recent machine learning systems therefore combine software algorithms with hardware designed specifically for those algorithms. For example, the tensor processing unit (TPU) is a domain-specific custom application-specific integrated circuit (ASIC) designed to implement machine learning algorithms; it is 15 - 30 times faster and 30 - 80 times more power efficient than contemporary GPUs and CPUs when running machine learning workloads.
