Abstract

Convolutional Neural Networks (CNNs) are among the most advanced algorithms in deep learning. They are widely used in image processing, object detection, and machine translation. As demand for CNNs grows, the range of platforms on which they are deployed continues to expand. The Digital Signal Processor (DSP), a low-power, high-performance embedded solution, is used in many key application areas. This paper deploys a CNN on the Texas Instruments (TI) TMS320C6678 multi-core DSP and optimizes its main operation, convolution, to suit the DSP architecture. The optimized convolution operation is tens of times more efficient than the unoptimized version.

Highlights

  • In recent years, deep learning has emerged alongside the rapid development of computer hardware and has become a popular technology

  • The convolutional layer is known to be the most time-consuming layer of a Convolutional Neural Network (CNN). The network as a whole is a pipeline in which each layer depends on the output of the previous one, so the layers are not independent and the network is better suited to the master-slave parallel model

  • For multi-core communication, IPC is commonly used, but we find that its interrupt events take a long time, especially relative to small convolutional layers



Introduction

Deep learning has emerged alongside the rapid development of computer hardware and has become a popular technology. Because CNNs use a large number of convolutional layers, the computational complexity of the network is high and the workload is heavy. GPUs are often used to accelerate CNN training and inference. As the technology develops and application requirements grow, however, the area and power overhead of GPUs becomes prohibitive, so CNNs are increasingly being deployed on mobile and embedded platforms such as FPGAs and ASICs. Digital signal processors (DSPs) are known for their low power consumption, high computing power, and ease of programming. This paper uses the Texas Instruments (TI) TMS320C6678 for the deployment and optimization of CNNs. The TMS320C6678 is a multi-core DSP based on TI's Keystone architecture. It has 8 C66x cores, supports clock frequencies of 1 GHz or 1.25 GHz, delivers up to 320 GMAC or 160 GFLOPS, and consumes only 10 W. The integration of multiple C66x DSP cores creates a multi-core System-on-Chip (SoC) device with superior performance.
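The peak figures quoted above, and the workload that convolutional layers impose, can be checked with simple arithmetic. The following is a minimal sketch in Python, not code from the paper. The per-core rates of 32 fixed-point MACs and 16 single-precision floating-point operations per cycle are assumptions that are not stated in this text, and the layer dimensions are hypothetical, chosen only for illustration.

```python
def peak_rate(cores, clock_hz, ops_per_cycle):
    """Peak operations per second summed over all cores."""
    return cores * clock_hz * ops_per_cycle


def conv_macs(out_h, out_w, out_ch, in_ch, k_h, k_w):
    """Multiply-accumulates needed by one direct (sliding-window) conv layer.

    Every output element needs in_ch * k_h * k_w multiply-accumulates.
    """
    return out_h * out_w * out_ch * in_ch * k_h * k_w


CORES = 8                  # C66x cores, as stated in the text
CLOCK_HZ = 1_250_000_000   # 1.25 GHz, as stated in the text
MACS_PER_CYCLE = 32        # assumption: fixed-point MACs per core per cycle
FLOPS_PER_CYCLE = 16       # assumption: single-precision FLOPs per core per cycle

peak_gmac = peak_rate(CORES, CLOCK_HZ, MACS_PER_CYCLE) / 1e9
peak_gflops = peak_rate(CORES, CLOCK_HZ, FLOPS_PER_CYCLE) / 1e9

# Hypothetical layer: 28x28 output, 16 output channels,
# 6 input channels, 5x5 kernel.
layer_macs = conv_macs(28, 28, 16, 6, 5, 5)

# Lower bound on the layer's run time at peak fixed-point throughput.
ideal_seconds = layer_macs / (peak_gmac * 1e9)

print(f"peak: {peak_gmac:.0f} GMAC, {peak_gflops:.0f} GFLOPS")
print(f"layer MACs: {layer_macs}, ideal time: {ideal_seconds * 1e6:.2f} us")
```

Under these assumptions the computed peaks match the 320 GMAC and 160 GFLOPS quoted above. The ideal time is only a lower bound: it ignores memory traffic and inter-core communication, which are the costs that deployment on the DSP has to manage.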
