Abstract

To meet the demands of high-resolution image recognition and efficient localization processing in the aviation and aerospace fields, and to address the insufficient parallelism of existing research, an extensible multiprocessor-cluster deep learning processor architecture based on VLIW is designed by optimizing the computation of each layer of the deep convolutional neural network model. The design combines parallel processing of feature maps and neurons, instruction-level parallelism based on very long instruction word (VLIW), data-level parallelism across multiprocessor clusters, and pipelining. Test results on an FPGA prototype system show that the processor can effectively perform image classification and object detection. The peak performance of the processor reaches 128 GOP/s at 200 MHz. On the selected benchmarks, the processor is at least about 12X faster than the CPU and 7X faster than the GPU. Compared with the results of the software framework, the average error in the processor's test accuracy is less than 1%.
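As a quick sanity check on the peak figure, 128 GOP/s at 200 MHz implies 640 operations per clock cycle. The sketch below works this out. The function name is illustrative, and the convention that one multiply-accumulate counts as two operations is an assumption here, not something the abstract states.

```python
def ops_per_cycle(peak_ops_per_s: float, clock_hz: float) -> float:
    """Operations that must complete each clock cycle to reach the peak rate."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return peak_ops_per_s / clock_hz


PEAK_OPS_PER_S = 128e9  # 128 GOP/s, from the abstract
CLOCK_HZ = 200e6        # 200 MHz, from the abstract

per_cycle = ops_per_cycle(PEAK_OPS_PER_S, CLOCK_HZ)

# Assumption (not from the source): 1 MAC = 2 operations (multiply + add).
macs_per_cycle = per_cycle / 2

print(f"operations per cycle: {per_cycle:.0f}")
print(f"MACs per cycle (if 1 MAC = 2 ops): {macs_per_cycle:.0f}")
```

How those per-cycle operations are divided among clusters and VLIW slots is not given in the abstract, so the sketch stops at the aggregate number.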

Summary

Introduction

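The parameters are obtained through offline training. Training uses Caffe [12], a deep learning framework currently popular in industry, and the hardware environment consists of a CPU (Core i7-6700HQ) and a GPU (GTX 960M). The benchmarks are deep convolutional neural network models: LeNet-5 [1-2] and AlexNet [1,3] with modified network structures, MobileNet [5], and SSD300 + MobileNet [4-5]. The training and test datasets are MNIST [2], CIFAR-10 [13], Stanford Dogs [14], PASCAL VOC2007 [15], and VOC2012 [15].

After training, the neural network parameters are extracted from the resulting Caffemodel and, after preprocessing, are used for computation on the processor. The Prototxt file of the deep neural network model to be deployed is mapped to the processor by a software compiler, which generates the VLIW instruction sequence that the processor executes.

In the image classification tests, LeNet-5 [1-2], AlexNet [1,3], and MobileNet [5] serve as the benchmarks and are tested on the MNIST [2], CIFAR-10 [13], and Stanford Dogs [14] datasets, respectively. Table 2 compares the resulting test accuracy with the accuracy measured using the Caffe [12] software framework. During the image classification experiments, Caffe's timing function was used to measure how long the CPU and the GPU in the hardware environment each take to process one image for every benchmark on its dataset, and the running time of the deep learning VLIW processor was obtained by simulation. Table 3 shows the comparison.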
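The two comparisons described above (accuracy against Caffe, and per-image time against the CPU and GPU) reduce to simple differences and ratios. The following is a minimal sketch of that arithmetic. The function names are hypothetical, "average error" is assumed to mean the mean absolute difference in percentage points, and the numbers in the usage lines are placeholders chosen for illustration, not values from Table 2 or Table 3.

```python
from typing import Sequence


def speedup(baseline_time_s: float, processor_time_s: float) -> float:
    """Ratio of baseline (CPU or GPU) per-image time to processor per-image time."""
    if processor_time_s <= 0:
        raise ValueError("processor_time_s must be positive")
    return baseline_time_s / processor_time_s


def mean_abs_accuracy_error(reference_pct: Sequence[float],
                            measured_pct: Sequence[float]) -> float:
    """Mean absolute accuracy difference, in percentage points, over benchmarks.

    Assumption: this is one plausible reading of "average error"; the source
    does not define the metric.
    """
    if len(reference_pct) != len(measured_pct) or not reference_pct:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(r - m) for r, m in zip(reference_pct, measured_pct)]
    return sum(diffs) / len(diffs)


# Placeholder inputs for illustration only (NOT measurements from the paper).
example_speedup = speedup(baseline_time_s=6.0, processor_time_s=0.5)
example_error = mean_abs_accuracy_error([99.0, 80.0], [98.5, 79.0])

print(example_speedup)
print(example_error)
```

Note that the CPU and GPU times come from Caffe's timer while the processor time comes from simulation, so a ratio computed this way compares measurements taken by different methods.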
