Abstract

Graphics Processing Units (GPUs) have evolved into powerful co-processors for CNN training. Many new features, such as concurrent kernel execution and Hyper-Q technology, have been introduced into GPUs. Orchestrating concurrency for convolutional neural network (CNN) training on GPUs is challenging, since it may introduce synchronization overhead and poor resource utilization. Unlike previous research, which mainly focuses on single-layer or coarse-grained optimization, we introduce a critical-path-based asynchronous parallelization mechanism and propose an optimization technique for CNN training that jointly considers the global network architecture and GPU resource usage. The proposed methods effectively overlap synchronization and computation in different streams, thereby accelerating the CNN training process. We have integrated our methods into Caffe. Experimental results show that Caffe integrated with our methods achieves a 1.30X speedup on average over Caffe+cuDNN, with even higher speedups for deeper, wider, and more complicated networks.
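
As a rough illustration of stream-level overlap on a GPU, the sketch below launches independent kernels on separate CUDA streams and uses an event for a cross-stream dependency instead of a blocking device-wide synchronization. This is a generic sketch, not the paper's implementation; the kernel, buffer, and stream names are illustrative assumptions.

    // Minimal CUDA sketch (assumed example, not the paper's code): two streams,
    // one event as a fine-grained cross-stream dependency.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        cudaEvent_t aReady;
        cudaEventCreateWithFlags(&aReady, cudaEventDisableTiming);

        dim3 block(256), grid((n + 255) / 256);

        // Independent kernels issued to different streams may run concurrently.
        scale<<<grid, block, 0, s1>>>(a, n, 2.0f);
        cudaEventRecord(aReady, s1);               // mark completion of the work on 'a'
        scale<<<grid, block, 0, s2>>>(b, n, 3.0f); // unrelated work proceeds in s2

        // s2 waits only on the event, not on the whole device, so only the
        // dependent kernel below is delayed.
        cudaStreamWaitEvent(s2, aReady, 0);
        scale<<<grid, block, 0, s2>>>(a, n, 0.5f); // depends on the first kernel

        cudaDeviceSynchronize();
        printf("done\n");

        cudaEventDestroy(aReady);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(a);
        cudaFree(b);
        return 0;
    }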

Highlights

  • Deep neural networks (DNNs) have been widely applied to solve problems in many practical fields such as image classification, object detection, speech recognition, and language translation

  • We describe a mechanism to further improve the performance of running TurboDL in the multi-GPU setting

  • While we mainly use CNNs as the example to demonstrate the effectiveness of our methods, they have excellent potential to be applied to other complicated multi-stage applications, such as database query processing, and to other network architectures, such as RNNs, tree neural networks, generative adversarial networks, and Graph-based Convolutional Neural networks (GCNs)

Summary

Introduction

Deep neural networks (DNNs) have been widely applied to solve problems in many practical fields such as image classification, object detection, speech recognition, and language translation. Since training deep neural networks is a very time- and resource-consuming task, general-purpose graphics processing units (GPUs) are often used to accelerate the neural network training process. It should be noted that because existing platforms are optimized for current GPUs, they may need to be revised as GPU architectures evolve in order to make efficient use of the features added in new architectures and retain good performance. This type of re-optimization is a non-trivial task.

