Abstract

A novel network model, the bidirectional depth residual gated recurrent unit network (BDR-GRU), is designed and implemented to improve the effectiveness of image captioning. BDR-GRU follows an encoder-decoder architecture. Moreover, the network runs on an NVIDIA Jetson TX2 processor, which makes the algorithm applicable to mobile robots. In the encoding stage, a convolutional neural network extracts the multi-dimensional feature vectors of the image; in the decoding stage, the BDR-GRU network generates the sentence. The BDR-GRU network is a new recurrent neural network model that improves on the standard GRU network. First, the GRU network is deepened from a single layer to multiple layers. Second, a bidirectional structure is redesigned to strengthen the network's inference ability. Finally, a residual mechanism between layers is designed to prevent the vanishing gradients and over-fitting caused by the increased depth. Experiments are carried out on the TX2 processor to verify the effectiveness of our design, and the results are compared with the img-gLSTM model, the Neural Talk model, an attention model, and a unidirectional GRU model. The experimental results show that the CIDEr score of our network model is 12.7% higher than that of the img-gLSTM network and 14.6% higher than that of the Neural Talk network; the other evaluation metrics also improve significantly. These results demonstrate the effectiveness of our BDR-GRU model.
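The three modifications described above (a deeper GRU stack, bidirectional passes, and residual connections between layers) can be illustrated with a minimal sketch. This is not the paper's implementation: all names (`GRUCell`, `bidirectional_layer`, `bdr_gru`), the toy dimensions, and the omission of bias terms are our own simplifying assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal single-step GRU cell (biases omitted for brevity)."""
    def __init__(self, input_size, hidden_size, rng):
        s = 1.0 / np.sqrt(hidden_size)
        # rows 0..2: update gate, reset gate, candidate state
        self.W = rng.uniform(-s, s, (3, hidden_size, input_size))
        self.U = rng.uniform(-s, s, (3, hidden_size, hidden_size))

    def step(self, x, h):
        z = sigmoid(self.W[0] @ x + self.U[0] @ h)               # update gate
        r = sigmoid(self.W[1] @ x + self.U[1] @ h)               # reset gate
        h_tilde = np.tanh(self.W[2] @ x + self.U[2] @ (r * h))   # candidate
        return (1.0 - z) * h + z * h_tilde

def bidirectional_layer(fwd, bwd, xs, hidden_size):
    """Run one layer forward and backward over the sequence, then concatenate."""
    hf, hb = np.zeros(hidden_size), np.zeros(hidden_size)
    fwd_out, bwd_out = [], []
    for x in xs:
        hf = fwd.step(x, hf)
        fwd_out.append(hf)
    for x in reversed(xs):
        hb = bwd.step(x, hb)
        bwd_out.append(hb)
    bwd_out.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd_out, bwd_out)]

def bdr_gru(xs, num_layers, hidden_size, seed=0):
    """Stack bidirectional GRU layers with residual (skip) connections."""
    rng = np.random.default_rng(seed)
    out = xs
    for _ in range(num_layers):
        in_size = len(out[0])
        fwd = GRUCell(in_size, hidden_size, rng)
        bwd = GRUCell(in_size, hidden_size, rng)
        new = bidirectional_layer(fwd, bwd, out, hidden_size)
        # residual connection, applied only when input/output widths match
        if len(new[0]) == in_size:
            new = [n + o for n, o in zip(new, out)]
        out = new
    return out

# Toy run: a 5-step sequence of 8-dim features through 3 layers, hidden size 4
rng = np.random.default_rng(1)
seq = [rng.standard_normal(8) for _ in range(5)]
out = bdr_gru(seq, num_layers=3, hidden_size=4)
print(len(out), out[0].shape)  # 5 timesteps, each 2 * hidden_size = 8 features
```

In this sketch the residual add is only possible because the concatenated bidirectional output width (2 × hidden size) equals the layer's input width; a full implementation would otherwise insert a linear projection on the skip path.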

Highlights

  • People use language to express their thoughts and describe what they see in daily life, but turning computer-vision information into natural-language descriptions is an extremely challenging task that combines image processing, language processing, and other research directions

  • We choose the JETSON TX2 embedded processor developed by NVIDIA, which offers powerful performance in a small form factor, as the core processor of the experiments, as shown in Fig. 7 and Fig. 8

  • We use the processor as the core of a mobile robot, and the image-captioning results improve the human-computer interaction of the mobile robot



Introduction

People use language to express their thoughts and describe what they see in daily life, but turning computer-vision information into natural-language descriptions is an extremely challenging task that combines image processing, language processing, and other research directions.

Index Terms: Computer vision, image captioning, deep neural network, BDR-GRU.


