Abstract

Intelligent vehicle driving systems aim to control a vehicle's driving behavior in real time, without human intervention, by perceiving and monitoring the surrounding environment. Automatically describing images of traffic scenes is one of the key problems of intelligent vehicle driving technology and has drawn attention since its inception. In recent years, a variety of automatic image description techniques have been proposed, among which the attention-based encoder-decoder framework has achieved good results. In this paper, we discuss how to fuse information from multiple aspects of traffic scene images. First, we introduce visual attention, text attention, and image topic attention, which generate weighted visual features, attentive text information, and global image topic information, respectively. We then propose an adaptive two-stage merging network based on an encoder-decoder framework, which fully integrates the three kinds of information in two stages while automatically calculating the proportion of each kind of information at every time step. Extensive experiments on the COCO2014 and Flickr30K datasets demonstrate the effectiveness and advantages of the proposed method.
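
As a rough illustration of the idea of merging attended information with proportions computed at each time step, the following is a minimal sketch for a single decoding step. It is not the paper's network: the abstract does not specify the formulation, so the dot-product attention, the scalar sigmoid gates, the order in which the three kinds of information are merged, and every name used here (`attend`, `gated_merge`, `two_stage_merge`, `w1`, `w2`) are assumptions made for illustration only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, items):
    """Soft attention (assumed dot-product scoring): returns the
    weighted sum of the item vectors and the attention weights."""
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    context = [sum(w * item[i] for w, item in zip(weights, items))
               for i in range(dim)]
    return context, weights


def gated_merge(a, b, hidden, gate_weights):
    """Mix vectors a and b with a scalar gate derived from the decoder
    state: g = sigmoid(gate_weights . hidden). The gate form is an
    assumption, not taken from the paper."""
    g = 1.0 / (1.0 + math.exp(-dot(gate_weights, hidden)))
    merged = [g * x + (1.0 - g) * y for x, y in zip(a, b)]
    return merged, g


def two_stage_merge(hidden, regions, words, topic, w1, w2):
    """One decoding step. Stage 1 merges attended visual and text
    features; stage 2 merges that result with the global topic vector.
    The pairing of inputs per stage is hypothetical."""
    visual, _ = attend(hidden, regions)
    text, _ = attend(hidden, words)
    stage1, g1 = gated_merge(visual, text, hidden, w1)
    stage2, g2 = gated_merge(stage1, topic, hidden, w2)
    return stage2, (g1, g2)
```

Because the gates depend on the decoder state `hidden`, which changes as each word is generated, the mixing proportions are recomputed at every step. In a trained model `w1` and `w2` would be learned parameters; here they are plain lists supplied by the caller.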
