Abstract

The dominant video captioning methods employ an attentional encoder–decoder architecture in which the decoder is autoregressive and generates sentences from left to right. However, these methods generally suffer from exposure bias and neglect the guidance of future output contexts that right-to-left decoding can provide. Here, the authors propose a new symmetric bidirectional decoder for video captioning. The authors first integrate self-attentive multi-head attention with a bidirectional gated recurrent unit to capture long-term semantic dependencies in videos. They then apply a single decoder to generate descriptions from left to right and from right to left simultaneously. At each time step, the decoder in each decoding direction applies two cross-attentive multi-head attention modules, one over the past hidden states from the same decoding direction and one over the future hidden states from the reverse decoding direction. A symmetric semantic-guided gated attention module is devised to adaptively suppress irrelevant or misleading content in the past or future output contexts and to retain the useful content, which helps avoid under-description. Experimental evaluations on two widely used benchmark datasets, Microsoft Research Video to Text (MSR-VTT) and the Microsoft Video Description Corpus (MSVD), demonstrate that the proposed method obtains state-of-the-art performance, which validates the effectiveness of the bidirectional decoder.
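The gating idea described above can be illustrated with a minimal sketch. This is not the authors' formulation, which the abstract does not give. It assumes that each context vector (past or future) is scaled element-wise by a sigmoid gate computed from that context and a semantic guide vector, and that the two gated contexts are then summed. The function names, the weight layout, and the summation step are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_context(context, guide, weights, bias):
    """Scale each element of `context` by a gate in (0, 1).

    Assumed form: gate_i = sigmoid(weights[i] . [context; guide] + bias[i]).
    A gate near 0 suppresses that element; a gate near 1 keeps it.
    """
    joined = context + guide  # list concatenation: [context; guide]
    gates = [sigmoid(dot(w, joined) + b) for w, b in zip(weights, bias)]
    return [g * c for g, c in zip(gates, context)]


def fuse(past, future, guide, w_past, b_past, w_future, b_future):
    """Gate the past and future contexts separately, then sum them.

    The same gating function is used for both directions, mirroring the
    'symmetric' wording in the abstract. Summation is an assumption.
    """
    gated_past = gated_context(past, guide, w_past, b_past)
    gated_future = gated_context(future, guide, w_future, b_future)
    return [p + f for p, f in zip(gated_past, gated_future)]


# Toy inputs: 2-dimensional contexts and a 1-dimensional guide vector.
past = [1.0, 2.0]
future = [3.0, -2.0]
guide = [0.5]
zero_w = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

# With zero weights and zero bias every gate is 0.5.
neutral = fuse(past, future, guide, zero_w, [0.0, 0.0], zero_w, [0.0, 0.0])

# A strongly negative bias on the past gate suppresses the past context.
suppressed = fuse(past, future, guide, zero_w, [-50.0, -50.0], zero_w, [0.0, 0.0])
```

In this toy setting, `neutral` is the plain average of the two contexts, while `suppressed` is driven almost entirely by the future context. In a trained model the gate weights would be learned, so which context gets suppressed would depend on the input.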
