Abstract

Translating a video into natural description sentences based on its content is an interesting and challenging task. In this work, an advanced framework is built to generate coherent and semantically rich sentences for video captioning. A long short-term memory (LSTM) network with an improved factored way is first developed; it draws inspiration from the LSTM with a conventional factored way and from the common practice of feeding multi-modal features into the LSTM at the first time step for visual description. Then, the combination of the LSTM networks with the proposed improved factored way and with the un-factored way is exploited, and a voting strategy is utilized to predict candidate words. In addition, for robust and abstract visual and language representations, residual connections inspired by the residual network (ResNet) are employed to strengthen gradient signals, and a deeper LSTM network is constructed. Furthermore, three convolutional neural network based features, extracted from GoogLeNet, ResNet-101, and ResNet-152, are fused to capture more comprehensive and complementary visual information. Experiments are conducted on two benchmark datasets, MSVD and MSR-VTT2016, and the proposed techniques achieve competitive performance compared to other state-of-the-art methods.
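As a quick illustration of the pipeline sketched in the abstract, the snippet below is a minimal sketch, not the authors' implementation: three CNN feature vectors are fused by simple concatenation, a two-layer LSTM decoder with a residual connection between its layers receives the visual features at the first time step and words afterwards, and two decoder variants vote on candidate words by averaging their word distributions. The use of PyTorch, all module names, and all layer sizes are illustrative assumptions.

```python
# Minimal sketch of fused CNN features + residual two-layer LSTM decoder + voting.
# Not the paper's code; sizes and names are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualLSTMDecoder(nn.Module):
    def __init__(self, feat_dim, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.visual_proj = nn.Linear(feat_dim, embed_dim)   # project fused CNN features
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm1 = nn.LSTMCell(embed_dim, hidden_dim)
        self.lstm2 = nn.LSTMCell(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, fused_feats, captions):
        batch, steps = captions.shape
        h1 = c1 = h2 = c2 = fused_feats.new_zeros(batch, self.lstm1.hidden_size)
        # Feed the visual features at the first time step, then the previous words.
        inputs = [self.visual_proj(fused_feats)] + \
                 [self.embed(captions[:, t]) for t in range(steps - 1)]
        logits = []
        for x in inputs:
            h1, c1 = self.lstm1(x, (h1, c1))
            h2, c2 = self.lstm2(h1, (h2, c2))
            logits.append(self.out(h1 + h2))     # residual connection between layers
        return torch.stack(logits, dim=1)        # (batch, steps, vocab)


def vote(decoders, fused_feats, captions):
    """Average the per-word distributions of several decoder variants."""
    probs = [F.softmax(d(fused_feats, captions), dim=-1) for d in decoders]
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)


if __name__ == "__main__":
    # Stand-ins for GoogLeNet / ResNet-101 / ResNet-152 features of 2 videos.
    googlenet, resnet101, resnet152 = (torch.randn(2, d) for d in (1024, 2048, 2048))
    fused = torch.cat([googlenet, resnet101, resnet152], dim=1)  # concatenation fusion
    caps = torch.randint(0, 10000, (2, 8))                       # dummy word indices
    decoders = [ResidualLSTMDecoder(fused.size(1)) for _ in range(2)]
    print(vote(decoders, fused, caps).shape)                     # torch.Size([2, 8])
```

Concatenation is used here only as the simplest possible fusion scheme, and the voting step averages softmax outputs as one straightforward way to combine the factored and un-factored decoders described above.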
