Abstract

Video-to-text (VTT) is the task of automatically generating descriptions for short audio-visual video clips. It can, for example, help visually impaired people understand the scenes shown in a YouTube video. Transformer architectures have shown strong performance in both machine translation and image captioning. In this work, we transfer promising approaches from image captioning and video processing to VTT and develop a straightforward Transformer architecture. We then extend this Transformer with a novel method for synchronizing audio and video features in Transformers, which we call Fractional Positional Encoding (FPE). We run multiple experiments on the VATEX dataset, improving the CIDEr and BLEU-4 scores by 21.72 and 8.38 points over a vanilla Transformer network, and achieve state-of-the-art results on the MSR-VTT and MSVD datasets. Our novel FPE alone increases the CIDEr score by a relative 8.6%.
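The abstract describes FPE only at a high level. One plausible reading is that it evaluates the standard sinusoidal positional encoding at real-valued (fractional) positions, so that audio and video token streams sampled at different rates can be placed on a shared time axis. The sketch below illustrates this interpretation only; the function name `fractional_positional_encoding`, the stream lengths, and the alignment scheme are our own assumptions, not the paper's implementation.

```python
import math
import torch

def fractional_positional_encoding(positions: torch.Tensor, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding evaluated at (possibly fractional) positions.

    Hypothetical sketch of the FPE idea, not the paper's code.
    positions: 1-D tensor of real-valued positions, shape (seq_len,).
    Returns a tensor of shape (seq_len, d_model); d_model must be even.
    """
    # Standard Transformer frequencies: 1 / 10000^(2i / d_model)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    angles = positions.unsqueeze(1) * div_term.unsqueeze(0)  # (seq_len, d_model/2)
    pe = torch.zeros(positions.size(0), d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Assumed usage: align a 32-frame video stream with a 48-step audio stream by
# rescaling the audio positions onto the video's time axis, so tokens that
# occur at the same moment receive (nearly) the same positional encoding.
num_video, num_audio, d_model = 32, 48, 512
video_pos = torch.arange(num_video, dtype=torch.float32)  # 0, 1, ..., 31
audio_pos = torch.arange(num_audio, dtype=torch.float32) * (num_video / num_audio)
video_pe = fractional_positional_encoding(video_pos, d_model)
audio_pe = fractional_positional_encoding(audio_pos, d_model)
```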
