Abstract

A thorough analysis and comprehension of the entire cue set in visual data are indispensable for an ideal video description model. As outlined in recent algorithm proposals, video descriptions have primarily been generated by learning RGB and optical flow representations rather than by exploring and incorporating the media’s spectral components, i.e., the patterns in the distribution of colors or intensities across different frequencies or wavelengths of light. These components may enhance description quality and improve the accuracy, diversity, and coherence of the generated text. To fill this research gap, we propose a novel Fourier-based algorithm that extracts spectral features from the 3D visual volume by decomposing the video signal into its frequency components. The captured spectral features are then fused with learned spatial and temporal representations in a recurrent transformer architecture for accurate content understanding and appropriate description generation in natural language. The transformer includes an external memory module that produces summarized memory states from the history of previously observed video fragments and already-generated sentences; these memory states help establish sound semantic and linguistic cues. As a result, our proposed algorithm integrates spatial, temporal, spectral, and semantic representations to produce precise and grammatically accurate descriptions. The effectiveness of the proposed algorithm for coherent and diverse video description is demonstrated through qualitative and quantitative experiments on the DeepRide driving-trip description dataset. A comprehensive ablation study validates the efficacy of fusing spectral features with spatial and temporal visual representations for rich video-to-text narration generation.
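
The abstract describes Fourier-based spectral feature extraction from the 3D visual volume followed by fusion with spatial and temporal representations. The sketch below illustrates one plausible reading of that pipeline only; the function names, pooling sizes, and concatenation-based fusion are assumptions standing in for the paper's learned recurrent-transformer fusion, not the authors' implementation.

```python
# Minimal sketch of Fourier-based spectral feature extraction for a video clip.
# Assumptions (not specified in the abstract): the spectral branch works on the
# raw clip tensor, applies a 3D FFT over the temporal and spatial axes, and
# reduces the magnitude spectrum to a fixed-length vector. The fusion shown here
# is plain concatenation; the paper fuses inside a recurrent transformer.
import torch


def extract_spectral_features(clip: torch.Tensor, out_dim: int = 256) -> torch.Tensor:
    """clip: (T, C, H, W) video volume -> (out_dim,) spectral descriptor."""
    # Decompose the 3D visual volume into its frequency components.
    spectrum = torch.fft.fftn(clip, dim=(0, 2, 3))       # complex spectrum over (T, H, W)
    magnitude = torch.log1p(torch.abs(spectrum))         # keep magnitudes, compress range

    # Pool the spectrum to a small fixed-size grid, then flatten.
    pooled = torch.nn.functional.adaptive_avg_pool3d(
        magnitude.permute(1, 0, 2, 3).unsqueeze(0),      # (1, C, T, H, W)
        output_size=(4, 8, 8),
    ).flatten()

    # Project to the desired feature size (a random projection here, standing in
    # for a learned linear layer).
    proj = torch.nn.Linear(pooled.numel(), out_dim)
    return proj(pooled)


def fuse_features(spatial: torch.Tensor, temporal: torch.Tensor,
                  spectral: torch.Tensor) -> torch.Tensor:
    """Simple concatenation of the three cue types (placeholder for learned fusion)."""
    return torch.cat([spatial, temporal, spectral], dim=-1)


if __name__ == "__main__":
    clip = torch.rand(16, 3, 112, 112)                   # 16-frame RGB clip
    spectral = extract_spectral_features(clip)
    spatial, temporal = torch.rand(512), torch.rand(512) # e.g. CNN / motion features
    fused = fuse_features(spatial, temporal, spectral)
    print(fused.shape)                                   # torch.Size([1280])
```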
