Abstract

The explosive growth of video social platforms has spawned a wide range of social video prediction (SVP) tasks, such as video attractiveness prediction and video sentiment classification. In this paper, we propose to enhance SVP by making synchronous predictions from temporal and spatial data perspectives and reconciling them into a consistent predictive view. To this end, we develop a novel multimodal deep learning method named MATSC (modality-awareness- and temporal-spatial-consistency-based neural network). Specifically, MATSC first constructs the temporal predictive view by capturing valuable fine-grained data patterns and generating diverse multimodal representations via a modality-awareness learning strategy. Second, MATSC constructs the spatial predictive view by exploiting diverse modality-wise interactive patterns in fine-grained video clips. Third, MATSC reconciles the heterogeneous temporal and spatial predictive capabilities via a temporal-spatial-consistency learning objective. Empirical results on three SVP datasets show that MATSC outperforms state-of-the-art benchmarks, demonstrating the benefit of synergizing temporal and spatial data views for SVP tasks.
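
The abstract does not specify MATSC's architecture, so the sketch below is only a minimal illustration of the core idea of reconciling two predictive views with a consistency objective. It assumes PyTorch, a GRU as the temporal encoder, mean pooling over clips for the spatial view, and a symmetric KL term as the agreement penalty; all class and function names here are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoViewSVPModel(nn.Module):
    """Illustrative two-view predictor: a temporal head over the clip
    sequence and a spatial head over per-clip cross-modal features."""

    def __init__(self, dim=128, num_classes=2):
        super().__init__()
        # Temporal view: model the ordered sequence of fused clip features.
        self.temporal_encoder = nn.GRU(dim, dim, batch_first=True)
        self.temporal_head = nn.Linear(dim, num_classes)
        # Spatial view: mix modality-wise interaction features within clips.
        self.spatial_mixer = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.spatial_head = nn.Linear(dim, num_classes)

    def forward(self, clip_feats):
        # clip_feats: (batch, num_clips, dim) fused multimodal clip features.
        _, h = self.temporal_encoder(clip_feats)
        logits_t = self.temporal_head(h.squeeze(0))        # temporal view
        mixed = self.spatial_mixer(clip_feats).mean(dim=1)  # pool over clips
        logits_s = self.spatial_head(mixed)                 # spatial view
        return logits_t, logits_s

def consistency_loss(logits_t, logits_s, labels, alpha=0.1):
    """Task loss on both views plus a symmetric KL term that pulls the
    two predictive distributions toward agreement (the 'reconciling' step)."""
    task = F.cross_entropy(logits_t, labels) + F.cross_entropy(logits_s, labels)
    log_p_t = F.log_softmax(logits_t, dim=-1)
    log_p_s = F.log_softmax(logits_s, dim=-1)
    agree = (F.kl_div(log_p_t, log_p_s, reduction="batchmean", log_target=True)
             + F.kl_div(log_p_s, log_p_t, reduction="batchmean", log_target=True))
    return task + alpha * agree
```

Given a batch of fused clip features `x` of shape `(batch, num_clips, dim)` and labels `y`, training would call `logits_t, logits_s = model(x)` and minimize `consistency_loss(logits_t, logits_s, y)`, so that both views are supervised by the task while the agreement term enforces a consistent predictive view.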
