Abstract

Autonomous highlight detection aims to identify the most captivating moments in a video, which is crucial for enhancing the efficiency of video editing and browsing on social media platforms. However, current efforts primarily focus on visual elements and often overlook other modalities, such as text, which can provide valuable semantic signals. To overcome this limitation, we propose a Multi-modal Contrastive Transformer for Video Highlight Detection (MCT-VHD). This transformer-based network mainly utilizes the video and audio modalities, along with auxiliary text features (when available), for video highlight detection. Specifically, we enhance the temporal connections within the video by integrating a convolution-based local enhancement module into the transformer blocks. Furthermore, we explore three multi-modal fusion strategies to improve highlight inference performance and employ a contrastive objective to facilitate interactions between different modalities. Comprehensive experiments conducted on three benchmark datasets validate the effectiveness of MCT-VHD, and our ablation studies provide valuable insights into its essential components.
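
To make the architectural idea concrete, the sketch below shows one plausible way a convolution-based local enhancement module can be added to a standard transformer block, as the abstract describes. This is a minimal illustrative implementation only; the module design, kernel size, and residual placement are assumptions and are not taken from the MCT-VHD paper itself.

```python
# Minimal sketch (PyTorch): a transformer block augmented with a
# convolution-based local enhancement branch over the temporal axis.
# All design choices below (depthwise conv, kernel size, pre-norm,
# residual summation) are illustrative assumptions.
import torch
import torch.nn as nn


class LocalEnhancement(nn.Module):
    """Depthwise 1-D convolution along time to strengthen local temporal cues."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) -> convolve along time, restore shape.
        return self.conv(x.transpose(1, 2)).transpose(1, 2)


class EnhancedTransformerBlock(nn.Module):
    """Pre-norm transformer block with an added local enhancement branch."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local = LocalEnhancement(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        # Global self-attention and local convolution combined residually.
        x = x + attn_out + self.local(h)
        return x + self.ffn(self.norm2(x))


if __name__ == "__main__":
    clips = torch.randn(2, 32, 256)    # (batch, clip sequence length, dim)
    block = EnhancedTransformerBlock()
    print(block(clips).shape)          # torch.Size([2, 32, 256])
```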
