Abstract

This paper presents a transformer-based multimodal soccer scene recognition method that uses both visual and audio modalities. Our approach feeds the original video frames and the audio spectrogram of a soccer video directly into a transformer model, which captures both the spatial information of an action at a given moment and the contextual temporal relationships between different actions in the video. We fuse the video-frame and audio-spectrogram information output by the transformer model to better identify scenes that occur in real soccer matches: a late-fusion step computes a weighted average of the visual and audio estimation results to obtain complete information about a soccer scene. We evaluate the proposed method on the SoccerNet-V2 dataset and confirm that it achieves the best performance compared with recent state-of-the-art methods.
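The late-fusion step described above can be sketched as a simple weighted average of per-class scores from the two modalities. This is a minimal illustration, not the paper's implementation: the function name `late_fusion` and the weight value `w_visual` are hypothetical, since the abstract does not state the actual fusion weight.

```python
import numpy as np

def late_fusion(visual_probs, audio_probs, w_visual=0.6):
    """Weighted average of per-class scores from the visual and audio branches.

    w_visual is an assumed example weight; the paper's actual weighting
    is not given in the abstract.
    """
    visual_probs = np.asarray(visual_probs, dtype=float)
    audio_probs = np.asarray(audio_probs, dtype=float)
    # Late fusion: combine the two modality-level estimates per class.
    return w_visual * visual_probs + (1.0 - w_visual) * audio_probs

# Hypothetical per-class scores for three scene classes from each branch.
fused = late_fusion([0.7, 0.2, 0.1], [0.5, 0.4, 0.1])
print(fused)           # fused per-class scores
print(fused.argmax())  # index of the predicted scene class
```

Because the fusion happens after each branch produces its own class scores, either modality can be evaluated alone, and the weight can be tuned on a validation set without retraining the branches.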
