Abstract
This paper addresses the problem of event detection and localization in long football (soccer) videos. Our key idea is that understanding long-range dependencies between video frames is imperative for accurate event localization in long football videos. Additionally, fast movements in football videos are unlikely to be detected correctly without considering mid-range and short-range correlations between neighboring frames. We argue that event spotting can be considerably improved by considering short-range to long-range frame dependencies in a unified architecture. To model long-range and mid-range dependencies, we propose to use a dilated recurrent neural network (DilatedRNN) with long short-term memory (LSTM) units, grounded on two-stream convolutional neural network (two-stream CNN) features. While the two-stream CNN extracts the local spatiotemporal features needed for fine-level details, the DilatedRNN makes information from distant frames available to the classifier and spotting algorithms. Evaluating our event spotting algorithm on the largest publicly available benchmark football dataset, SoccerNet, shows an accuracy improvement of 0.8%-13.6% over the state of the art, and up to a 30.1% accuracy gain over the baselines. We also investigate the contribution of each neural network component to spotting accuracy through an extensive ablation study.
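The architecture described above can be pictured as per-frame two-stream features feeding a stack of dilated recurrent layers whose outputs are classified frame by frame. The following is a minimal, hypothetical PyTorch sketch of that idea, not the authors' implementation: it assumes precomputed two-stream frame features of size feat_dim, and names such as DilatedLSTMLayer, EventSpotter, hidden_dim, and num_events are illustrative; the paper's actual layer sizes, feature pooling, and spotting post-processing are not reproduced here.

```python
# Minimal sketch (assumed, not the authors' code): stacked dilated LSTM layers
# over precomputed per-frame two-stream features, followed by per-frame scores.
import torch
import torch.nn as nn


class DilatedLSTMLayer(nn.Module):
    """One recurrent layer with dilated skip connections (dilation d):
    the sequence is split into d interleaved sub-sequences, each processed
    by a shared LSTM, then re-interleaved into the original frame order."""

    def __init__(self, in_dim, hidden_dim, dilation):
        super().__init__()
        self.dilation = dilation
        self.lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True)

    def forward(self, x):                       # x: (batch, time, in_dim)
        b, t, c = x.shape
        d = self.dilation
        pad = (-t) % d                          # pad so time is divisible by d
        if pad:
            x = torch.cat([x, x.new_zeros(b, pad, c)], dim=1)
        tp = x.size(1)
        # (batch, tp, c) -> (batch*d, tp//d, c): one sub-sequence per dilation offset
        xs = x.reshape(b, tp // d, d, c).permute(0, 2, 1, 3).reshape(b * d, tp // d, c)
        ys, _ = self.lstm(xs)
        h = ys.size(-1)
        # interleave back to the original frame order and drop the padding
        y = ys.reshape(b, d, tp // d, h).permute(0, 2, 1, 3).reshape(b, tp, h)
        return y[:, :t]


class EventSpotter(nn.Module):
    """Two-stream frame features -> stacked dilated LSTMs -> per-frame event scores."""

    def __init__(self, feat_dim=1024, hidden_dim=256, num_events=4, num_layers=3):
        super().__init__()
        layers, in_dim = [], feat_dim
        for l in range(num_layers):             # dilations 1, 2, 4, ...
            layers.append(DilatedLSTMLayer(in_dim, hidden_dim, dilation=2 ** l))
            in_dim = hidden_dim
        self.rnn = nn.Sequential(*layers)
        self.classifier = nn.Linear(hidden_dim, num_events)

    def forward(self, feats):                    # feats: (batch, time, feat_dim)
        return self.classifier(self.rnn(feats))  # (batch, time, num_events)


if __name__ == "__main__":
    model = EventSpotter()
    frame_feats = torch.randn(2, 120, 1024)      # e.g. 120 frames of pooled two-stream features
    print(model(frame_feats).shape)              # torch.Size([2, 120, 4])
```

Stacking layers with dilations 1, 2, 4, ... lets the lowest layer capture short-range correlations between neighboring frames while higher layers aggregate progressively longer-range context, mirroring the short-to-long-range dependency modeling described in the abstract.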
Highlights
Sports video analysis has been an active area of research in the last few years [1]–[10]
We have proposed a new approach for event spotting in long football videos
The main hypothesis is that modeling short-range, mid-range, and long-range dependencies between video frames in long football videos should help with more accurate event spotting
Summary
Sports video analysis has been an active area of research in the last few years [1]–[10]. There is great demand for streaming, sharing, and annotation platforms for sports videos. Although automatic sports video analysis reduces the manual effort required to search video content, current approaches have practical limitations, so cloud-based services offer only limited functionality and tools for this purpose. Two important challenges in analyzing videos are 1) localizing the key moments in a video, and 2) classifying the localized key moments into event categories. While the former concerns the temporal segmentation of a given video, the latter focuses on classifying the content of a short segment of the video.