Abstract

Team sports game videos feature complex backgrounds, fast-moving targets, and mutual occlusion between targets, which poses great challenges to multiperson collaborative video analysis. This paper proposes a video semantic extraction method that integrates domain knowledge and deep features, applicable to the analysis of multiperson collaborative basketball game videos, where a semantic event is modeled as an adversarial relationship between two teams of players. We first design a scheme that combines a dual-stream network with learnable spatiotemporal feature aggregation, which can be used for end-to-end training of video semantic extraction to bridge the gap between low-level features and high-level semantic events. We then propose an algorithm based on knowledge from different video sources to extract action semantics. The algorithm aggregates local convolutional features over the entire space-time range and can track the ball, shooter, and hoop to realize automatic semantic extraction from basketball game videos. Experiments show that the proposed scheme can effectively identify the four shot categories (short, medium, long, and free throw), scoring events, and the semantics of athletes' actions from basketball game footage.
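
The paper itself publishes no code; the following is a minimal PyTorch sketch, under stated assumptions, of the kind of architecture the abstract describes: an appearance (RGB) stream and a motion (optical-flow) stream whose per-frame features are pooled by a learnable temporal aggregation layer before event classification. All module names, layer sizes, and the attention-based pooling (standing in for the paper's NetVLAD-style aggregation) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TwoStreamEventClassifier(nn.Module):
    """Illustrative two-stream network with learnable temporal aggregation.

    A sketch of the architecture described in the abstract; all layer
    sizes and the attention pooling are assumptions, not the paper's code.
    """
    def __init__(self, feat_dim=256, num_events=5):
        super().__init__()
        # Appearance stream: per-frame RGB features.
        self.rgb_stream = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        # Motion stream: optical flow (2 channels per frame pair).
        self.flow_stream = nn.Sequential(
            nn.Conv2d(2, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        # Learnable aggregation over time: attention-weighted pooling,
        # a simple stand-in for a NetVLAD-style layer.
        self.attn = nn.Linear(2 * feat_dim, 1)
        self.classifier = nn.Linear(2 * feat_dim, num_events)

    def forward(self, rgb, flow):
        # rgb: (B, T, 3, H, W); flow: (B, T, 2, H, W)
        B, T = rgb.shape[:2]
        r = self.rgb_stream(rgb.flatten(0, 1)).view(B, T, -1)
        f = self.flow_stream(flow.flatten(0, 1)).view(B, T, -1)
        x = torch.cat([r, f], dim=-1)             # (B, T, 2*feat_dim)
        w = torch.softmax(self.attn(x), dim=1)    # temporal attention weights
        video_feat = (w * x).sum(dim=1)           # aggregate over time
        return self.classifier(video_feat)        # event logits
```

A clip would be passed as a batch of T RGB frames plus T flow fields; the softmax attention weights play the role of the learnable spatiotemporal aggregation that fuses frame-level evidence into one event-level descriptor.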

Highlights

  • In recent years, the spectacle of and attention to various sports events have been increasing; a large number of sports events are broadcast and shared on the Internet as videos, and sports videos have become an efficient and irreplaceable force for information dissemination on the Internet [1]

  • The traditional method of manually annotating, categorizing, and integrating the events occurring in a video requires substantial human resources and has a high error rate, because manual annotation is influenced by subjective human factors and cannot meet the varied needs of different people [2]

  • The contributions of this paper are as follows: (1) We first design a scheme that combines a dual-stream network with learnable spatiotemporal feature aggregation, which can be used for end-to-end training of video semantic extraction to bridge the gap between low-level features and high-level semantic events

Summary

Introduction

The spectacle of and attention to various sports events have been increasing; a large number of sports events are broadcast and shared on the Internet as videos, and sports videos have become an efficient and irreplaceable force for information dissemination on the Internet [1]. Basketball video events are classified by combining sports domain knowledge; the global and group motion patterns of players in basketball games are then expressed using a spatiotemporal extension of the NetVLAD [8] aggregation layer from deep learning [9], a layer known to perform well on instance-level recognition tasks in still images. The study in [27] implements event detection with an extended CNN, introducing cascaded CNN networks to express local motion information while incorporating trajectory information, to achieve event analysis of surveillance videos; similar to action recognition, event analysis can also be implemented with a dual-stream CNN model, which introduces both spatial information and temporal features from a global perspective. In [29], player tracking is first achieved based on keyframes, player state changes are then modeled by an LSTM, and player information is fused via different team-division methods to achieve event representation.
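
For concreteness, here is a minimal sketch of a NetVLAD-style aggregation layer in the spirit of [8], written in PyTorch; the spatiotemporal extension referred to above amounts to treating the local descriptors from all frames of a clip as a single set. The cluster count and feature dimension are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """NetVLAD-style aggregation over a set of local descriptors.

    A sketch after the aggregation layer of [8]; for video, the
    descriptor set spans all frames (the spatiotemporal extension).
    """
    def __init__(self, num_clusters=16, dim=256):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)  # soft-assignment logits

    def forward(self, x):
        # x: (B, N, dim) — N local descriptors pooled over space and time.
        soft = F.softmax(self.assign(x), dim=-1)           # (B, N, K)
        residual = x.unsqueeze(2) - self.centroids         # (B, N, K, dim)
        vlad = (soft.unsqueeze(-1) * residual).sum(dim=1)  # (B, K, dim)
        vlad = F.normalize(vlad, dim=-1)                   # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)        # (B, K*dim)
```

Each descriptor is softly assigned to learnable cluster centers, and the residuals to each center are summed and normalized, yielding a fixed-length clip descriptor regardless of how many frames or local features the clip contains.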

The Proposed Dual-Stream Network
Simulation Experiment and Result Analysis
Findings
Conclusions