Video action recognition is an important research direction in computer vision and pattern recognition, with extensive applications in intelligent video surveillance, human-computer interaction, and sports analysis. Advances in data storage and computing hardware over the past decade have driven a shift from traditional feature extraction and machine learning algorithms to deep learning-based approaches. This paper reviews the current state of development, open problems, and future research directions of video action recognition techniques. Traditional methods are gradually being replaced by deep learning methods such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory networks (LSTMs). These methods extract features automatically and model temporal dependencies, significantly improving the accuracy and robustness of action recognition. In particular, attention-based models, a current research hot spot, further enhance recognition performance by dynamically adjusting the focus of attention. Despite these advances, video action recognition still faces several challenges, including high computational resource requirements, complex model training, dataset bias, and variations in real-world application scenarios such as viewpoint changes, lighting changes, and occlusion. Future research can explore multi-modal fusion, lightweight models, self-supervised learning, and cross-domain transfer learning to improve the accuracy, robustness, and generalization of action recognition. This review aims to offer researchers a comprehensive perspective on the current state and future research directions of video action recognition technology.