Abstract
Convolutional neural networks (CNNs) and Transformer architectures have traditionally been recognized as the preferred models for computer vision tasks. Recently, however, networks based on multi-layer perceptron (MLP) structures, which rely on neither convolution nor attention mechanisms, have surged in popularity. These MLP architectures have demonstrated strong performance in image classification, achieving high accuracy with lower time complexity. Video classification, in contrast, involves larger amounts of data and requires more intricate feature extraction, resulting in greater time and resource consumption. To enhance computational efficiency and minimize resource utilization, we propose Video-MLP, a convolution-free and Transformer-free architecture for video classification. Video-MLP learns video features with a simple MLP structure. Specifically, it comprises two types of layers, Spatial-Mixer and Temporal-Mixer, which capture spatial and temporal information, respectively. The Spatial-Mixer extracts spatial information from each frame along the height and width dimensions, while the Temporal-Mixer models temporal information at the same spatial positions across frames. To improve the efficiency of spatial-temporal modeling, we use a spatial-temporal information fusion approach that integrates information at different scales. Additionally, we group the input data along the time dimension and design three different grouping schemes for extracting temporal information. Experimental results show that Video-MLP achieves accuracy rates of 87.2% on the Kinetics-400 dataset and 75.3% on the Something-Something V2 dataset, outperforming models of equivalent computational complexity. Notably, Video-MLP achieves these results without convolution or attention mechanisms, and without pre-training on large-scale image or video datasets.
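To make the two layer types concrete, the following is a minimal sketch of MLP-based spatial and temporal token mixing as described above. This is not the authors' released code: the class names, layer sizes, and the assumed input layout of patch tokens with shape (batch, frames, tokens per frame, channels) are illustrative assumptions; the fusion and temporal-grouping components are omitted.

```python
# Illustrative sketch only (not the authors' implementation).
# Input: patch tokens of shape (B, T, N, C) = (batch, frames, tokens per frame, channels).
import torch
import torch.nn as nn


class SpatialMixer(nn.Module):
    """Mixes information across the spatial tokens of each frame with an MLP."""
    def __init__(self, num_tokens, hidden_dim):
        super().__init__()
        self.norm = nn.LayerNorm(num_tokens)
        self.mlp = nn.Sequential(
            nn.Linear(num_tokens, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_tokens),
        )

    def forward(self, x):                 # x: (B, T, N, C)
        y = x.transpose(2, 3)             # (B, T, C, N): mix along the spatial-token axis
        y = self.mlp(self.norm(y))
        return x + y.transpose(2, 3)      # residual connection


class TemporalMixer(nn.Module):
    """Mixes information across frames at the same spatial position with an MLP."""
    def __init__(self, num_frames, hidden_dim):
        super().__init__()
        self.norm = nn.LayerNorm(num_frames)
        self.mlp = nn.Sequential(
            nn.Linear(num_frames, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_frames),
        )

    def forward(self, x):                 # x: (B, T, N, C)
        y = x.permute(0, 2, 3, 1)         # (B, N, C, T): mix along the frame axis
        y = self.mlp(self.norm(y))
        return x + y.permute(0, 3, 1, 2)  # back to (B, T, N, C) with a residual connection


# Example usage with hypothetical sizes: 8 frames, 196 tokens per frame, 256 channels.
x = torch.randn(2, 8, 196, 256)
out = TemporalMixer(num_frames=8, hidden_dim=64)(SpatialMixer(num_tokens=196, hidden_dim=384)(x))
print(out.shape)  # torch.Size([2, 8, 196, 256])
```

Because both mixers operate with plain linear layers over a single axis at a time, neither convolution nor attention is involved, which matches the convolution-free and Transformer-free design stated in the abstract.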