Abstract
Convolutional neural networks (CNNs) and Transformer architectures have traditionally been recognized as the preferred models for computer vision tasks. Recently, however, networks based on multi-layer perceptron (MLP) structures, which rely on neither convolution nor attention mechanisms, have surged in popularity. These MLP architectures have demonstrated strong performance in image classification, achieving high accuracy with lower time complexity. Video classification, in contrast, involves larger amounts of data and requires more intricate feature extraction, resulting in greater time and resource consumption. To enhance computational efficiency and minimize resource utilization, we propose Video-MLP, a convolution-free and Transformer-free architecture for video classification. Video-MLP learns video features with a simple MLP structure. Specifically, it comprises two types of layers, Spatial-Mixer and Temporal-Mixer, which capture spatial and temporal information, respectively. The Spatial-Mixer extracts spatial information from each frame along the height and width dimensions, while the Temporal-Mixer models temporal information at the same spatial positions across frames. To improve the efficiency of spatial-temporal modeling, we use a spatial-temporal information fusion approach that integrates information at different scales. Additionally, we group the input data along the time dimension and design three different grouping schemes for extracting temporal information. Experimental results show that Video-MLP achieves accuracy rates of 87.2% on the Kinetics-400 dataset and 75.3% on the Something-Something V2 dataset, outperforming models of equivalent computational complexity. Notably, Video-MLP achieves these results without convolution or attention mechanisms, and without pre-training on large-scale image or video datasets.
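To make the two layer types concrete, the following is a minimal sketch of MLP-based spatial and temporal token mixing as described above. This is not the authors' released code: the class names, layer sizes, and the assumed input layout of patch tokens with shape (batch, frames, tokens per frame, channels) are illustrative assumptions; the fusion and temporal-grouping components are omitted.

```python
# Illustrative sketch only (not the authors' implementation).
# Input: patch tokens of shape (B, T, N, C) = (batch, frames, tokens per frame, channels).
import torch
import torch.nn as nn


class SpatialMixer(nn.Module):
    """Mixes information across the spatial tokens of each frame with an MLP."""
    def __init__(self, num_tokens, hidden_dim):
        super().__init__()
        self.norm = nn.LayerNorm(num_tokens)
        self.mlp = nn.Sequential(
            nn.Linear(num_tokens, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_tokens),
        )

    def forward(self, x):                 # x: (B, T, N, C)
        y = x.transpose(2, 3)             # (B, T, C, N): mix along the spatial-token axis
        y = self.mlp(self.norm(y))
        return x + y.transpose(2, 3)      # residual connection


class TemporalMixer(nn.Module):
    """Mixes information across frames at the same spatial position with an MLP."""
    def __init__(self, num_frames, hidden_dim):
        super().__init__()
        self.norm = nn.LayerNorm(num_frames)
        self.mlp = nn.Sequential(
            nn.Linear(num_frames, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_frames),
        )

    def forward(self, x):                 # x: (B, T, N, C)
        y = x.permute(0, 2, 3, 1)         # (B, N, C, T): mix along the frame axis
        y = self.mlp(self.norm(y))
        return x + y.permute(0, 3, 1, 2)  # back to (B, T, N, C) with a residual connection


# Example usage with hypothetical sizes: 8 frames, 196 tokens per frame, 256 channels.
x = torch.randn(2, 8, 196, 256)
out = TemporalMixer(num_frames=8, hidden_dim=64)(SpatialMixer(num_tokens=196, hidden_dim=384)(x))
print(out.shape)  # torch.Size([2, 8, 196, 256])
```

Because both mixers operate with plain linear layers over a single axis at a time, neither convolution nor attention is involved, which matches the convolution-free and Transformer-free design stated in the abstract.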