Abstract
Video action recognition is an important problem in computer vision. In this paper, we develop a novel model, named Separable ConvNet Spatiotemporal Mixer (SCSM). Our goal is to develop an efficient and lightweight action recognition backbone that can be applied to multi-task models to improve both accuracy and processing speed. SCSM introduces a new hierarchical spatial compression with a spatiotemporal fusion scheme consisting of a spatial domain and a temporal domain: each frame is processed independently in the spatial domain for feature extraction, and the resulting features are fused across frames in the temporal domain. The architecture is highly scalable and can be adapted to different frame rate requirements, making it suitable as a backbone for multi-task video feature extraction or for industrial applications with low prediction and training costs. Experimental results show that SCSM has a small number of parameters and low computational complexity, giving it high scalability and strong transfer learning capabilities. The model achieves video action recognition accuracy comparable to state-of-the-art models with a smaller parameter size and fewer computational requirements.
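The two-stage scheme described above (frame-independent spatial feature extraction, then fusion along the time axis) can be sketched with plain numpy as follows. This is a minimal illustration of the data flow only, not the paper's actual architecture: the function names, the global-average-pool extractor, the fixed random projection, and the mean-based temporal fusion are all hypothetical stand-ins chosen for clarity.

```python
import numpy as np

def spatial_extract(frame, out_dim=8):
    # Hypothetical per-frame feature extractor: global average pooling
    # over spatial positions, followed by a fixed linear projection.
    pooled = frame.mean(axis=(0, 1))              # (C,)
    rng = np.random.default_rng(0)                # fixed weights for illustration
    w = rng.standard_normal((pooled.size, out_dim))
    return pooled @ w                             # (out_dim,)

def temporal_mix(features):
    # Hypothetical temporal fusion: a simple mean over the time axis
    # stands in for the paper's spatiotemporal mixer.
    return features.mean(axis=0)

def scsm_sketch(video):
    # video: (T, H, W, C). Each frame is encoded independently
    # (spatial domain), then the per-frame features are fused
    # along time (temporal domain).
    feats = np.stack([spatial_extract(f) for f in video])  # (T, out_dim)
    return temporal_mix(feats)                             # (out_dim,)

clip = np.zeros((16, 32, 32, 3))   # 16 frames of 32x32 RGB
print(scsm_sketch(clip).shape)     # -> (8,)
```

Because the spatial stage treats frames independently, the same extractor weights apply regardless of clip length, which is what makes the design adaptable to different frame rates.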