Abstract
Combining convolutional neural networks (CNNs) and recurrent neural networks (RNNs) produces a powerful architecture for video classification, as spatial–temporal information can be processed simultaneously and effectively. Using transfer learning, this paper presents a comparative study investigating how temporal information can be exploited to improve video classification performance when CNNs and RNNs are combined in various architectures. To enhance the performance of the identified architecture for effectively combining CNNs and RNNs, a novel action template-based keyframe extraction method is proposed: the informative region of each frame is identified, and keyframes are selected based on the similarity between those regions. Extensive experiments with ConvLSTM-based video classifiers have been conducted on the KTH and UCF-101 datasets. Experimental results, evaluated using one-way analysis of variance, show that the proposed keyframe extraction method significantly improves video classification accuracy.
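The abstract does not spell out how the informative region or the action template is computed, but the general idea of locating a motion region per frame and keeping only frames whose regions differ sufficiently can be sketched as follows. This is a hypothetical NumPy sketch: the frame-differencing heuristic, the normalised-correlation similarity, and the names `informative_region`, `region_similarity`, and `select_keyframes` are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def informative_region(frame, prev_frame, threshold=25):
    """Crop of the current frame around pixels that changed since the
    previous frame (frame differencing; a stand-in for the paper's
    action template)."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    ys, xs = np.nonzero(diff > threshold)
    if ys.size == 0:
        return None  # no motion detected in this frame
    return frame[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

def region_similarity(a, b):
    """Normalised correlation between two regions, cropped to their
    common shape (a crude similarity proxy)."""
    h, w = min(a.shape[0], b.shape[0]), min(a.shape[1], b.shape[1])
    a = a[:h, :w].astype(np.float64) - a[:h, :w].mean()
    b = b[:h, :w].astype(np.float64) - b[:h, :w].mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 1.0

def select_keyframes(frames, sim_threshold=0.9):
    """Keep a frame only when its informative region is sufficiently
    dissimilar to the region of the last selected keyframe."""
    keyframes, last_region = [], None
    for i in range(1, len(frames)):
        region = informative_region(frames[i], frames[i - 1])
        if region is None:
            continue
        if last_region is None or region_similarity(region, last_region) < sim_threshold:
            keyframes.append(i)
            last_region = region
    return keyframes
```

Selecting keyframes this way shortens each clip to its most informative frames before the ConvLSTM sees it, which is the role the abstract assigns to the proposed method.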
Highlights
Video has become more popular in many applications in recent years due to increased storage capacity, more advanced network architectures, as well as easy access to digital cameras, especially in mobile phones
This paper presents a comparative study to investigate how temporal information can be utilized to improve the performance of video classification when convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are combined in various architectures
To enhance the performance of the identified architecture for effective combination of CNN and RNN, a novel action template-based keyframe extraction method is proposed by identifying the informative region of each frame and selecting keyframes based on the similarity between those regions
Summary
Video has become more popular in many applications in recent years due to increased storage capacity, more advanced network architectures, and easy access to digital cameras, especially in mobile phones. More than 500 h of video are uploaded to the Internet every minute, and this sharp rise in the number of videos is expected to continue in the coming decades due to the growing demand for video content [1]. This rapid growth brings serious challenges for automatic video analysis. Although combining CNNs and RNNs has achieved good results [7, 8], the representation of temporal information is still a demanding problem due to complex variations in actions and dynamic backgrounds in videos.
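The abstract states that classification accuracies were compared with one-way analysis of variance. As a reminder of what that test computes, the F-statistic for k groups of observations (here, accuracies from repeated runs of each configuration) can be written in a few lines of plain Python; this is a generic sketch of the standard test, not the paper's evaluation code.

```python
def one_way_anova(groups):
    """One-way ANOVA F-statistic for k groups of observations.
    A large F suggests the group means differ more than chance allows."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    # Between-group sum of squares, df = k - 1
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares, df = n - k
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

For example, `one_way_anova([[1, 2, 3], [2, 3, 4]])` returns F = 1.5, and two identical groups give F = 0; in practice one would compare F against the F-distribution (e.g. via `scipy.stats.f_oneway`) to obtain a p-value.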