Abstract

Recurrent neural network (RNN) models have proven successful in modeling the temporal dynamics of videos; among them, Long Short-Term Memory (LSTM) networks have been particularly successful because they do not suffer from the vanishing-gradient problem. Combined with Convolutional Neural Networks (CNNs) for visual feature extraction, they are popularly referred to as Long-term Recurrent Convolutional Networks (LRCN) and have been widely adopted in recent times for tasks such as video activity classification, video captioning, and video description. The features for these models may be generated from a single spatial stream or from dual streams, i.e., both spatial and motion streams, extracted from the video frames. This paper studies how state-of-the-art networks such as ResNet50, InceptionV3, and MobileNet perform, with fine-tuning, as spatial feature extractors for the task of activity recognition in videos using an LRCN with stacked LSTMs. The fine-tuning approach and the optimization settings used to extract visual features from these state-of-the-art pretrained networks are also discussed.
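
As a rough illustration of the LRCN pipeline the abstract describes, the Keras sketch below wraps a pretrained MobileNet backbone in a TimeDistributed layer to extract per-frame spatial features and feeds the resulting feature sequence to stacked LSTMs for activity classification. This is not the paper's implementation: the frame count, layer widths, number of classes, which layers are unfrozen for fine-tuning, and the optimizer settings are all assumptions made for the example.

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES = 16    # frames sampled per clip (assumed)
FRAME_SIZE = 224   # input resolution expected by MobileNet
NUM_CLASSES = 10   # number of activity classes (assumed)

# Pretrained MobileNet as the spatial feature extractor;
# include_top=False drops the ImageNet classifier head and
# pooling="avg" yields one feature vector per frame.
backbone = tf.keras.applications.MobileNet(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(FRAME_SIZE, FRAME_SIZE, 3))

# Freeze the backbone initially; for fine-tuning, unfreeze the
# top few layers (the split point here is an assumption).
backbone.trainable = False
for layer in backbone.layers[-10:]:
    layer.trainable = True

model = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, FRAME_SIZE, FRAME_SIZE, 3)),
    # Apply the CNN to every frame independently.
    layers.TimeDistributed(backbone),
    # Stacked LSTMs model the temporal dynamics of the frame features.
    layers.LSTM(256, return_sequences=True),
    layers.LSTM(128),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

Swapping ResNet50 or InceptionV3 for MobileNet only requires changing the backbone constructor (and the input resolution InceptionV3 expects, 299x299), which is what makes this architecture convenient for comparing spatial feature extractors.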
