Abstract
Human action recognition is the supervised task of labeling an entire video sequence with action labels in computer vision. Unlike recognition on static images, it must also model the relationships between video frames, such as the temporal dynamics that reflect how actions change over the course of a video. In existing deep learning methods, because of the limited size of the convolution kernel, models take a small number of consecutive frames as input and are trained to assign feature vectors to short clips rather than to the entire sequence. Consequently, even if the learned features contain temporal information, their evolution over the full video is ignored. In this work, we propose a two-stream deep fusion framework that fully exploits the long-term information in a video. We preprocess the video into static frames and optical flow maps and feed them into a three-dimensional convolutional neural network to obtain a time-ordered spatiotemporal feature stream. This feature stream is then fed into a simple recurrent unit (SRU) network to learn long-term sequence features along the time dimension. Finally, a softmax classifier performs the classification. We evaluated our model on the classic action recognition benchmarks UCF-101 and HMDB-51, where it achieved better performance than existing methods.
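The pipeline described above (two streams through a 3D CNN, a simple-recurrent-unit layer over the resulting feature sequence, then a softmax classifier) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the layer widths, the single-convolution feature extractors, and the `SRUCellLite`/`TwoStreamSRU` names are all assumptions for demonstration.

```python
import torch
import torch.nn as nn

class SRUCellLite(nn.Module):
    """Minimal simple-recurrent-unit layer: gates depend only on the input,
    and the recurrence over time is elementwise (the key SRU idea)."""
    def __init__(self, d):
        super().__init__()
        self.lin = nn.Linear(d, 3 * d)  # produces candidate, forget gate, reset gate

    def forward(self, x):                      # x: (T, B, d)
        T, B, d = x.shape
        cand, f, r = self.lin(x).chunk(3, dim=-1)
        f, r = torch.sigmoid(f), torch.sigmoid(r)
        c = torch.zeros(B, d)
        hs = []
        for t in range(T):                     # cheap elementwise recurrence
            c = f[t] * c + (1 - f[t]) * cand[t]
            hs.append(r[t] * torch.tanh(c) + (1 - r[t]) * x[t])  # highway connection
        return torch.stack(hs)                 # (T, B, d)

class TwoStreamSRU(nn.Module):
    """Sketch: RGB-frame and optical-flow streams -> 3D conv features
    -> fused temporal sequence -> SRU -> softmax over action classes."""
    def __init__(self, n_classes=101, d=64):
        super().__init__()
        # One toy 3D conv block per stream: RGB frames (3 ch), optical flow (2 ch).
        self.rgb = nn.Sequential(nn.Conv3d(3, d, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool3d((None, 1, 1)))
        self.flow = nn.Sequential(nn.Conv3d(2, d, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool3d((None, 1, 1)))
        self.sru = SRUCellLite(2 * d)
        self.cls = nn.Linear(2 * d, n_classes)

    def forward(self, frames, flow):           # each: (B, C, T, H, W)
        a = self.rgb(frames).flatten(2).permute(2, 0, 1)   # (T, B, d)
        b = self.flow(flow).flatten(2).permute(2, 0, 1)    # (T, B, d)
        seq = torch.cat([a, b], dim=-1)        # fuse the two streams per time step
        h = self.sru(seq)[-1]                  # last hidden state summarizes the video
        return self.cls(h).softmax(-1)         # class probabilities
```

A real implementation would use a deep 3D CNN (e.g. C3D-style stacks) per stream and a multi-layer SRU; the sketch only shows how the spatiotemporal features flow from the convolutional stage into the recurrent stage.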