Abstract

While human activity recognition and pose estimation are closely related, they are usually treated as separate tasks. In this thesis, two-dimensional and three-dimensional poses are estimated for human activity recognition in video sequences, and the final activity is determined by combining the pose estimates with an activity recognition algorithm that uses visual attention. Both problems can be solved efficiently with a single architecture. It is also shown that end-to-end optimization leads to much higher accuracy than separate learning. The proposed architecture can be trained seamlessly with different categories of data. For visual attention, soft attention is used together with a multilayer recurrent neural network based on long short-term memory (LSTM) that can be applied both temporally and spatially. The image, the estimated pose skeleton, and RGB-based activity recognition data are combined to determine the final activity, which increases reliability. The visual attention model is evaluated on the UCF-11 (YouTube Action), HMDB-51 and Hollywood2 data sets, with an analysis of where the model focuses depending on the scene and the task it is performing. Pose estimation and activity recognition are tested and analyzed on the MPII, Human3.6M, Penn Action and NTU data sets. The test results are 98.9% on Penn Action, 87.9% on NTU, and 88.6% on NW-UCLA.

Highlights

  • Human activity recognition and pose estimation have many applications, such as video-based recognition and human–computer interfaces

  • We propose an activity recognition method that combines visual attention with pose estimation

  • The visual attention algorithm assigns higher weights to the parts of the input that matter for the attention calculation
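
The weighting described in the last highlight can be illustrated with a minimal soft attention sketch. It is not the authors' implementation: the function name `soft_attention`, the 7x7 grid of 8-dimensional region features, and the hand-set relevance scores are all hypothetical. In a recurrent model the scores would typically be computed from the network's hidden state; here they are simply given as input.

```python
import numpy as np

def soft_attention(features, scores):
    """Soft attention over image regions (illustrative sketch).

    features: (L, D) array, one D-dim feature vector per region.
    scores:   (L,) array of unnormalized relevance scores (assumed given).
    Returns the softmax weights (L,) and the weighted feature (D,).
    """
    shifted = scores - scores.max()        # subtract max for numerical stability
    weights = np.exp(shifted)
    weights /= weights.sum()               # softmax: weights are positive and sum to 1
    context = weights @ features           # weighted average of region features
    return weights, context

rng = np.random.default_rng(0)
features = rng.normal(size=(49, 8))        # hypothetical 7x7 grid, 8-dim features
scores = np.zeros(49)
scores[10] = 5.0                           # pretend region 10 is the most relevant
weights, context = soft_attention(features, scores)
```

Because every region keeps a non-zero weight, the whole operation stays differentiable, which is what distinguishes soft attention from selecting a single region.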

Summary

Introduction

Human activity recognition and pose estimation have many applications, such as video-based recognition and human–computer interfaces. Because most pose estimation methods predict heat maps, the two tasks have not yet been combined and jointly optimized. This detection-based approach requires an argmax function to recover the joint coordinates as a post-processing step; since argmax is not differentiable, it breaks the backpropagation chain required for end-to-end learning. If this problem is solved, pose estimation and activity recognition, which are very closely related, can be processed together to achieve higher accuracy. Xu et al. [17] mainly worked on caption generation for static images, whereas this paper focuses on using the soft attention mechanism for activity recognition in video. The regions marked in white are those given a higher weight by visual attention.
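
To make the heat map issue concrete, the sketch below contrasts the hard argmax with soft-argmax, one common differentiable replacement. This summary does not say which workaround the authors adopt, so the code should be read as an illustration under that assumption; the function names, the `beta` sharpness parameter, and the toy 5x7 heat map are all hypothetical.

```python
import numpy as np

def hard_argmax_2d(heatmap):
    """Return (x, y) of the peak. Piecewise constant, so it gives no useful gradient."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return float(col), float(row)

def soft_argmax_2d(heatmap, beta=1.0):
    """Return the expected (x, y) under a softmax of the heat map.

    Built only from exp, sums and products, so it is differentiable
    with respect to the heat map values.
    """
    h, w = heatmap.shape
    logits = beta * (heatmap - heatmap.max())   # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()                        # normalize to a probability map
    x = (probs.sum(axis=0) * np.arange(w)).sum()  # expectation over columns
    y = (probs.sum(axis=1) * np.arange(h)).sum()  # expectation over rows
    return float(x), float(y)

heatmap = np.zeros((5, 7))
heatmap[3, 4] = 10.0                            # single sharp peak at x=4, y=3
hard_xy = hard_argmax_2d(heatmap)
soft_xy = soft_argmax_2d(heatmap)
```

For a sharply peaked heat map the two estimates nearly coincide, but only the soft version lets a loss on the joint coordinates propagate gradients back into the heat map.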

Activity Recognition
Pose Estimation
Loss Function and Attention Penalty
Pose Sequence Modelling
Experiments
Datasets
Experimental Environment and Parameter Setting
Experiment Result
Methods
Findings
Conclusions