Abstract

Action recognition is an application that ideally requires real-time results. We focus on single-image-based action recognition rather than video-based recognition because it is faster and computationally cheaper. However, a single image contains limited information, which makes single-image-based action recognition a difficult problem. To obtain an accurate representation of action classes, we propose three feature-stream-based shallow sub-networks (image-based, attention-image-based, and part-image-based feature networks) built on a deep pose estimation network and trained in a multitask manner. Moreover, we design a multitask-aware loss function so that the proposed method can be adaptively trained with heterogeneous datasets in which only human pose annotations or only action labels are available (rather than both), which makes it easier to apply the proposed approach to new data for behavioral analysis in intelligent systems. Our extensive experiments show that these streams carry complementary information and, hence, that the fused representation is robust in distinguishing diverse fine-grained action classes. Unlike other methods, the proposed method learns human pose information from heterogeneous datasets in a multitask manner; nevertheless, it achieves 91.91% mean average precision on the Stanford 40 Actions Dataset. We also demonstrate that the proposed method can be flexibly applied to the multi-label action recognition problem on the V-COCO Dataset.
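
The idea of training on heterogeneous datasets, where a sample may carry only a pose annotation or only an action label, can be illustrated with a masked multitask loss. The following is a minimal sketch under stated assumptions, not the loss defined in the paper: the mean-squared-error pose term, the softmax cross-entropy action term, the weights `w_pose` and `w_action`, and all function and field names are hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy for one sample, using the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def pose_mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and annotated keypoint values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def multitask_loss(samples: List[Dict], w_pose: float = 1.0,
                   w_action: float = 1.0) -> float:
    """Average each task loss only over samples that carry that annotation.

    Assumption: a missing annotation is stored as None, so a pose-only
    sample contributes nothing to the action term and vice versa.
    """
    pose_losses, action_losses = [], []
    for s in samples:
        if s.get("pose_target") is not None:
            pose_losses.append(pose_mse(s["pose_pred"], s["pose_target"]))
        if s.get("action_label") is not None:
            action_losses.append(
                cross_entropy(s["action_logits"], s["action_label"]))

    total = 0.0
    if pose_losses:  # skip the term entirely if no sample has pose labels
        total += w_pose * sum(pose_losses) / len(pose_losses)
    if action_losses:  # likewise for action labels
        total += w_action * sum(action_losses) / len(action_losses)
    return total


# Hypothetical mixed batch: one pose-only sample, one action-only sample.
batch = [
    {"pose_pred": [0.5, 0.5], "pose_target": [0.0, 1.0],
     "action_logits": [0.0, 0.0], "action_label": None},
    {"pose_pred": [0.2, 0.3], "pose_target": None,
     "action_logits": [0.0, 0.0], "action_label": 1},
]
loss = multitask_loss(batch)
```

Averaging each term over only the annotated samples keeps a batch dominated by one dataset from diluting the other task's gradient; whether the paper normalizes this way is not stated in the abstract.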

Highlights

  • The action recognition problem [1,2] can be solved using a video or a single image

  • According to the experimental results, the proposed method achieved 91.91% mean average precision on the Stanford 40 Actions Dataset [15], while having very little effect on the pose estimation task in our multitask learning framework

  • Human action recognition can be considered a fine-grained classification problem, which we address with our part-image-based feature stream

Summary

Introduction

The action recognition problem [1,2] can be solved using a video or a single image. Image-based action recognition is the basis for applications such as video-based action recognition and visual question answering. We focus on action recognition from a single image. To infer the human action, the human pose is an essential component [4,5,6,7] because a variety of human behaviors can be categorized based on the pose (such as standing and sitting). A single pose such as “Raising arms”, however, can be further categorized into fine-grained classes such as “Brushing teeth” and “Waving hands”. Action recognition can therefore be considered a fine-grained classification problem.
