Abstract

Group activity recognition is a central research topic in video understanding with many practical applications, such as crowd behavior monitoring and video surveillance. To understand a multi-person or group action, a model must not only identify each individual's action in context but also describe the collective activity. Many previous works adopt skeleton-based approaches with graph convolutional networks for group activity recognition. However, these approaches are limited in scalability, robustness, and interoperability. In this paper, we propose 3DMesh-GAR, a novel approach to 3D human body Mesh-based Group Activity Recognition, which relies on a body center heatmap, a camera map, and a mesh parameter map instead of the complex and noisy 3D skeleton of each person in the input frames. We adopt a 3D mesh creation method that is conceptually simple, single-stage, and bounding-box-free, and that can handle highly occluded, multi-person scenes without any additional computational cost. We evaluate 3DMesh-GAR on a standard group activity dataset, the Collective Activity Dataset, and achieve state-of-the-art performance for group activity recognition.

Highlights

  • The purpose of human activity recognition is to identify what a human is doing in a scene using images and video frames, or inertial, environmental, and physiological sensor data [1].

  • We evaluate the performance of our method on a benchmark group activity recognition dataset, the Collective Activity Dataset [30].

  • To analyze the performance of the proposed method, we present confusion matrices on the Collective Activity Dataset for multi-person activity recognition (see Figure 4).
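
As a rough illustration of the confusion-matrix analysis mentioned in the last highlight, the sketch below builds a row-normalized confusion matrix from ground-truth and predicted activity labels. The function name, the class list, and the label sequences are hypothetical placeholders chosen for illustration. They are not the paper's evaluation code or its results.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, classes):
    """Return a row-normalized confusion matrix as a nested dict.

    Rows are ground-truth classes and columns are predicted classes,
    so each row shows how samples of one true class were distributed
    over the predicted classes.
    """
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = {}
    for truth in classes:
        row_total = sum(counts[(truth, pred)] for pred in classes)
        matrix[truth] = {
            pred: (counts[(truth, pred)] / row_total if row_total else 0.0)
            for pred in classes
        }
    return matrix


# Hypothetical labels and predictions, for illustration only
# (assumed placeholders, not results reported in the paper).
classes = ["crossing", "waiting", "queueing"]
y_true = ["crossing", "crossing", "waiting", "queueing", "waiting", "queueing"]
y_pred = ["crossing", "waiting", "waiting", "queueing", "waiting", "queueing"]

cm = confusion_matrix(y_true, y_pred, classes)
for truth in classes:
    print(truth, [round(cm[truth][pred], 2) for pred in classes])
```

Row normalization makes the diagonal entries read as per-class recall, which is the usual way to see which activities are confused with one another.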


Summary

Introduction

The purpose of human activity recognition is to identify what a human is doing in a scene using images and video frames, or inertial, environmental, and physiological sensor data [1]. It is one of the most active areas of research and a highly significant component of the computer vision and computer graphics fields. When recognizing group activity, it is critical to determine the individual activities and interactions of people, because these actions and interactions often constitute the group activity. Recent studies for this task have explored different feature representations, such as optical flow [5], RGB (i.e., red, green, blue) image sequences [6,7], human skeletons [8], depth image sequences [9], and audio waves [10].

