Abstract

Human pose estimation has long been motivated by its applications in human behavior understanding and activity recognition. Despite recent advances in multi-person pose estimation, existing solutions still struggle in crowded scenes, especially in classrooms, where students heavily overlap one another and adopt varied poses. In this paper, we focus on improving human pose estimation in crowded classrooms from the perspective of crowd detection and pose refinement. Specifically, we first follow a top-down strategy: we detect persons in a multi-instance prediction manner and perform single-person pose estimation on each detected human region. Then, the estimated poses are refined with Transformer blocks that capture the interactions among multiple persons in the image. Importantly, we replace the self-attention in the Transformer with a lightweight attention mechanism to reduce computational complexity. Quantitative and qualitative experiments demonstrate that our method outperforms previous methods by a clear margin on both standard benchmarks and self-collected classroom images.
