Localization and recognition of human action in 3D using transformers

Jiankai Sun,Linjiang Huang,Hongsong Wang,Chuanyang Zheng,Jianing Qiu,Md Tauhidul Islam,Enze Xie,Bolei Zhou,Lei Xing,Arjun Chandrasekaran,Michael J Black

doi:10.1038/s44172-024-00272-7

Abstract

Understanding a person’s behavior from their 3D motion sequence is a fundamental problem in computer vision with many applications. An important component of this problem is 3D action localization, which involves recognizing what actions a person is performing, and when the actions occur in the sequence. To promote the progress of the 3D action localization community, we introduce a new, challenging, and more complex benchmark dataset, BABEL-TAL (BT), for 3D action localization. Important baselines and evaluating metrics, as well as human evaluations, are carefully established on this benchmark. We also propose a strong baseline model, i.e., Localizing Actions with Transformers (LocATe), that jointly localizes and recognizes actions in a 3D sequence. The proposed LocATe shows superior performance on BABEL-TAL as well as on the large-scale PKU-MMD dataset, achieving state-of-the-art performance by using only 10% of the labeled training data. Our research could advance the development of more accurate and efficient systems for human behavior analysis, with potential applications in areas such as human-computer interaction and healthcare.

Full Text