In real urban driving scenes, human actions are highly complex, and multiple actions often occur concurrently. Detecting human actions in urban traffic scenes is of great significance for driver-assistance and autonomous driving systems. To this end, we introduce the TITAN-Human Action dataset for the task of multi-person spatial–temporal action detection in urban driving scenes. TITAN-Human Action provides fine-grained action labels and location coordinates for 17,574 persons in the processed frames of the TITAN dataset. Furthermore, we propose a semantics-guided detection network (SGDNet) built around a semantic inference module (SIM) for spatial–temporal human action detection in urban driving scenes. SIM encodes category labels into sentence vectors at the semantic level via prompting and embedding, represents the directed co-occurrence relations between categories as a graph, and applies a graph convolutional network for semantic inference. SGDNet exploits the inference results of SIM to guide the visual branch toward better human action detection, thereby integrating visual and linguistic information. We evaluate SGDNet and several baseline methods on the TITAN-Human Action dataset, and the experiments demonstrate the generalizability of SIM for spatial–temporal human action detection. The source code and annotation files will be available at https://github.com/yyhbswyn/SGDNet.
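To make the abstract's description of SIM concrete, the following is a minimal sketch of graph-convolutional semantic inference over prompted label embeddings. The embedding dimension, two-layer GCN depth, normalization scheme, prompt template, and all names are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of a semantic inference module (SIM) as described above.
# All dimensions, names, and the random stand-in embeddings are assumptions;
# the paper's embedding model and graph construction may differ.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, a_hat):
        return torch.relu(a_hat @ self.weight(h))

class SemanticInferenceModule(nn.Module):
    def __init__(self, embed_dim=512, hidden_dim=256):
        super().__init__()
        self.gcn1 = GCNLayer(embed_dim, hidden_dim)
        self.gcn2 = GCNLayer(hidden_dim, embed_dim)

    @staticmethod
    def normalize_adjacency(cooccur):
        # Row-normalize directed co-occurrence counts (with self-loops)
        # so each category aggregates a weighted average of its neighbors.
        a = cooccur + torch.eye(cooccur.size(0))
        return a / a.sum(dim=1, keepdim=True)

    def forward(self, label_embeddings, cooccur):
        a_hat = self.normalize_adjacency(cooccur)
        h = self.gcn1(label_embeddings, a_hat)
        return self.gcn2(h, a_hat)  # refined semantic vector per category

# Usage: sentence vectors of prompted labels, e.g. "a person who is walking",
# would come from a sentence encoder; random tensors stand in here.
num_classes = 10
label_embeddings = torch.randn(num_classes, 512)
cooccur = torch.rand(num_classes, num_classes)  # directed co-occurrence stats
sim = SemanticInferenceModule()
semantic_vectors = sim(label_embeddings, cooccur)  # shape: (10, 512)
```

In SGDNet, such refined semantic vectors would then condition the visual branch; how that fusion is performed is specified in the full paper, not in this sketch.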