Abstract

The prevalence of surveillance video is critical to public safety, but existing Object Detection and Action Recognition models leave camera operators overwhelmed and unable to identify relevant events. In light of this, Grounded Situation Recognition (GSR) provides a practical way to recognize the events in a surveillance video. That is, GSR identifies the noun entities (e.g., humans) and their actions (e.g., driving), and provides grounding frames for the entities involved. Compared with Action Recognition and Object Detection, GSR is more in line with human cognitive habits, which makes its predictions easier for law enforcement agencies to understand. However, the crucial issue with most existing frameworks is that they neglect verb ambiguity, that is, verbs that are superficially similar but have distinct meanings (e.g., buying vs. giving). Many existing works propose a two-stage model, which first blindly predicts the verb and then uses this verb information to predict the semantic roles. These frameworks ignore the importance of noun information during verb prediction, which makes them susceptible to misidentification. To address this problem and better distinguish between ambiguous verbs, we propose HiFormer, a novel hierarchical transformer framework with direct and comprehensive consideration of similar verbs for each image, to more accurately identify the salient verb, the semantic roles, and the grounding frames. Compared with the state-of-the-art models in Grounded Situation Recognition (SituFormer and CoFormer), HiFormer shows an advantage of over 35% and 20% in Top-1 and Top-5 verb accuracy, respectively, as well as 13% in Top-1 noun accuracy.
