Human–object interaction (HOI) detection plays a pivotal role in scene understanding, enabling the identification, localization, and behavioral intention prediction of humans and objects within a visual scene. Conventional approaches, such as graph-based networks, have demonstrated effectiveness in capturing spatio-temporal interaction cues. However, relying solely on visual information limits these networks' ability to comprehensively grasp the intricate aspects of human interactions. To address this shortcoming, we propose a novel deep scene understanding graph network that harnesses extended text descriptions to represent and interpret interactions between humans and objects effectively. Text serves as a rich source of information, directly conveying the nature of interactions within a visual scene. Our model integrates text descriptions with visual features extracted from the entire video sequence, enabling it to capture the context and nuances of human–object interactions. This approach significantly enhances the model's ability to accurately predict ambiguous human actions and anticipate future interactions. Extensive experiments on two benchmark datasets, CAD-120 and Something-Else, demonstrate that our proposed method achieves remarkable improvements in the F1-Score for both HOI detection and anticipation tasks. This work paves the way for more accurate and comprehensive HOI understanding in various real-world scenarios, highlighting the potential of text-augmented graph networks for effective interaction modeling.
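The abstract does not specify implementation details, but the core idea of fusing a text embedding with visual node features in a graph network can be illustrated with a minimal sketch. The following is a hypothetical PyTorch layer, not the authors' published architecture: the class name, dimensions, fusion strategy (concatenation followed by a linear projection), and the assumption of a sentence-level text embedding (e.g., from a CLIP-style encoder) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextAugmentedGraphLayer(nn.Module):
    """Hypothetical message-passing layer that conditions visual node
    features (human/object nodes) on a text embedding of the scene
    description. All names and dimensions are illustrative assumptions."""

    def __init__(self, vis_dim=512, txt_dim=512, hid_dim=512):
        super().__init__()
        self.fuse = nn.Linear(vis_dim + txt_dim, hid_dim)  # text-visual fusion
        self.msg = nn.Linear(hid_dim, hid_dim)             # message transform
        self.upd = nn.GRUCell(hid_dim, hid_dim)            # node state update

    def forward(self, nodes, adj, text_emb):
        # nodes:    (N, vis_dim) visual features for N human/object nodes
        # adj:      (N, N) binary adjacency of the interaction graph
        # text_emb: (txt_dim,) embedding of the extended text description
        txt = text_emb.unsqueeze(0).expand(nodes.size(0), -1)
        h = torch.relu(self.fuse(torch.cat([nodes, txt], dim=-1)))
        # aggregate neighbor messages, normalized by node degree
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        m = (adj @ self.msg(h)) / deg
        return self.upd(m, h)  # updated node states for HOI prediction heads

# Toy usage: one human and three object nodes in a fully connected graph.
layer = TextAugmentedGraphLayer()
nodes = torch.randn(4, 512)  # per-node visual features
adj = torch.ones(4, 4)       # interaction-graph adjacency
text = torch.randn(512)      # assumed sentence-level text embedding
out = layer(nodes, adj, text)  # (4, 512) text-conditioned node states
```

In such a design, the same text embedding is broadcast to every node so that message passing operates on text-conditioned features; whether the actual model fuses text at the node, edge, or readout level is not stated in the abstract.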