Abstract

Video Visual Relation Detection (VidVRD) aims at detecting relation instances between two observed objects in the form of <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><tex-math notation="LaTeX">$&lt; $</tex-math></inline-formula> <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">subject-predicate-object</italic> <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><tex-math notation="LaTeX">$&gt;$</tex-math></inline-formula>. Unlike image visual relation detection, the introduction of the time dimension requires tackling both varied predicates and spatial-temporal locations, which makes the task challenging. To address these challenges, most existing works perform the task in two phases: first predicting relationships in segmented clips to capture the motions, and then associating them into relation instances with proper locations in the video. These works detect different relationships by collecting cues from multiple aspects, but treat the cues equally without distinction. Furthermore, due to dynamic scenes and the drifting problem in object tracking, the rigid spatial overlap used to determine association in previous works is insufficient and leads to missing associations. To address these problems, in this paper we propose a novel attention-guided relation detection approach for VidVRD. To model the distinction among different cues and strengthen their salient characteristics, we assign the cues attention weights for relationship prediction and association decision-making. In addition, to comprehensively measure whether to merge relationships, we put forward a customized network that takes both visual appearance and geometric location into account.
Extensive experimental results on the ImageNet-VidVRD and VidOR datasets demonstrate the effectiveness of our proposed approach, and abundant ablation studies verify that each component designed in the approach is essential.
