Abstract

Video visual relation detection (VidVRD) aims to detect visual relations among object instances, as well as the trajectories of the corresponding subjects and objects, in a video. Most existing works improve the accuracy of object tracking but neglect the other key challenge: reliably predicting the visual relations in videos, which is vital for downstream tasks. In this paper, we propose a dual attentional transformer network (VRD-DAT) for predicting the visual relations, also known as predicates, in multi-relation videos. Specifically, our network first models action predicates (Act-T) and spatial-location predicates (Spa-T) via two parallel visual transformer branches. Then, an attentional weighting module merges the two branches to obtain the final visual relation predictions. We conduct extensive experiments on two public datasets, ImageNet-VidVRD and VidOR, demonstrating that our model outperforms other state-of-the-art methods on the task of video visual relation prediction. Quantitative and qualitative results also show that with more accurate visual relations, the performance of the video visual relation detection task can be further boosted.
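For illustration only, the following is a minimal sketch of the dual-branch idea described above: two parallel transformer encoders (one for action predicates, one for spatial-location predicates) whose outputs are merged by a learned attentional weighting before predicate classification. It assumes PyTorch, and all module names, dimensions, and the fusion details are hypothetical; the paper's actual VRD-DAT implementation is not specified here.

```python
import torch
import torch.nn as nn


class DualBranchRelationHead(nn.Module):
    """Illustrative dual-branch predicate head (not the authors' code):
    parallel Act-T / Spa-T style encoders merged by attentional weighting."""

    def __init__(self, feat_dim=512, num_predicates=132, num_layers=2, num_heads=8):
        super().__init__()

        def make_branch():
            layer = nn.TransformerEncoderLayer(
                d_model=feat_dim, nhead=num_heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=num_layers)

        self.act_branch = make_branch()  # models action-related predicates
        self.spa_branch = make_branch()  # models spatial-location predicates
        # Attentional weighting: scores each branch per pair, softmax-normalized.
        self.branch_attn = nn.Linear(feat_dim, 1)
        self.classifier = nn.Linear(feat_dim, num_predicates)

    def forward(self, pair_feats):
        # pair_feats: (batch, seq_len, feat_dim) subject-object pair features
        act = self.act_branch(pair_feats).mean(dim=1)        # (batch, feat_dim)
        spa = self.spa_branch(pair_feats).mean(dim=1)        # (batch, feat_dim)
        branches = torch.stack([act, spa], dim=1)            # (batch, 2, feat_dim)
        weights = torch.softmax(self.branch_attn(branches), dim=1)  # (batch, 2, 1)
        merged = (weights * branches).sum(dim=1)             # weighted fusion
        return self.classifier(merged)                       # predicate logits


# Usage sketch: 4 subject-object pairs, each with 10 sampled frame features.
feats = torch.randn(4, 10, 512)
logits = DualBranchRelationHead()(feats)
print(logits.shape)  # torch.Size([4, 132])
```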
