Abstract

Video visual relation detection (VidVRD) aims to detect visual relations among object instances, as well as the trajectories of the corresponding subjects and objects, in a video. Most existing works improve the accuracy of object tracking but neglect the other key challenge: reliably predicting the visual relations in videos, which is vital for downstream tasks. In this paper, we propose a dual attentional transformer network (VRD-DAT) for predicting the visual relations, also known as predicates, in multi-relation videos. Specifically, our network first models action predicates (Act-T) and spatial-location predicates (Spa-T) via two parallel visual transformer branches. Then, an attentional weighting module merges the two branches to obtain the final visual relation predictions. We conduct extensive experiments on two public datasets, ImageNet-VidVRD and VidOR, demonstrating that our model outperforms other state-of-the-art methods on the task of video visual relation prediction. Quantitative and qualitative results also show that with more accurate visual relations, the performance of the video visual relation detection task can be further boosted.
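For illustration only, the following is a minimal sketch of the dual-branch idea described above: two parallel transformer encoders (one for action predicates, one for spatial-location predicates) whose outputs are merged by a learned attentional weighting before predicate classification. It assumes PyTorch, and all module names, dimensions, and the fusion details are hypothetical; the paper's actual VRD-DAT implementation is not specified here.

```python
import torch
import torch.nn as nn


class DualBranchRelationHead(nn.Module):
    """Illustrative dual-branch predicate head (not the authors' code):
    parallel Act-T / Spa-T style encoders merged by attentional weighting."""

    def __init__(self, feat_dim=512, num_predicates=132, num_layers=2, num_heads=8):
        super().__init__()

        def make_branch():
            layer = nn.TransformerEncoderLayer(
                d_model=feat_dim, nhead=num_heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=num_layers)

        self.act_branch = make_branch()  # models action-related predicates
        self.spa_branch = make_branch()  # models spatial-location predicates
        # Attentional weighting: scores each branch per pair, softmax-normalized.
        self.branch_attn = nn.Linear(feat_dim, 1)
        self.classifier = nn.Linear(feat_dim, num_predicates)

    def forward(self, pair_feats):
        # pair_feats: (batch, seq_len, feat_dim) subject-object pair features
        act = self.act_branch(pair_feats).mean(dim=1)        # (batch, feat_dim)
        spa = self.spa_branch(pair_feats).mean(dim=1)        # (batch, feat_dim)
        branches = torch.stack([act, spa], dim=1)            # (batch, 2, feat_dim)
        weights = torch.softmax(self.branch_attn(branches), dim=1)  # (batch, 2, 1)
        merged = (weights * branches).sum(dim=1)             # weighted fusion
        return self.classifier(merged)                       # predicate logits


# Usage sketch: 4 subject-object pairs, each with 10 sampled frame features.
feats = torch.randn(4, 10, 512)
logits = DualBranchRelationHead()(feats)
print(logits.shape)  # torch.Size([4, 132])
```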
