Abstract

Video visual relation detection, which aims to detect the visual relations between objects in the form of relation triplets <subject, predicate, object> (e.g., "person-ride-bike", "dog-toward-car"), is a significant and fundamental task in computer vision. However, most existing work on visual relation detection focuses on static images; modeling the non-static relationships in videos has drawn little attention due to the lack of large-scale video dataset support. In this work, we propose a video dataset named Video Predicate Detection and Reasoning (VidPDR) for dynamic video visual relation detection, which consists of 1,000 videos with dense, manually labeled dynamic annotations covering 21 object classes and 37 predicate classes. Moreover, we propose a novel spatio-temporal feature extraction framework based on 3D Convolutional Neural Networks (ST3DCNN), which includes three modules: 1) object trajectory detection, 2) short-term relation prediction, and 3) greedy relational association. We conducted experiments on public datasets and our own dataset (VidPDR). The results demonstrate that our proposed method achieves substantial improvements over state-of-the-art baselines.
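To make the role of the third module concrete, the sketch below illustrates one common way a greedy relational association step can merge short-term relation predictions from overlapping video segments into long-term relation instances. This is a minimal illustrative sketch, not the exact procedure of ST3DCNN; the data structures (SegmentRelation, RelationInstance) and the merge rule are assumptions introduced here for illustration.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    # Hypothetical container for one short-term prediction on a video segment.
    @dataclass
    class SegmentRelation:
        triplet: Tuple[str, str, str]   # (subject, predicate, object), e.g. ("person", "ride", "bike")
        start: int                      # first frame of the segment
        end: int                        # last frame of the segment
        score: float                    # confidence of the short-term predictor

    @dataclass
    class RelationInstance:
        triplet: Tuple[str, str, str]
        start: int
        end: int
        scores: List[float] = field(default_factory=list)

    def greedy_association(predictions: List[SegmentRelation]) -> List[RelationInstance]:
        """Greedily merge overlapping segment-level predictions that share the same triplet."""
        # Process higher-confidence predictions first so they seed the long-term instances.
        predictions = sorted(predictions, key=lambda p: p.score, reverse=True)
        instances: List[RelationInstance] = []
        for pred in predictions:
            merged = False
            for inst in instances:
                same_triplet = inst.triplet == pred.triplet
                overlaps = pred.start <= inst.end and pred.end >= inst.start
                if same_triplet and overlaps:
                    # Extend the existing instance to cover the new segment.
                    inst.start = min(inst.start, pred.start)
                    inst.end = max(inst.end, pred.end)
                    inst.scores.append(pred.score)
                    merged = True
                    break
            if not merged:
                instances.append(RelationInstance(pred.triplet, pred.start, pred.end, [pred.score]))
        return instances

Sorting predictions by confidence before merging is the "greedy" part: each segment-level prediction is attached to the first compatible long-term instance it overlaps, without revisiting earlier decisions.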

Highlights

  • As a bridge between dynamic object detection and textual information in video information retrieval, traditional video visual relation detection aims to explore the non-static interaction knowledge among co-occurring objects in a video

  • To evaluate the effectiveness of the features extracted by the 3D Convolutional Neural Network, we compared our method with several video visual relation detection models

  • ST3DCNN achieved the best performance on most evaluation metrics, demonstrating that features extracted by a 3D Convolutional Neural Network outperform manually encoded features in video visual relation detection

Summary

INTRODUCTION

As a bridge between dynamic object detection and textual information in video information retrieval, traditional video visual relation detection aims to explore the non-static interaction knowledge among co-occurring objects in a video. Prior work modeled the spatio-temporal relations between objects in a given video with a fully connected graph structure and an energy function with trainable hyperparameters, and achieved strong performance. Features encoded by the last fully connected layer of such a network can achieve satisfactory performance in image detection, but these image-based representations lack the temporal information between adjacent frames and are therefore not suitable for video-based problems. The theoretical support for our approach is that simple visual relationships can be extracted quickly and effectively from short videos, and complex visual relationships can always be inferred from simple ones. On this basis, we propose an object trajectory detection module and a relation prediction module.
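To illustrate the contrast with frame-wise image features, the following is a minimal sketch (assuming PyTorch and an arbitrary small architecture, not the exact ST3DCNN configuration) of how a 3D convolutional backbone pools features over a clip of frames so that temporal information between adjacent frames is preserved:

    import torch
    import torch.nn as nn

    class SpatioTemporalEncoder(nn.Module):
        """Toy 3D-CNN backbone: input is a clip of shape (batch, channels, frames, height, width)."""
        def __init__(self, feat_dim: int = 256):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv3d(3, 64, kernel_size=3, padding=1),    # convolves over time as well as space
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),           # keep temporal resolution, downsample space
                nn.Conv3d(64, feat_dim, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool3d(1),                       # global spatio-temporal pooling
            )

        def forward(self, clip: torch.Tensor) -> torch.Tensor:
            return self.backbone(clip).flatten(1)              # (batch, feat_dim) clip-level feature

    # Usage: an 8-frame RGB clip, e.g. cropped around an object trajectory.
    encoder = SpatioTemporalEncoder()
    clip = torch.randn(2, 3, 8, 112, 112)                      # (batch, C, T, H, W)
    features = encoder(clip)                                   # torch.Size([2, 256])

Unlike features taken from the last fully connected layer of an image classifier applied frame by frame, the 3D kernels here convolve jointly over space and time, so motion between adjacent frames contributes to the clip-level representation.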

OBJECT TRAJECTORY DETECTION
Method
RELATION PREDICTION
EXPERIMENT
DATASETS
EVALUATION METRICS
COMPARED METHODS
RESULT
CONCLUSION