Abstract
Video visual relation detection, which aims to detect the visual relations between objects in the form of relation triplets <subject, predicate, object> (e.g., "person-ride-bike", "dog-toward-car", etc.), is a significant and fundamental task in computer vision. However, most existing work on visual relation detection focuses on static images; modeling the non-static relationships in videos has drawn little attention due to the lack of large-scale video dataset support. In this work, we propose a video dataset named Video Predicate Detection and Reasoning (VidPDR) for dynamic video visual relation detection, which consists of 1,000 videos with dense, manually labeled dynamic annotations over 21 object classes and 37 predicate classes. Moreover, we propose a novel spatio-temporal feature extraction framework with 3D Convolutional Neural Networks (ST3DCNN), which includes three modules: 1) object trajectory detection, 2) short-term relation prediction, and 3) greedy relational association. We conducted experiments on public datasets and our own dataset (VidPDR). Results demonstrate that our proposed method substantially improves over state-of-the-art baselines.
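To make the third module concrete, the following minimal Python sketch shows one way greedy relational association could merge per-segment relation predictions into video-level relation instances; the data layout, function name, and merging rule are illustrative assumptions, not the authors' actual implementation.

# Minimal sketch of greedy relational association (illustrative, not the
# paper's implementation). Each short-term prediction is assumed to be
# (triplet, score, start_frame, end_frame).
def greedy_associate(short_term_predictions, max_gap=0):
    # Process the highest-scoring predictions first (greedy order).
    ordered = sorted(short_term_predictions, key=lambda p: p[1], reverse=True)
    merged = []  # video-level relation instances
    for triplet, score, start, end in ordered:
        for inst in merged:
            same_triplet = inst["triplet"] == triplet
            overlaps = (start <= inst["end"] + max_gap
                        and end >= inst["start"] - max_gap)
            if same_triplet and overlaps:
                # Extend the matched instance and keep its best score.
                inst["start"] = min(inst["start"], start)
                inst["end"] = max(inst["end"], end)
                inst["score"] = max(inst["score"], score)
                break
        else:
            merged.append({"triplet": triplet, "score": score,
                           "start": start, "end": end})
    return merged

# Example: two overlapping "person-ride-bike" segments merge into one instance.
predictions = [(("person", "ride", "bike"), 0.9, 0, 30),
               (("person", "ride", "bike"), 0.8, 15, 45),
               (("dog", "toward", "car"), 0.7, 0, 30)]
print(greedy_associate(predictions))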
Highlights
As a bridge between dynamic object detection and textual information in video information retrieval, traditional video visual relation detection aims to explore the non-static interaction knowledge among co-occurrent objects in the video.
To evaluate the effectiveness of the features extracted by the 3D Convolutional Neural Network, we compared our method with several video visual relation detection models.
ST3DCNN achieved the best performance on most evaluation metrics, which shows that features extracted by a 3D Convolutional Neural Network outperform some manually encoded features in video visual relation detection.
Summary
As a bridge between dynamic object detection and textual information in video information retrieval, traditional video visual relation detection aims to explore the non-static interaction knowledge among co-occurrent objects in the video. This method utilized a fully connected graph structure and an energy function with trainable hyperparameters to model the spatio-temporal structure of object relations in a given video, and it achieved great performance. These features, which are encoded through the last fully connected layer of the network, can achieve satisfactory performance in image detection. However, such image-based representations lack the temporal information between adjacent frames in videos and are therefore not suitable for video-based problems. The theoretical support is that simple visual relationships can be extracted quickly and effectively from short videos, and complex visual relationships can always be inferred from the simple ones. On this basis, we propose an object trajectory detection module and a relation prediction module.
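As a rough illustration of why 3D convolutional features differ from image-based ones, the sketch below (assuming PyTorch; the layer sizes are illustrative and not the paper's ST3DCNN architecture) extracts a single spatio-temporal feature from a short clip by convolving jointly over time, height, and width.

# Illustrative spatio-temporal feature extractor with 3D convolutions
# (assumed PyTorch layers; not the paper's ST3DCNN architecture).
import torch
import torch.nn as nn

class Simple3DFeatureExtractor(nn.Module):
    def __init__(self, feature_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            # Conv3d convolves over (time, height, width), so the resulting
            # feature also encodes motion between adjacent frames,
            # unlike a 2D image-based CNN feature.
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(64, feature_dim)

    def forward(self, clip):
        # clip: (batch, channels, frames, height, width)
        x = self.backbone(clip).flatten(1)
        return self.fc(x)

# Example: a 16-frame 112x112 clip yields one 256-d spatio-temporal feature.
clip = torch.randn(1, 3, 16, 112, 112)
print(Simple3DFeatureExtractor()(clip).shape)  # torch.Size([1, 256])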