Abstract

Video object codiscovery can leverage the weak semantic constraint implied by sentences that describe the video content. Our codiscovery method, like other object codetection techniques, does not employ any pretrained object models or detectors. Unlike most prior work, which focuses on codetecting large objects that are salient in both size and appearance, our method can discover small or medium-sized objects, as well as ones that may be occluded for part of the video. More importantly, our method can codiscover multiple object instances of different classes within a single video clip. Although the semantic information employed is usually simple and weak, it can greatly boost performance by constraining the hypothesized object locations. Experiments show promising results on three datasets: an average IoU score of 0.423 on a new dataset with 15 object classes, an average IoU score of 0.373 on a subset of CAD-120 with 5 object classes, and an average IoU score of 0.358 on a subset of MPII-Cooking with 7 object classes. Our result on this subset of MPII-Cooking improves upon the previous state-of-the-art methods by significant margins.
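The IoU (intersection-over-union) scores reported above measure overlap between a predicted bounding box and the ground-truth box. As a minimal sketch (the `(x1, y1, x2, y2)` corner convention is the common one, not stated in this abstract), IoU can be computed as:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

An IoU of 1.0 means the boxes coincide exactly; 0.0 means they do not overlap at all, so the averages reported above sit between chance-level localization and a perfect match.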

Highlights

  • We address the problem of video object codiscovery: naming and localizing novel objects in a set of videos, by placing bounding boxes around those objects, without any pretrained object detectors

  • We demonstrate that the constraint provided by sentence semantics can significantly aid video object discovery in many instances

  • Our approach takes advantage of datasets with the following properties: (I) It applies to video that depicts motion and changing spatial relations between objects. (II) The video is paired with temporally aligned sentences that describe that motion and those changing spatial relations

Introduction

We address the problem of video object codiscovery: naming and localizing novel objects in a set of videos, by placing bounding boxes around those objects, without any pretrained object detectors. This problem is essentially one of video object codetection: given a set of videos that contain instances of a common object class, locate those instances simultaneously. Our method can codetect small or medium-sized objects, as well as ones that are occluded for part of the video. It can codetect multiple object instances of different classes both within a single video clip and across a set of video clips. Our approach is a form of weakly supervised learning, where hidden structure is inferred from weakly labeled data.
