Abstract
Temporal Activity Localization via Language (TALL) in video is a recently proposed, challenging vision task. Tackling it requires a fine-grained understanding of the video content; however, this is overlooked by most existing works. In this paper, we propose a novel TALL method that builds a Hierarchical Visual-Textual Graph to model interactions between objects and words, as well as among objects, to jointly understand the video content and the language. We also design a convolutional network with a cross-channel communication mechanism to further encourage information passing between the visual and textual modalities. Finally, we propose a loss function that enforces alignment between the visual representation of the localized activity and the sentence representation, so that the model can predict more accurate temporal boundaries. We evaluate our method on two popular benchmark datasets, Charades-STA and ActivityNet Captions, and achieve state-of-the-art performance on both. Code is available at https://github.com/forwchen/HVTG .
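For illustration only, the snippet below is a minimal sketch of one possible visual-sentence alignment loss of the kind the abstract describes, assuming a margin-based cosine-similarity formulation over a batch; the function name `alignment_loss`, the inputs `segment_feat`/`sentence_feat`, and the margin value are hypothetical and the paper's exact loss may differ (see the released code for the authors' implementation).

```python
# A minimal sketch (not the authors' implementation) of an alignment loss
# between pooled visual features of localized activities and sentence
# embeddings, using cosine similarity with a hinge margin over the batch.
import torch
import torch.nn.functional as F

def alignment_loss(segment_feat: torch.Tensor,
                   sentence_feat: torch.Tensor,
                   margin: float = 0.2) -> torch.Tensor:
    """segment_feat:  (B, D) visual features of localized activities.
       sentence_feat: (B, D) sentence-level textual features.
       Matched pairs share the same batch index; other pairs are negatives."""
    v = F.normalize(segment_feat, dim=-1)
    s = F.normalize(sentence_feat, dim=-1)
    sim = v @ s.t()                    # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)      # similarity of each matched pair
    # Penalize negative pairs that come within `margin` of the matched pair.
    cost = (margin + sim - pos).clamp(min=0)
    cost.fill_diagonal_(0)             # ignore the matched pairs themselves
    return cost.mean()
```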