Abstract

In recent years, the growing volume of video resources has created demand for fine-grained retrieval of video moments, such as highlight moments in sports events and specific content for video re-creation. In this context, research on cross-modal video segment retrieval, which aims to return the video moment that matches an input query text, has gradually emerged. Existing solutions focus primarily on global or local feature representations of the query text and video moments. However, they ignore the semantic relations contained in the query text and video moments that need to be matched. For example, given the query text “a person is playing basketball”, existing retrieval systems may incorrectly return a video moment of “a person holding a basketball” because they do not consider the semantic relation “a person playing basketball”. Therefore, this paper proposes a cross-modal relationship alignment framework, which we refer to as CrossGraphAlign, for cross-modal video moment retrieval. The proposed framework constructs a textual relationship graph and a visual relationship graph to model the semantic relations in the query text and in video segments, respectively. It then evaluates the similarity between textual and visual relations through cross-modally aligned graph convolutional networks, yielding a more accurate video moment retrieval system. Experimental results on the publicly available cross-modal video retrieval datasets TACoS and ActivityNet Captions demonstrate that the proposed method effectively exploits semantic relations to improve recall in cross-modal video moment retrieval.
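The abstract does not specify how the graph convolutional networks or the alignment are designed, so the following is only a minimal sketch of the general idea of scoring a textual relation graph against a visual relation graph. It assumes a standard symmetric-normalized graph convolution, ReLU(D^-1/2 (A + I) D^-1/2 H W), mean pooling of node embeddings, and cosine similarity as the matching score. The function names (`gcn_layer`, `relation_graph_similarity`), the toy graphs and features, and the single weight matrix shared by both modalities are hypothetical illustrations, not CrossGraphAlign's actual architecture.

```python
import math
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product (rows of a times columns of b)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalized_adjacency(adj: Matrix) -> Matrix:
    """Add self-loops and apply symmetric normalization D^-1/2 (A + I) D^-1/2."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # >= 1 thanks to the self-loop (assumes non-negative edges)
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj: Matrix, feats: Matrix, weight: Matrix) -> Matrix:
    """One graph convolution: ReLU(A_norm @ H @ W)."""
    out = matmul(matmul(normalized_adjacency(adj), feats), weight)
    return [[max(0.0, x) for x in row] for row in out]


def mean_pool(h: Matrix) -> Vector:
    """Average node embeddings into a single graph-level vector."""
    return [sum(row[j] for row in h) / len(h) for j in range(len(h[0]))]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def relation_graph_similarity(text_adj: Matrix, text_feats: Matrix,
                              video_adj: Matrix, video_feats: Matrix,
                              weight: Matrix) -> float:
    """Hypothetical score: cosine similarity of pooled GCN embeddings of both graphs.

    Assumes node features of both graphs are already projected to the same
    dimension and that one weight matrix is shared across modalities.
    """
    t = mean_pool(gcn_layer(text_adj, text_feats, weight))
    v = mean_pool(gcn_layer(video_adj, video_feats, weight))
    return cosine(t, v)


if __name__ == "__main__":
    # Toy text graph: person -- playing -- basketball (2-d features are made up).
    text_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    text_feats = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
    weight = [[1.0, 0.0], [0.0, 1.0]]

    # Two hypothetical candidate moments, each already encoded as a visual relation graph.
    candidates = {
        "moment_a": ([[0, 1, 0], [1, 0, 1], [0, 1, 0]], [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]),
        "moment_b": ([[0, 0], [0, 0]], [[1.0, 0.0], [1.0, 0.0]]),
    }
    scores = {
        name: relation_graph_similarity(text_adj, text_feats, adj, feats, weight)
        for name, (adj, feats) in candidates.items()
    }
    best = max(scores, key=scores.get)
    print(scores, "best:", best)
```

Ranking candidate moments by a single graph-level score is just the simplest way to use relation graphs for retrieval; an actual cross-modal alignment could instead match individual nodes or edges (for example, the “playing” relation) across the two graphs.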
