Temporal moment localization in videos using natural language (TMLVNL) is a challenging problem in computer vision. TMLVNL aims to determine the moment in a lengthy, untrimmed video that corresponds to an input natural-language query. Beyond its inherent complexity, TMLVNL faces several additional difficulties that can degrade performance, such as rare object positions, occlusions, camera focus issues, and motion blur. To address these issues, this study proposes a novel solution called the Temporal Ziggurat Transformer Network. First, we developed a single novel method rather than relying on combinations of existing approaches. Second, we handled complicated scenarios, such as unusual object postures, object occlusions, camera focus issues, and motion blur, by incorporating specialized blocks into our customized Ziggurat Transformer (ZT) to thoroughly explore visual features. Third, to relate visual features to individual query words, we proposed a query word-specific transformer (QWST) as a submodule of the ZT; the QWST integrates query word feature representations with the extensively explored visual features. Fourth, our STDF module combines sentence-level query representations with word-level query attributes to extract semantic context from video chunks. Finally, the moment localization module identifies the start and end boundaries of the target moment. Comprehensive experiments on the Charades-STA, TACoS, and ActivityNet Captions datasets demonstrate that our approach outperforms existing state-of-the-art methods.