Joint Video and Text Parsing for Understanding Events and Answering Queries

Kewei Tu, Meng Meng, Song-Chun Zhu, Mun Wai Lee, Tae Eun Choe

doi:10.1109/mmul.2014.29

Abstract

This article proposes a multimedia analysis framework that processes video and text jointly to understand events and answer user queries. The framework produces a parse graph that represents the compositional structures of spatial information (objects and scenes), temporal information (actions and events), and causal information (causalities between events and fluents) in the video and text. The knowledge representation of the framework is based on a spatial-temporal-causal AND-OR graph (S/T/C-AOG), which jointly models possible hierarchical compositions of objects, scenes, and events as well as their interactions and mutual contexts, and specifies the prior probability distribution of the parse graphs. The authors present a probabilistic generative model for joint parsing that captures the relations between the input video/text, their corresponding parse graphs, and the joint parse graph. Based on the probabilistic model, the authors propose a joint parsing system consisting of three modules: video parsing, text parsing, and joint inference. Video parsing and text parsing produce two parse graphs from the input video and text, respectively. The joint inference module produces a joint parse graph by performing matching, deduction, and revision on the video and text parse graphs. The proposed framework has the following objectives: to provide deep semantic parsing of video and text that goes beyond the traditional bag-of-words approaches; to perform parsing and reasoning across the spatial, temporal, and causal dimensions based on the joint S/T/C-AOG representation; and to show that deep joint parsing facilitates subsequent applications such as generating narrative text descriptions and answering queries in the form of who, what, when, where, and why. The authors empirically evaluated the system by comparing its output against ground truth and by measuring query-answering accuracy, and obtained satisfactory results.
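
To make the AND-OR graph idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy grammar in which an AND node composes all of its children, an OR node selects one alternative with a branch probability, and the prior of a parse graph is the product of the branch probabilities chosen along the way. The node names (`event`, `agent`, `action`), the probabilities, and the `Node` and `enumerate_parses` helpers are all hypothetical, and the sketch omits the spatial, temporal, and causal relations and the video/text likelihood terms of the full model.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Node:
    name: str
    kind: str  # "and", "or", or "leaf"
    children: list = field(default_factory=list)
    probs: list = field(default_factory=list)  # branch probabilities (OR nodes only)


def enumerate_parses(node):
    """Yield (parse_tree, prior) for every parse graph rooted at node."""
    if node.kind == "leaf":
        yield node.name, 1.0
    elif node.kind == "or":
        # An OR node selects exactly one alternative, weighted by its probability.
        for child, p in zip(node.children, node.probs):
            for tree, q in enumerate_parses(child):
                yield (node.name, [tree]), p * q
    else:
        # An AND node composes all of its children; priors multiply.
        options = [list(enumerate_parses(c)) for c in node.children]
        for combo in product(*options):
            prior = 1.0
            for _, q in combo:
                prior *= q
            yield (node.name, [t for t, _ in combo]), prior


# Hypothetical toy grammar: an event is an agent performing an action.
agent = Node("agent", "or", [Node("person", "leaf"), Node("vehicle", "leaf")], [0.7, 0.3])
action = Node("action", "or", [Node("enter", "leaf"), Node("leave", "leaf")], [0.6, 0.4])
event = Node("event", "and", [agent, action])

parses = list(enumerate_parses(event))
best_tree, best_prior = max(parses, key=lambda tp: tp[1])
print(best_tree, best_prior)
```

Exhaustive enumeration is only workable for a grammar this small; it is used here so that the prior over all parse graphs can be inspected directly and checked to sum to one.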

Journal: IEEE MultiMedia
Publication Date: Apr 1, 2014
Citations: 199