Abstract

Fuse is a situated spoken language understanding system that uses visual context to steer the interpretation of speech. Given a visual scene and a spoken description, the system finds the object in the scene that best fits the meaning of the description. To solve this task, Fuse performs speech recognition and visually-grounded language understanding. Rather than treat these two problems separately, knowledge of the visual semantics of language and the specific contents of the visual scene are fused during speech processing. As a result, the system anticipates various ways a person might describe any object in the scene, and uses these predictions to bias the speech recognizer towards likely sequences of words. A dynamic visual attention mechanism is used to focus processing on likely objects within the scene as spoken utterances are processed. Visual attention and language prediction reinforce one another and converge on interpretations of incoming speech signals which are most consistent with visual context. In evaluations, the introduction of visual context into the speech recognition process results in significantly improved speech recognition and understanding accuracy. The underlying principles of this model may be applied to a wide range of speech understanding problems including mobile and assistive technologies in which contextual information can be sensed and semantically interpreted to bias processing.
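The abstract gives no implementation details, but the fusion idea it describes can be illustrated with a minimal, hypothetical sketch: each object in the scene predicts the words likely to describe it, those predictions bias the scoring of speech-recognizer hypotheses, and visual attention over objects is re-weighted word by word. The scene contents, probabilities, and function names below are illustrative assumptions, not the system's actual models.

```python
"""Illustrative sketch (not the authors' code) of visually biased hypothesis
scoring with a dynamic attention update, as described in the abstract.
All scene data and names here are hypothetical."""

# Hypothetical scene: each object maps descriptive words to probabilities
# (in Fuse these would come from visually grounded word models).
scene = {
    "red_ball": {"red": 0.5, "ball": 0.4, "round": 0.1},
    "blue_cup": {"blue": 0.5, "cup": 0.4, "round": 0.1},
}

def score_hypothesis(words, scene, smoothing=1e-3):
    """Score a recognizer hypothesis against the scene while tracking attention.

    Returns the visually biased score and the final attention distribution
    over objects (i.e., which object the utterance most likely refers to)."""
    # Start with uniform visual attention over the objects in the scene.
    attention = {obj: 1.0 / len(scene) for obj in scene}
    score = 1.0
    for w in words:
        # Likelihood of this word under each object's description model.
        likelihoods = {obj: scene[obj].get(w, smoothing) for obj in scene}
        # Word probability under current attention: a mixture over objects.
        # This is the visually biased term that would steer the recognizer.
        p_word = sum(attention[obj] * likelihoods[obj] for obj in scene)
        score *= p_word
        # Re-weight attention toward objects that explain the word (Bayes rule).
        attention = {obj: attention[obj] * likelihoods[obj] / p_word
                     for obj in scene}
    return score, attention

if __name__ == "__main__":
    for hyp in [["red", "ball"], ["blue", "ball"]]:
        score, attn = score_hypothesis(hyp, scene)
        best = max(attn, key=attn.get)
        print(hyp, f"score={score:.4f}", f"referent={best}")
```

In this toy setting, the hypothesis "red ball" receives a higher visually biased score than "blue ball" and shifts attention to the red ball, mirroring how language prediction and visual attention reinforce one another in the model.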
