Abstract

We present a multimodal interface that learns words from natural interactions with users. In light of studies of human language development, the learning system is trained in an unsupervised mode in which users perform everyday tasks while providing natural language descriptions of their behaviors. The system collects acoustic signals in concert with user-centric multisensory information from nonspeech modalities, such as the user's perspective video, gaze positions, head directions, and hand movements. A multimodal learning algorithm uses these data to first spot words in continuous speech and then associate action verbs and object names with their perceptually grounded meanings. The central ideas are to make use of nonspeech contextual information to facilitate word spotting, and to utilize body movements as deictic references that associate temporally co-occurring data from different modalities and build hypothesized lexical items. From those items, an EM-based method is developed to select correct word-meaning pairs. Successful learning is demonstrated in experiments on three natural tasks: "unscrewing a jar," "stapling a letter," and "pouring water."
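The EM-based selection step described above can be pictured with a small alignment-style sketch. The Python snippet below is a minimal illustration, assuming an IBM-Model-1-style formulation over hypothetical episode data; the episode list, the word and meaning symbols, and the final selection rule are illustrative assumptions, not the paper's actual model or data.

```python
# Hedged sketch: EM-based word--meaning association in the style of IBM Model 1.
# All data and names below are hypothetical, for illustration only.
from collections import defaultdict

# Each episode pairs the words spotted in an utterance with the perceptually
# grounded meanings (action/object symbols) observed at the same time.
episodes = [
    (["unscrew", "the", "jar"], ["ACT_UNSCREW", "OBJ_JAR"]),
    (["pour", "the", "water"], ["ACT_POUR", "OBJ_WATER"]),
    (["staple", "the", "letter"], ["ACT_STAPLE", "OBJ_LETTER"]),
    (["pour", "water", "into", "the", "jar"], ["ACT_POUR", "OBJ_WATER", "OBJ_JAR"]),
]

words = {w for ws, _ in episodes for w in ws}
meanings = {m for _, ms in episodes for m in ms}

# Initialize the association probabilities p(word | meaning) uniformly.
p = {(w, m): 1.0 / len(words) for w in words for m in meanings}

for _ in range(20):  # EM iterations
    counts = defaultdict(float)
    totals = defaultdict(float)
    # E-step: distribute each word's probability mass over the meanings
    # that co-occur with it in the same episode.
    for ws, ms in episodes:
        for w in ws:
            norm = sum(p[(w, m)] for m in ms)
            for m in ms:
                frac = p[(w, m)] / norm
                counts[(w, m)] += frac
                totals[m] += frac
    # M-step: re-estimate p(word | meaning) from the expected counts.
    for (w, m) in p:
        p[(w, m)] = counts[(w, m)] / totals[m] if totals[m] > 0 else 0.0

# Select, for each grounded meaning, the word with the highest association
# probability -- a stand-in for the paper's word--meaning pair selection.
lexicon = {m: max(words, key=lambda w: p[(w, m)]) for m in meanings}
print(lexicon)
```

In this kind of formulation, frequent function words such as "the" are explained away across all meanings, so content words end up with the strongest associations to their grounded symbols.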

