Look, Listen and Learn — A Multimodal LSTM for Speaker Identification

Jimmy Ren, Li Xu, Wenxiu Sun, Yu-Wing Tai, Qiong Yan, Yongtao Hu, Chuan Wang

doi:10.1609/aaai.v30i1.10471

Abstract

Speaker identification refers to the task of localizing the face of the person whose identity matches the ongoing voice in a video. This task requires not only collective perception over both visual and auditory signals but also robustness to severe quality degradations and unconstrained content variations. In this paper, we describe a novel multimodal Long Short-Term Memory (LSTM) architecture that unifies the visual and auditory modalities from the beginning of each input sequence. The key idea is to extend the conventional LSTM by sharing weights not only across time steps but also across modalities. We show that modeling the temporal dependency across face and voice can significantly improve robustness to content quality degradations and variations. We also found that our multimodal LSTM is robust to distractors, namely the non-speaking identities. We applied our multimodal LSTM to The Big Bang Theory dataset and showed that our system outperforms state-of-the-art systems in speaker identification, with a lower false alarm rate and higher recognition accuracy.
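The weight-sharing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which weights are shared, so this sketch assumes that each modality keeps its own input projection while the recurrent weights and bias are shared. All names (`lstm_step`, `run_multimodal`, `init_params`, `W_x_face`, `W_x_voice`, `W_h`) and all dimensions are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_step(x, h, c, W_x, W_h, b):
    """One standard LSTM step; rows are stacked as [input, forget, output, candidate]."""
    n = len(h)
    z = [p + q + r for p, q, r in zip(matvec(W_x, x), matvec(W_h, h), b)]
    i, f, o, g = (z[k * n:(k + 1) * n] for k in range(4))
    c_new = [sigmoid(fj) * cj + sigmoid(ij) * math.tanh(gj)
             for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [sigmoid(oj) * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def init_params(face_dim, voice_dim, hidden, seed=0):
    """Random toy parameters (hypothetical sizes, small Gaussian initialisation)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_x_face": mat(4 * hidden, face_dim),    # modality-specific (assumption)
        "W_x_voice": mat(4 * hidden, voice_dim),  # modality-specific (assumption)
        "W_h": mat(4 * hidden, hidden),           # shared across modalities (assumption)
        "b": [0.0] * (4 * hidden),                # shared across modalities (assumption)
    }


def run_multimodal(face_seq, voice_seq, params):
    """Run both streams in lockstep; W_h and b are reused at every step and by both streams."""
    hidden = len(params["W_h"][0])
    h_f, c_f = [0.0] * hidden, [0.0] * hidden
    h_v, c_v = [0.0] * hidden, [0.0] * hidden
    for x_f, x_v in zip(face_seq, voice_seq):
        h_f, c_f = lstm_step(x_f, h_f, c_f, params["W_x_face"], params["W_h"], params["b"])
        h_v, c_v = lstm_step(x_v, h_v, c_v, params["W_x_voice"], params["W_h"], params["b"])
    # Concatenate the final hidden states as a joint face-voice representation.
    return h_f + h_v
```

Calling `run_multimodal` with a face sequence and a voice sequence of equal length returns a vector of length `2 * hidden`. In a full system a classifier would sit on top of this vector; that part is omitted here.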

Journal: Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence
Publication Date: Mar 5, 2016
Citations: 32

Similar Papers

Audition dominates vision in duration perception irrespective of salience, attention, and temporal discriminability.
Laura Ortega ... Marcia Grabowecky
Attention, Perception & Psychophysics | VOL. 76
08 May 2014

Bottleneck and Embedding Representation of Speech for DNN-based Language and Speaker Recognition
Alicia Lozano-Diez ... Javier Gonzalez-Dominguez
21 Nov 2018

Performance Evaluation of Speaker Identification in Language and Emotion Mismatch Conditions on Eastern and North Eastern Low Resource Languages of India
Joyanta Basu ... Swanirbhar Majumder
14 Nov 2021

Speaker identification using convolutional-long short-term memory neural networks
Serkan Tokgoz ... Issa M Panahi
The Journal of The Acoustical Society of America | VOL. 146
01 Oct 2019
