Abstract

Active Speaker Detection (ASD) in a noisy environment is a challenging problem. The audio modality alone does not help significantly when the noise level is high, so this work explores the use of bimodal information, vision and speech together, for ASD in a meeting-room scenario. The proposed method uses an audio-visual sensor array placed in a circular geometry. Time difference of arrival (TDOA) is used to find the pivot microphone corresponding to the active speaker, direction of arrival (DOA) estimated from the audio modality is used to localise the speaker, and TDOA features are used in bimodal decision-level fusion. In the video modality, the active speaker is localised by a lip activity detection method followed by stereo triangulation. A pivot camera is also detected and used to extract correlation-based video features. Multimodal active speaker detection is then performed by weighted fusion of the audio and video modality decisions; decision-level fusion is used here for improved ASD performance. Extensive experiments are performed on the M-MONC database and on video recorded using a lab testbed. The multimodal approach proposed in this work achieves a better detection rate than the existing clustering-based method.
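The pipeline summarised above rests on two computational steps: estimating TDOA between microphone pairs and fusing per-modality decisions with fixed weights. The sketch below illustrates one common way to realise these steps, using GCC-PHAT for TDOA and a linear weighted sum for decision-level fusion. The function names, the fusion weights, and the choice of GCC-PHAT are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate TDOA (seconds) between two microphone signals using
    GCC-PHAT generalized cross-correlation (an assumed estimator)."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-centre the circular correlation around zero lag
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

def fuse_decisions(p_audio, p_video, w_audio=0.4, w_video=0.6):
    """Weighted decision-level fusion of per-speaker scores from the
    audio and video modalities; weights here are arbitrary examples."""
    return w_audio * np.asarray(p_audio) + w_video * np.asarray(p_video)

# Usage sketch: two synthetic mic channels, one delayed by 10 samples,
# then fusion of hypothetical per-speaker scores for three speakers.
fs = 16000
x = np.random.randn(fs)
y = np.roll(x, 10)                  # simulated 10-sample propagation delay
print("TDOA estimate (s):", gcc_phat(y, x, fs))
scores = fuse_decisions([0.2, 0.7, 0.1], [0.1, 0.8, 0.1])
print("Active speaker index:", int(np.argmax(scores)))
```

In such a scheme, the microphone pair yielding the smallest delay to a candidate would mark the pivot mic, and the fused score's argmax would give the detected active speaker.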
