Abstract

In video conferencing environments, localizing the active talker is essential. However, conventional audio‐based algorithms often suffer from acoustic interference, and conventional vision‐based algorithms fail in the presence of visual interference. To deal with these problems, this paper proposes a robust omnidirectional audio‐visual talker localization algorithm that primarily exploits audio feature parameters and subordinately uses visual feature parameters. To achieve omnidirectional audio‐visual talker localization, paired omnidirectional microphones are employed as the audio sensor and an omnidirectional camera as the visual sensor. For robust talker localization, audio feature parameters are extracted using weighted cross‐power spectrum phase (CSP) analysis and CSP coefficient subtraction, and visual feature parameters are extracted using background subtraction and skin‐color detection. The talker is finally located by fusing weighted audio and visual feature parameters, with the fusion weight controlled automatically according to a reliability criterion on the audio feature parameters. Localization experiments in an actual room revealed that the proposed audio‐visual algorithm outperforms conventional localizers that use only audio or only visual feature parameters. [Work supported by MEXT of Japan.]
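The abstract names CSP analysis as the core audio cue and a reliability-controlled weighting for fusion. The paper's exact weighted-CSP formulation, subtraction step, and reliability criterion are not given in the abstract, so the following is only a minimal sketch of the standard CSP (GCC-PHAT) delay estimator plus a hypothetical peak-based reliability weighting; the threshold and weights are illustrative assumptions, not the authors' values.

```python
import numpy as np

def csp_phat(x1, x2, fs):
    """Cross-power spectrum phase (CSP / GCC-PHAT) delay estimate.

    Whitens the cross-spectrum of the two microphone signals, then
    inverse-transforms it; the result peaks at the inter-microphone
    time delay. Returns (delay in seconds, peak height). A positive
    delay means x2 lags x1.
    """
    n = len(x1) + len(x2)                 # zero-pad for linear correlation
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12        # phase transform (whitening)
    csp = np.fft.irfft(cross, n)
    max_shift = n // 2
    # reorder so index max_shift corresponds to zero lag
    csp = np.concatenate((csp[-max_shift:], csp[:max_shift + 1]))
    lag = int(np.argmax(csp)) - max_shift
    return lag / fs, float(np.max(csp))

def fuse_scores(audio_score, visual_score, csp_peak, peak_threshold=0.3):
    """Hypothetical fusion: trust audio when the CSP peak is sharp,
    lean on the visual cue otherwise (weights are illustrative)."""
    w = 0.9 if csp_peak >= peak_threshold else 0.3
    return w * audio_score + (1.0 - w) * visual_score
```

For example, feeding the estimator one noise signal and a copy delayed by a few samples recovers that delay; the peak height then serves as the reliability cue that steers the fusion weight.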
