Abstract

In speech perception, the visual information obtained by observing the speaker's face can account for improvements of up to 6 dB and 10 dB in the presence of wide-band Gaussian and speech-babble noise, respectively. Current hearing aids and other speech enhancement devices do not utilize the visual input from the speaker's face, limiting their functionality. To alleviate this shortcoming, audio-visual speech enhancement algorithms have been developed by including video information in the audio processing. We developed an audio-visual voice activity detector (VAD) that combines audio features such as long-term spectral divergence with video features such as spatio-temporal gradients of the mouth area. The contributions of the various features are learned by maximizing the mutual information between the audio and video features in an unsupervised fashion. Segmental SNR (SSNR) values were estimated to compare the benefits of audio-visual and conventional audio-only VADs. VAD outputs were utilized by an adaptive Wiener filter to estimate the noise spectrum and enhance speech corrupted by Gaussian and speech-babble noise. The SSNR improvements were similar in low-noise conditions, but the output using the audio-visual VAD was on average 8 dB better in high-noise conditions. This shows that video can provide complementary information when the audio is very noisy, leading to significant performance improvements.
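To make the enhancement pipeline concrete, the sketch below illustrates the general idea of a VAD-gated Wiener filter and a segmental SNR metric. It is not the paper's implementation: the frame length, hop size, smoothing factor, gain floor, and SSNR clamping values are illustrative assumptions, and the VAD decisions are taken as a given per-frame boolean input rather than computed from audio-visual features.

```python
# Illustrative sketch (not the paper's implementation) of:
#  (1) a Wiener filter whose noise spectrum is updated only on frames the VAD marks as non-speech
#  (2) a segmental SNR (SSNR) metric for comparing enhanced output against clean speech
import numpy as np

def wiener_enhance(noisy, vad_flags, frame_len=512, hop=256, alpha=0.9):
    """noisy: 1-D signal; vad_flags: one boolean per frame (True = speech detected)."""
    window = np.hanning(frame_len)
    noise_psd = np.full(frame_len // 2 + 1, 1e-8)   # running noise power spectrum estimate
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    n_frames = (len(noisy) - frame_len) // hop + 1
    for i in range(n_frames):
        start = i * hop
        frame = noisy[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        psd = np.abs(spec) ** 2
        if not vad_flags[i]:
            # Non-speech frame: recursively update the noise spectrum estimate.
            noise_psd = alpha * noise_psd + (1 - alpha) * psd
        # Spectral Wiener-type gain with a small floor to limit musical noise.
        gain = np.maximum(1.0 - noise_psd / np.maximum(psd, 1e-8), 0.05)
        enhanced = np.fft.irfft(gain * spec, n=frame_len)
        out[start:start + frame_len] += enhanced * window   # overlap-add
        norm[start:start + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)

def segmental_snr(clean, enhanced, frame_len=512, hop=256, floor=-10.0, ceil=35.0):
    """Frame-averaged SNR in dB, with the per-frame clamping commonly used for SSNR."""
    snrs = []
    n_frames = (min(len(clean), len(enhanced)) - frame_len) // hop + 1
    for i in range(n_frames):
        s = clean[i * hop:i * hop + frame_len]
        e = enhanced[i * hop:i * hop + frame_len]
        noise_energy = np.sum((s - e) ** 2) + 1e-12
        snr = 10 * np.log10(np.sum(s ** 2) / noise_energy + 1e-12)
        snrs.append(np.clip(snr, floor, ceil))
    return float(np.mean(snrs))
```

In this sketch the quality of the VAD directly controls the noise estimate: if speech frames are misclassified as noise (as tends to happen with audio-only VADs at low SNR), speech energy leaks into the noise spectrum and the Wiener gain suppresses speech, which is consistent with the abstract's observation that a more reliable audio-visual VAD yields larger SSNR gains in high-noise conditions.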
