Abstract

This paper estimates the ideal ratio mask (IRM) from speech cochleagrams and visual cues using an Audio-Visual Multichannel Convolutional Neural Network (AVMCNN) to enhance noisy speech. Several recent studies have shown that using visual data as an additional input alongside audio is more effective at suppressing the acoustic noise present in a speech signal. This work proposes a novel CNN-based audio-visual IRM estimation model in which the dynamics of both the audio and visual features are extracted by a multichannel CNN and contextually combined for speech enhancement. The enhanced speech produced by the proposed model is evaluated in terms of speech quality and intelligibility. The results show that the proposed audio-visual mask estimation model outperforms audio-only, visual-only, and existing audio-visual mask estimation models, demonstrating the effectiveness of the AVMCNN in combining the dynamics of audio features with visual speech features for speech enhancement.
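The abstract does not give architectural details, so the following is only a minimal PyTorch sketch of the general idea it describes: the standard IRM as a training target, and a two-branch convolutional network that fuses audio (cochleagram) and visual features channel-wise before predicting a mask. All names and hyperparameters (AVMaskEstimator, n_channels, the assumption that visual features are tiled to the audio time-frequency grid) are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """Standard IRM per time-frequency unit: (S^2 / (S^2 + N^2))^beta.

    beta = 0.5 is a common choice; the paper's exact target is not
    specified in the abstract.
    """
    return (speech_power / (speech_power + noise_power + 1e-8)) ** beta

class AVMaskEstimator(nn.Module):
    """Illustrative two-branch (audio/visual) CNN that predicts an IRM."""
    def __init__(self, n_channels=64):
        super().__init__()
        # Audio branch: 2-D convolutions over the cochleagram (freq x time).
        self.audio_branch = nn.Sequential(
            nn.Conv2d(1, n_channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(n_channels, n_channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Visual branch: convolutions over lip-region feature maps.
        self.visual_branch = nn.Sequential(
            nn.Conv2d(1, n_channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Fusion: concatenate branch outputs along channels, project to one
        # mask channel, and squash to [0, 1] with a sigmoid.
        self.fusion = nn.Sequential(
            nn.Conv2d(2 * n_channels, n_channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(n_channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, cochleagram, visual_feats):
        # cochleagram:  (batch, 1, freq, time)
        # visual_feats: (batch, 1, freq, time), assumed upsampled/tiled
        # to match the audio time-frequency grid before fusion.
        a = self.audio_branch(cochleagram)
        v = self.visual_branch(visual_feats)
        return self.fusion(torch.cat([a, v], dim=1))
```

In use, the predicted mask would be applied point-wise to the noisy cochleagram and the masked representation resynthesized to obtain the enhanced waveform; the network would be trained to regress the IRM computed from parallel clean and noisy signals.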
