Abstract
Integrating visual features has proven effective for deep learning-based speech quality enhancement, particularly in highly noisy environments. However, such models may suffer from redundant information, which degrades performance when the signal-to-noise ratio (SNR) is relatively high, while real-world noisy scenarios typically exhibit widely varying noise levels. To address these issues, this study proposes a novel Audio-Visual Speech Enhancement (AVSE) system that incorporates audio and visual voice-activity information and employs an attention mechanism driven by an SNR estimation module, dynamically adjusting the weights of the audio and visual endpoint cues at inference time according to the environmental noise level. This dynamic modulation makes the model an Endpoint-Aware Network (EANet). By jointly leveraging noisy acoustic cues and noise-robust visual cues, the model prioritizes the desired voice periods and thereby improves speech intelligibility. Experiments on benchmark datasets indicate that EANet effectively integrates audio and visual information and outperforms the audio-only model, especially across wide SNR ranges. This work thus improves the effectiveness of multimodal fusion for AVSE, enhancing both the quality and the intelligibility of the speech.
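The following is a minimal sketch, not the authors' implementation, of the SNR-conditioned weighting idea described above: an estimated SNR drives attention-style weights that balance audio-derived and visual-derived voice-activity (endpoint) cues. The module name, feature dimensions, and the softmax gating form are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SNRGatedFusion(nn.Module):
    """Hypothetical fusion block: weights audio vs. visual endpoint cues by estimated SNR."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Assumed SNR estimator: maps noisy-audio features to a per-frame scalar.
        self.snr_estimator = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )
        # Maps the estimated SNR to two weights (audio vs. visual endpoint cues).
        self.gate = nn.Linear(1, 2)

    def forward(self, audio_vad: torch.Tensor, visual_vad: torch.Tensor,
                noisy_feats: torch.Tensor) -> torch.Tensor:
        # audio_vad, visual_vad, noisy_feats: (batch, time, feat_dim)
        snr_est = self.snr_estimator(noisy_feats)       # (B, T, 1)
        w = torch.softmax(self.gate(snr_est), dim=-1)   # (B, T, 2), weights sum to 1
        w_audio, w_visual = w[..., :1], w[..., 1:]
        # Intuition from the abstract: low estimated SNR should favor the
        # noise-robust visual cue; high SNR should favor the reliable audio cue.
        return w_audio * audio_vad + w_visual * visual_vad


if __name__ == "__main__":
    fusion = SNRGatedFusion(feat_dim=256)
    a, v, x = (torch.randn(2, 100, 256) for _ in range(3))
    print(fusion(a, v, x).shape)  # torch.Size([2, 100, 256])
```

In practice, the learned mapping from SNR to weights replaces a fixed fusion ratio, which is what allows the model to avoid the redundancy penalty at high SNR while still exploiting visual cues at low SNR.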