Abstract

Traditional speech enhancement relies on audio signals alone to recover clean speech by removing background noise. Audio-visual speech enhancement exploits additional visual information to improve the intelligibility and perceptual quality of noisy speech, and can be applied to scenarios such as video conferencing. However, raw visual cues contain redundant information, which increases latency and makes the model prone to overfitting. Moreover, existing methods pay little attention to extracting critical visual features and to fusing multimodal audio-visual features effectively. In this paper, we propose an efficient multimodal feature fusion network (MFF-Net) for audio-visual speech enhancement. Specifically, we are the first to use fine-grained 3D lip landmarks as the visual representation, which provides refined 3D visual information while protecting speaker privacy. We design a multi-scale enhancement module (MEM) with a multi-branch structure that extracts critical multi-scale features using dilated convolutions and attention mechanisms. In addition, we propose an audio-visual fusion module (AFM) that adopts a mutually reinforcing strategy to fuse visual and audio features effectively. To verify the effectiveness of our method, we construct two speech enhancement datasets based on lip landmarks and conduct extensive experiments. The results show that the proposed MFF-Net achieves competitive performance compared with existing methods.
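
The abstract does not give implementation details of the MEM, so the following is only a minimal, hypothetical PyTorch sketch of a multi-branch block that combines dilated convolutions with a channel-attention gate, in the spirit of the module described above. The branch count, dilation rates, kernel size, and attention design are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical multi-scale block: parallel dilated-convolution branches plus a
# squeeze-and-excitation style channel-attention gate and a residual connection.
# All hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn


class MultiScaleBlock(nn.Module):
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        # One 1-D convolutional branch per dilation rate; padding = dilation
        # keeps the temporal length unchanged so branch outputs can be summed.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3,
                          padding=d, dilation=d),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            )
            for d in dilations
        )
        # Channel attention over the fused multi-scale features.
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(channels, channels // 4, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        fused = sum(branch(x) for branch in self.branches)
        return x + fused * self.attention(fused)  # residual output


if __name__ == "__main__":
    block = MultiScaleBlock(channels=64)
    print(block(torch.randn(2, 64, 100)).shape)  # torch.Size([2, 64, 100])
```

The same multi-branch pattern could, under these assumptions, operate on either the audio feature stream or the lip-landmark feature stream before fusion; how the paper actually applies it is specified only in the full text.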
