Abstract

Speech enhancement (SE) aims to improve speech quality and intelligibility by removing acoustic corruption. While deep-learning-based SE models using audio-only (AO) input have achieved successful enhancement for non-speech background noise, audio-visual SE (AVSE) models have been studied to effectively remove competing speech as well. In this paper, we propose an AVSE model that estimates spectral masks for the real and imaginary components of the spectrogram so that the phase is also enhanced. The model is based on the U-net structure, which allows the decoder to restore information by leveraging intermediate representations from the encoding process and mitigates the vanishing-gradient problem by providing direct paths to the encoder’s layers. In the proposed model, we adopt early fusion, processing audio and video with a single encoder so that features of the fused information that are easy to decode for SE are generated while the numbers of encoder and decoder parameters are reduced. Moreover, we extend the U-net with the proposed Recurrent-Neural-Network (RNN) attention (RA) blocks and Res paths (RPs) in the skip connections and the encoder. The RPs are introduced to resolve the semantic gap between low-level and high-level features, while the RA blocks are designed to find efficient representations that capture the inherent frequency-specific characteristics of speech as time-series data. Experimental results on the LRS2-BBC dataset demonstrated that AV models successfully removed competing speech and that our proposed model efficiently estimated complex spectral masks for SE. Compared with a conventional U-net model with a comparable number of parameters, our proposed model achieved relative improvements of about 7.23%, 5.21%, and 22.9% in the signal-to-distortion ratio, perceptual evaluation of speech quality, and FLOPS, respectively.
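To make the complex-mask formulation concrete, the snippet below is a minimal sketch (not the authors' code) of how a pair of estimated masks for the real and imaginary components can be applied to a noisy complex spectrogram. It assumes the complex-multiplication rule of the widely used complex ratio mask; the paper's exact masking rule may differ, and the tensor shapes and function name are illustrative only.

```python
# Minimal sketch: applying a two-channel (real/imaginary) spectral mask
# to a noisy complex spectrogram via complex multiplication (an assumption,
# not necessarily the paper's exact masking rule).
import torch

def apply_complex_mask(noisy_real, noisy_imag, mask_real, mask_imag):
    """All tensors are assumed to have shape (batch, freq, time)."""
    enhanced_real = noisy_real * mask_real - noisy_imag * mask_imag
    enhanced_imag = noisy_real * mask_imag + noisy_imag * mask_real
    return enhanced_real, enhanced_imag

# Usage idea: the enhanced waveform is then recovered with an inverse STFT, e.g.
#   noisy = torch.stft(waveform, n_fft=512, return_complex=True)
#   r, i = apply_complex_mask(noisy.real, noisy.imag, mask_r, mask_i)
#   enhanced = torch.istft(torch.complex(r, i), n_fft=512)
```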

Highlights

  • Speech enhancement (SE) aims to improve sound quality and intelligibility by removing acoustic corruption from noisy speech recorded in real-world environments

  • We reported the results for the U-net with an increased number of convolutional neural network (CNN) filters in the encoder, so that the U-net had a number of parameters comparable to the RPU-net, RAEU-net, and RAES U-net

  • We proposed an audio-visual SE (AVSE) model that could remove even competing speech based on the U-net structure to estimate spectral masks for real and imaginary components


Summary

INTRODUCTION

Speech enhancement (SE) aims to improve sound quality and intelligibility by removing acoustic corruption from noisy speech recorded in real-world environments. Since the speaker’s lip and facial movements are directly related to the speech, our AVSE model processes audio and visual information with a single encoder to obtain fused features from an early stage. The RA block estimates attention weights for every frame that reflect the inherent frequency-specific characteristics of the local features at each level, whereas similar spatial attention weights were obtained by a convolutional layer in [14]. To this end, the local feature AVv at time step t at a given level is used as the input, AVgap is obtained by global average pooling (GAP) along the channel axis, and the weight representing the relative importance, a value between 0 and 1, is estimated through an LSTM followed by the sigmoid function σ(·). In addition to being applied at each level of the encoder, the RA block is applied to the skip connections after the RP.
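The following is a minimal sketch, in PyTorch, of an RA block consistent with the description above: channel-wise global average pooling, an LSTM over the time frames, and a sigmoid producing frequency-specific weights in [0, 1] that rescale the local feature. The tensor layout, hidden size, and the linear projection back to the frequency dimension are assumptions rather than details taken from the paper.

```python
# Minimal RA-block sketch (not the authors' implementation):
# GAP along the channel axis, an LSTM over time, and a sigmoid that yields
# per-frame, frequency-specific weights used to rescale the local feature.
import torch
import torch.nn as nn

class RABlock(nn.Module):
    def __init__(self, num_freq_bins, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(num_freq_bins, hidden_size, batch_first=True)
        self.proj = nn.Linear(hidden_size, num_freq_bins)  # assumed projection back to freq bins

    def forward(self, av_feat):
        # av_feat: (batch, channel, freq, time) local audio-visual feature (assumed layout)
        gap = av_feat.mean(dim=1)                   # GAP along the channel axis -> (batch, freq, time)
        seq = gap.transpose(1, 2)                   # (batch, time, freq) sequence for the LSTM
        hidden, _ = self.lstm(seq)                  # per-frame recurrent summary
        weights = torch.sigmoid(self.proj(hidden))  # (batch, time, freq), values in [0, 1]
        weights = weights.transpose(1, 2).unsqueeze(1)  # (batch, 1, freq, time)
        return av_feat * weights                    # frequency-specific re-weighting
```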

CRITERION FUNCTION
EXPERIMENTS AND RESULTS
DATASET
CONCLUSION