Abstract

The presence of multi-speaker babble noise greatly degrades speech intelligibility for communication with cochlear implant (CI) users. Convolutional neural network (CNN) based speech enhancement has been widely used for noise suppression because it localizes spectro-temporal information within the feature set. However, suppressing noise without creating artifacts is challenging in low-SNR environments, and even more so when the noise is speech-like, such as babble noise. Transformers have emerged as a useful architecture whose innate global self-attention captures long-range dependencies and global context, which CNNs struggle to model. The Mel spectrogram exhibits strong correlation along both the time and frequency axes, and thus carries contextual information beneficial for speech enhancement. In this study, we propose TSUNet, which leverages both Transformers and U-Net to capture low-level detail and contextual information in the speech time-frequency domain. TSUNet employs Transformer layers at the bottleneck of a U-Net-style CNN. We also incorporate a neural vocoder to synthesize speech from the time-frequency representation without relying on the contaminated noisy phase. The performance of the proposed network is evaluated on simulated and real recordings of noisy speech, including CI listener testing, and it performs effectively in both scenarios.
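As a rough illustration of the architecture described above (not the authors' implementation), the following sketch shows a U-Net-style CNN over Mel-spectrogram features with Transformer encoder layers at the bottleneck. Layer counts, channel widths, and attention settings are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: U-Net with a Transformer bottleneck on Mel-spectrogram input.
# All hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    # Standard U-Net stage: two 3x3 convolutions with batch norm and ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )


class TransformerUNet(nn.Module):
    def __init__(self, channels=(32, 64, 128), n_heads=4, n_layers=2):
        super().__init__()
        # Encoder: CNN stages capture local spectro-temporal detail.
        self.enc1 = conv_block(1, channels[0])
        self.enc2 = conv_block(channels[0], channels[1])
        self.enc3 = conv_block(channels[1], channels[2])
        self.pool = nn.MaxPool2d(2)

        # Bottleneck: Transformer layers add global self-attention over the
        # downsampled time-frequency grid, supplying long-range context.
        layer = nn.TransformerEncoderLayer(d_model=channels[2], nhead=n_heads, batch_first=True)
        self.bottleneck = nn.TransformerEncoder(layer, num_layers=n_layers)

        # Decoder: upsampling with skip connections back to full resolution.
        self.up2 = nn.ConvTranspose2d(channels[2], channels[1], 2, stride=2)
        self.dec2 = conv_block(channels[2], channels[1])
        self.up1 = nn.ConvTranspose2d(channels[1], channels[0], 2, stride=2)
        self.dec1 = conv_block(channels[1], channels[0])
        self.out = nn.Conv2d(channels[0], 1, 1)  # enhanced Mel spectrogram (or a mask)

    def forward(self, x):                       # x: (batch, 1, mel_bins, frames)
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))

        # Flatten the 2-D feature map into a token sequence for self-attention.
        b, c, f, t = e3.shape
        tokens = e3.flatten(2).transpose(1, 2)  # (batch, f*t, channels)
        e3 = self.bottleneck(tokens).transpose(1, 2).reshape(b, c, f, t)

        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.out(d1)


if __name__ == "__main__":
    # 80 Mel bins x 128 frames; both dimensions divisible by the pooling factor.
    noisy_mel = torch.randn(2, 1, 80, 128)
    enhanced = TransformerUNet()(noisy_mel)
    print(enhanced.shape)  # torch.Size([2, 1, 80, 128])
```

The enhanced magnitude representation would then be passed to a neural vocoder for waveform synthesis, so the contaminated noisy phase is never used.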
