Abstract

In recent years, the development of deep learning algorithms has marked milestones in the field of speech processing. In particular, the release of pre-trained feature extraction models has considerably simplified the development of speech classification and recognition algorithms. However, environmental noise and reverberation still degrade overall performance, making robustness to noisy conditions essential in real-world applications. One way to mitigate the effect of noise is to integrate a speech enhancement front-end that removes noise and artifacts from the desired speech signal. Unlike state-of-the-art enhancement approaches, which operate either on the speech spectrogram or directly on the time-domain signal, in this paper we study how enhancement can be applied directly to the speech embeddings extracted with the Wav2Vec and WavLM models. Moreover, we investigate a variety of training strategies, considering different flavors of joint and disjoint training of the speech enhancement front-end with the classification/recognition back-end. We perform exhaustive experiments on the Fluent Speech Commands and Google Speech Commands datasets contaminated with noises from the Microsoft Scalable Noisy Speech Dataset, as well as on the LibriSpeech dataset contaminated with noises from the MUSAN dataset, considering intent classification, keyword spotting, and speech recognition tasks, respectively. Results show that directly enhancing the speech embeddings is a viable, computationally efficient approach, and provide insights into the most promising training strategies.
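To make the embedding-level enhancement concrete, the following is a minimal PyTorch sketch, not the paper's implementation: it assumes a frozen pre-trained WavLM extractor (via the Hugging Face transformers package), a small hypothetical enhancement network operating on the embedding sequence, and a classification head wired for joint training. All architecture choices and shapes below are illustrative assumptions.

```python
# Minimal sketch of embedding-domain enhancement (illustrative assumptions,
# not the paper's code). Requires the `transformers` package.
import torch
import torch.nn as nn
from transformers import WavLMModel


class EmbeddingEnhancer(nn.Module):
    """Small residual network that denoises embeddings (hypothetical architecture)."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # Predict a correction to the noisy embedding rather than the embedding itself.
        return emb + self.net(emb)


class EnhanceThenClassify(nn.Module):
    """Frozen extractor -> embedding enhancer -> classifier, trainable jointly."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.extractor = WavLMModel.from_pretrained("microsoft/wavlm-base")
        self.extractor.requires_grad_(False)  # keep the pre-trained front-end frozen
        dim = self.extractor.config.hidden_size
        self.enhancer = EmbeddingEnhancer(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, noisy_wave: torch.Tensor) -> torch.Tensor:
        # (batch, samples) 16 kHz waveform -> (batch, frames, dim) embeddings
        emb = self.extractor(noisy_wave).last_hidden_state
        enhanced = self.enhancer(emb)
        # Mean-pool over time, then classify (e.g., intent or keyword labels).
        return self.classifier(enhanced.mean(dim=1))
```

Under this sketch, a disjoint variant would first train the enhancer alone, for example with an MSE loss between enhanced embeddings of the noisy signal and embeddings of the corresponding clean signal, before attaching or fine-tuning the classifier.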
