Abstract
This contribution aims at speech model-based speech enhancement by exploiting the source-filter model of human speech production. The proposed method enhances the excitation signal in the cepstral domain by making use of a deep neural network (DNN). We investigate two types of target representations along with the significant effects of their normalization. The new approach exceeds the performance of a formerly introduced classical signal processing-based cepstral excitation manipulation (CEM) method in terms of noise attenuation by about 1.5 dB. We show that this gain also holds true when comparing serial combinations of envelope and excitation enhancement. In the important low-SNR conditions, no significant trade-off for speech component quality or speech intelligibility is induced, while allowing for substantially higher noise attenuation. In total, a traditional purely statistical state-of-the-art speech enhancement system is outperformed by more than 3 dB noise attenuation.
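To illustrate the source-filter decomposition in the cepstral domain that the method builds on (a minimal sketch, not the authors' implementation), the following Python snippet splits the real cepstrum of one windowed speech frame into a low-quefrency envelope part and a high-quefrency excitation part; the FFT size and lifter cutoff are illustrative assumptions.

```python
import numpy as np

def cepstral_split(frame, n_fft=512, lifter_cutoff=20):
    """Minimal sketch: separate one windowed speech frame into
    spectral-envelope and excitation cepstra via quefrency liftering.
    n_fft and lifter_cutoff are illustrative assumptions."""
    spectrum = np.fft.rfft(frame, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # log-magnitude spectrum
    cepstrum = np.fft.irfft(log_mag, n_fft)      # real cepstrum
    envelope_cep = np.zeros_like(cepstrum)
    # low quefrencies (and their mirror) describe the vocal-tract envelope
    envelope_cep[:lifter_cutoff] = cepstrum[:lifter_cutoff]
    envelope_cep[-(lifter_cutoff - 1):] = cepstrum[-(lifter_cutoff - 1):]
    # the remaining high quefrencies carry the excitation (pitch) structure
    excitation_cep = cepstrum - envelope_cep
    return envelope_cep, excitation_cep
```

In the setting described by the abstract, the excitation part is the quantity enhanced by the DNN, while envelope enhancement (CEE) can additionally be applied in series.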
Highlights
Speech enhancement is still a very important and active field of research
We show the upper limit of the cepstral excitation manipulation (CEM) approach by using the oracle excitation, and introduce the new deep neural network-based approach CEM-DNN with start and end decay, as well as its serial concatenation with the baseline cepstral envelope estimation (CEE), labelled CEE → CEM-DNN
The noise attenuation of CEM-DNN improves over CEM by up to 1 dB in the −5 dB SNR condition, while increasing MOS-LQO by more than 0.1 points and slightly improving the short-time objective intelligibility (STOI) measure
Summary
Speech enhancement is still a very important and active field of research. Its primary aim is to improve speech quality and intelligibility, to facilitate the most natural way of communication. Speech signals might be corrupted by, e.g., bandwidth limitation, coupling of noise, echo, and reverberation. Even though traditional systems might still be considered state of the art, recent advances in speech enhancement make more and more use of modern deep learning technologies, and often end-to-end solutions are presented (e.g., [1]–[3]). These authors address the problem by enhancing whole utterances at the waveform level, which requires the availability of complete recordings or at least a very large buffer. This is not applicable to telephony applications, where delay has to be as low as possible and frame-wise processing is essential. More recent advances will be presented briefly.
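As a minimal illustration of such frame-wise, low-latency processing (not the paper's system; the frame length, hop size, and the enhance_frame callback are hypothetical placeholders), the following sketch applies a per-frame spectral modification with overlap-add, so that only a single frame of audio needs to be buffered rather than the whole utterance.

```python
import numpy as np

def framewise_enhance(x, enhance_frame, frame_len=320, hop=160):
    """Minimal overlap-add sketch: process a signal frame by frame so
    that only frame_len samples of look-ahead are buffered.
    frame_len/hop (20 ms / 10 ms at 16 kHz) and the enhance_frame
    callback are illustrative assumptions."""
    window = np.sqrt(np.hanning(frame_len))   # sqrt-Hann for analysis and synthesis
    out = np.zeros(len(x))
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        spec = enhance_frame(spec)            # e.g., apply a per-frame spectral gain
        out[start:start + frame_len] += np.fft.irfft(spec, frame_len) * window
    return out

# usage: identity "enhancement" that leaves each frame unchanged
# y = framewise_enhance(x, lambda spec: spec)
```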