Abstract
The presence of degradations in speech signals, which causes acoustic mismatch between training and operating conditions, deteriorates the performance of many speech-based systems. A variety of enhancement techniques have been developed to compensate for this acoustic mismatch in speech-based applications. Applying these signal enhancement techniques, however, requires prior information about the presence and the type of degradation in the speech signal. In this paper, we propose a new convolutional neural network (CNN)-based approach to automatically identify the major types of degradation commonly encountered in speech-based applications, namely additive noise, nonlinear distortion, and reverberation. In this approach, a set of parallel CNNs, each detecting a certain degradation type, is applied to the log-mel spectrogram of the audio signal. Experimental results using two different speech types, namely pathological voice and normal running speech, show the effectiveness of the proposed method in detecting the presence and the type of degradation in speech signals, outperforming the state-of-the-art method. Using score-weighted class activation mapping, we provide a visual analysis of how the network makes decisions when identifying different types of degradation, by highlighting the regions of the log-mel spectrogram that are most influential for the target degradation.
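For illustration, the sketch below shows how such a front-end and one of the parallel detectors could look in practice: a log-mel spectrogram is computed from the waveform and fed to a small binary CNN, with one independent detector per degradation type. The layer sizes, mel parameters, and library choices (librosa, PyTorch) are assumptions made for the example and do not reflect the exact configuration reported in the paper.

```python
# Minimal sketch of the front-end and one parallel binary detector.
# Layer sizes, mel parameters, and input settings are illustrative
# assumptions, not the configuration used in the paper.
import librosa
import numpy as np
import torch
import torch.nn as nn

def log_mel_spectrogram(wav_path, sr=16000, n_mels=64):
    """Load audio and compute a log-mel spectrogram (n_mels x frames)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                         hop_length=256, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

class DegradationDetector(nn.Module):
    """One binary CNN; one instance per degradation type runs in parallel."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 1)  # presence/absence of one degradation type

    def forward(self, x):  # x: (batch, 1, n_mels, frames)
        h = self.features(x).flatten(1)
        return torch.sigmoid(self.classifier(h))

# One detector per degradation type, all applied to the same log-mel input.
detectors = {name: DegradationDetector()
             for name in ["noise", "distortion", "reverberation"]}
```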
Highlights
Advances in portable devices such as smartphones and tablets, which are equipped with high-quality microphones, facilitate capturing and processing speech signals in a wide range of environments
Here, t is the time index, s(t) is the clean speech signal recorded by a microphone in a noise-free and nonreverberant environment, e(t) is additive noise, ψ represents a nonlinear function, h(t) is a room impulse response (RIR), and ∗ denotes convolution
We used the mPower mobile Parkinson’s disease (MMPD) data set [25], which includes more than 65,000 voice samples of 10-second sustained phonations of the vowel /a/, recorded at a 44.1 kHz sampling frequency by PD patients and healthy speakers
Summary
Advances in portable devices such as smartphones and tablets, which are equipped with high-quality microphones, facilitate capturing and processing speech signals in a wide range of environments. The quality of the recordings is not necessarily as expected, as they might be subject to degradation. The most common types of degradation typically encountered in speech-based applications are background noise, reverberation, and nonlinear distortion. A speech signal degraded by additive noise, reverberation, and nonlinear distortion can be, respectively, modeled as follows: x_n(t) = s(t) + e(t) (1), x_r(t) = s(t) ∗ h(t) (2), x_d(t) = ψ(s(t)) (3), where t is the time index, s(t) is the clean speech signal recorded by a microphone in a noise-free and nonreverberant environment, e(t) is additive noise, ψ represents a nonlinear function, h(t) is a room impulse response (RIR), and ∗ denotes convolution.
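As a rough illustration of these three models, the following sketch generates noisy, reverberant, and nonlinearly distorted versions of a clean signal. The noise level, the synthetic room impulse response, and the choice of hard clipping as the nonlinearity ψ are assumptions made for the example only.

```python
# Illustrative sketch of the three degradation models above, applied to a
# clean signal s. Noise level, RIR, and the clipping nonlinearity are
# assumptions for this example, not values from the paper.
import numpy as np

def degrade(s, e, h, clip_level=0.3):
    """Return noisy, reverberant, and nonlinearly distorted versions of s."""
    x_n = s + e[:len(s)]                       # additive noise:  x_n(t) = s(t) + e(t)
    x_r = np.convolve(s, h)[:len(s)]           # reverberation:   x_r(t) = s(t) * h(t)
    x_d = np.clip(s, -clip_level, clip_level)  # distortion:      x_d(t) = psi(s(t))
    return x_n, x_r, x_d

# Toy usage with synthetic data (a real case would use recorded speech,
# measured noise, and a measured room impulse response).
rng = np.random.default_rng(0)
fs = 16000
s = np.sin(2 * np.pi * 220 * np.arange(fs) / fs)                  # 1 s "clean" tone
e = 0.05 * rng.standard_normal(fs)                                # white noise
h = np.exp(-np.arange(2000) / 400.0) * rng.standard_normal(2000)  # crude exponentially decaying RIR
x_n, x_r, x_d = degrade(s, e, h)
```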