Abstract

The presence of degradations in speech signals, which causes an acoustic mismatch between training and operating conditions, deteriorates the performance of many speech-based systems. A variety of enhancement techniques have been developed to compensate for this acoustic mismatch in speech-based applications. To apply these signal enhancement techniques, however, prior information about the presence and the type of degradation in a speech signal is required. In this paper, we propose a new convolutional neural network (CNN)-based approach to automatically identify the major types of degradation commonly encountered in speech-based applications, namely additive noise, nonlinear distortion, and reverberation. In this approach, a set of parallel CNNs, each detecting a certain degradation type, is applied to the log-mel spectrogram of the audio signal. Experimental results on two different speech types, namely pathological voice and normal running speech, demonstrate that the proposed method effectively detects the presence and the type of degradation in speech signals and outperforms the state-of-the-art method. Using score-weighted class activation mapping, we provide a visual analysis of how the network makes its decisions when identifying different types of degradation, by highlighting the regions of the log-mel spectrogram that are most influential for the target degradation.
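As a rough illustration of the parallel-detector idea described above, the Python sketch below computes a log-mel spectrogram and passes it through three independent binary CNNs, one per degradation type. The network sizes, feature settings, and library choices (librosa, PyTorch) are illustrative assumptions and do not reproduce the architecture used in the paper.

import numpy as np
import librosa
import torch
import torch.nn as nn

def log_mel(y, sr, n_mels=64):
    # Log-mel spectrogram shaped (channel=1, n_mels, frames), the CNN input.
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    S_db = librosa.power_to_db(S, ref=np.max)
    return torch.from_numpy(S_db).float().unsqueeze(0)

def make_detector():
    # A small binary CNN; its single output logit scores one degradation type.
    return nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1),
    )

# One detector per degradation type, applied in parallel to the same input.
detectors = {name: make_detector() for name in ("noise", "distortion", "reverberation")}

sr = 16000
y = np.random.default_rng(0).standard_normal(10 * sr).astype(np.float32)  # stand-in audio
x = log_mel(y, sr).unsqueeze(0)  # (batch, 1, n_mels, frames)
with torch.no_grad():
    scores = {name: torch.sigmoid(net(x)).item() for name, net in detectors.items()}
print(scores)  # independent presence probabilities, one per degradation type

Because each detector scores its own degradation type independently, a recording can in principle be flagged with more than one type at once.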

Highlights

  • Advances in portable devices such as smartphones and tablets, which are equipped with high-quality microphones, facilitate capturing and processing speech signals in a wide range of environments

  • In the degradation models, t is the time index, s(t) is the clean speech signal recorded by a microphone in a noise-free and non-reverberant environment, e(t) is additive noise, ψ represents a nonlinear function, h(t) is a room impulse response (RIR), and ∗ denotes convolution

  • We used the mPower mobile Parkinson’s disease (MMPD) data set [25], which includes more than 65,000 voice samples of 10-second sustained phonations of the vowel /a/ recorded at a 44.1 kHz sampling frequency by PD patients and healthy speakers


Summary

Introduction

Advances in portable devices such as smartphones and tablets, which are equipped with high-quality microphones, facilitate capturing and processing speech signals in a wide range of environments. The quality of the recordings is not necessarily as expected, since they might be subject to degradation. The most common types of degradation encountered in speech-based applications are background noise, reverberation, and nonlinear distortion. A speech signal degraded by additive noise, reverberation, and nonlinear distortion can be modeled, respectively, as

x_n(t) = s(t) + e(t),   (1)
x_r(t) = s(t) ∗ h(t),   (2)
x_d(t) = ψ(s(t)),       (3)

where t is the time index, s(t) is the clean speech signal recorded by a microphone in a noise-free and non-reverberant environment, e(t) is additive noise, ψ represents a nonlinear function, h(t) is a room impulse response (RIR), and ∗ denotes convolution.
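The three degradation models in Eqs. (1)-(3) can be simulated directly; the short NumPy/SciPy sketch below adds noise at a chosen SNR, convolves the signal with a room impulse response, and applies a nonlinearity. The hard-clipping choice for ψ, the toy impulse response, and the random stand-in signals are illustrative assumptions, not details from the paper.

import numpy as np
from scipy.signal import fftconvolve

def add_noise(s, e, snr_db):
    # x_n(t) = s(t) + e(t), with the noise e scaled to the requested SNR in dB.
    e = e[: len(s)]
    scale = np.sqrt(np.sum(s ** 2) / (np.sum(e ** 2) * 10 ** (snr_db / 10)))
    return s + scale * e

def reverberate(s, h):
    # x_r(t) = s(t) ∗ h(t): convolution of the clean speech with an RIR h(t).
    return fftconvolve(s, h)[: len(s)]

def distort(s, clip_level=0.3):
    # x_d(t) = ψ(s(t)); here ψ is hard clipping, one common example nonlinearity.
    return np.clip(s, -clip_level, clip_level)

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)                  # stand-in for one second of clean speech
e = rng.standard_normal(16000)                  # stand-in for a noise recording
h = np.zeros(2000); h[0] = 1.0; h[400] = 0.5    # toy room impulse response
x_n = add_noise(s, e, snr_db=10)
x_r = reverberate(s, h)
x_d = distort(s)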

