Abstract

Time–frequency representations of speech signals provide dynamic information about how the frequency components change over time. To process this information, deep learning models with convolution layers can be used to obtain feature maps. In many speech processing applications, the time–frequency representations are obtained by applying the short-time Fourier transform, and single-channel input tensors are used to feed the models. However, this may limit the potential of convolutional networks to learn different representations of the audio signal. In this paper, we propose a methodology that combines three different time–frequency representations of the signals, obtained by computing the continuous wavelet transform, Mel-spectrograms, and Gammatone spectrograms and combining them into 3-channel spectrograms, to analyze speech in two different applications: (1) automatic detection of speech deficits in cochlear implant users and (2) phoneme class recognition to extract phone-attribute features. For this, two different deep learning-based models are considered: convolutional neural networks and recurrent neural networks with convolution layers.
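
As a rough illustration of the idea (not the paper's exact pipeline), the sketch below stacks three time–frequency representations of the same utterance into one 3-channel input tensor. The Mel spectrogram and the CWT are computed with librosa and PyWavelets; the Gammatone spectrogram is assumed to come from the third-party `gammatone` package, and the sampling rate, number of bands, window size, and hop size are illustrative assumptions only.

```python
# Sketch: building a 3-channel spectrogram from Mel, CWT, and Gammatone
# representations. Parameters and the gammatone dependency are assumptions,
# not the configuration reported in the paper.
import numpy as np
import librosa
import pywt

def three_channel_spectrogram(path, sr=16000, n_bins=128, hop=160, win=400):
    y, _ = librosa.load(path, sr=sr)

    # Channel 1: log Mel spectrogram
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=win,
                                         hop_length=hop, n_mels=n_bins)
    mel = librosa.power_to_db(mel)

    # Channel 2: continuous wavelet transform (Morlet), magnitude,
    # subsampled in time to roughly match the STFT frame rate
    coeffs, _ = pywt.cwt(y, np.arange(1, n_bins + 1), "morl")
    cwt_mag = np.abs(coeffs)[:, ::hop]

    # Channel 3: Gammatone spectrogram (third-party helper, assumed available)
    from gammatone.gtgram import gtgram
    gt = np.log(gtgram(y, sr, win / sr, hop / sr, n_bins, 50) + 1e-10)

    # Crop all channels to a common (n_bins, T) grid, normalize, and stack
    T = min(mel.shape[1], cwt_mag.shape[1], gt.shape[1])
    chans = [c[:n_bins, :T] for c in (mel, cwt_mag, gt)]
    chans = [(c - c.mean()) / (c.std() + 1e-8) for c in chans]
    return np.stack(chans, axis=-1)   # shape: (n_bins, T, 3)
```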

Highlights

  • In speech and audio processing applications, the data are commonly processed by computing compressed representations that may not capture the dynamic information of the signals

  • In our previous work [12], we showed that combining at least two different time–frequency representations of the signals can improve the automatic detection of speech deficits in cochlear implant (CI) users by training a bi-class convolutional neural network (CNN) to differentiate between speech signals from CI users and healthy control (HC) speakers

  • This paper extends the use of multi-channel spectrograms to phoneme recognition using recurrent neural networks with convolutional layers (CRNN)
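
A minimal sketch of the kind of CRNN mentioned above is shown below, assuming a Keras implementation with two convolution blocks followed by a bidirectional GRU; the layer sizes, input length, and number of phoneme classes are illustrative assumptions, not the architecture reported in the paper.

```python
# Sketch of a CRNN (convolution layers + GRU) over 3-channel spectrograms.
from tensorflow.keras import layers, models

def build_cgru(n_frames=300, n_bins=128, n_channels=3, n_classes=5):
    inp = layers.Input(shape=(n_frames, n_bins, n_channels))  # (time, freq, ch)

    # Convolutional front-end: local time-frequency feature maps
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(pool_size=(1, 2))(x)   # pool along frequency only
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(1, 2))(x)

    # Collapse the frequency axis so the GRU sees one vector per time frame
    x = layers.Reshape((n_frames, (n_bins // 4) * 64))(x)

    # Recurrent back-end: models the temporal evolution of the feature maps
    x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)

    # Frame-wise phoneme-class posteriors
    out = layers.TimeDistributed(layers.Dense(n_classes, activation="softmax"))(x)
    return models.Model(inp, out)

model = build_cgru()
model.compile(optimizer="adam", loss="categorical_crossentropy")
```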

Summary

Introduction

In speech and audio processing applications, the data are commonly processed by computing compressed representations that may not capture the dynamic information of the signals. In [11], a methodology was presented to enhance noisy audio signals using complex spectrograms and CNNs. In that work, the real and imaginary parts of the STFT are computed to form a 2-channel spectrogram, which is processed by the convolution layers; in this way, both the amplitude and phase information of the signal are considered to extract the feature maps. Cochleagrams are obtained with a Gammatone filter bank based on the cochlear model proposed in [13], which consists of an array of bandpass filters organized from high frequencies at the base of the cochlea to low frequencies at the apex (the innermost part of the cochlea). Both Mel and Gammatone spectrograms are computed from the STFT, whose time and frequency resolutions are determined by the size of the analysis window and the time shift.
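
To make the 2-channel complex-spectrogram idea from [11] concrete, the following sketch stacks the real and imaginary parts of the STFT as two input channels; the window and hop sizes are illustrative assumptions, not the settings used in [11].

```python
# Sketch: a 2-channel "complex spectrogram" built from the real and
# imaginary parts of the STFT, so convolution layers receive both
# amplitude and phase information.
import numpy as np
import librosa

def complex_spectrogram(y, n_fft=512, hop_length=160, win_length=400):
    # The STFT's time and frequency resolutions are set by the analysis
    # window (win_length) and the time shift between frames (hop_length).
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length,
                        win_length=win_length)
    # Stack real and imaginary parts depth-wise: shape (freq, time, 2)
    return np.stack([stft.real, stft.imag], axis=-1)

# Example on a 1-second synthetic tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 440 * t).astype(np.float32)
print(complex_spectrogram(y).shape)   # (257, 101, 2)
```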

Time–frequency analysis
Continuous wavelet transform
Convolutional neural network
Recurrent neural network with convolution layers
Automatic detection of disordered speech in CI users
Data: CI speech
Preprocessing
Training of the CNN
Phone‐attribute features
Data: Verbmobil
Training of the CGRU
Multi‐channel spectrograms with CGRU
Conclusion
Compliance with ethical standards