Abstract

Speech analysis for Automatic Speech Recognition (ASR) systems typically starts with a Short-Time Fourier Transform (STFT), which implies selecting a fixed point in the time-frequency resolution trade-off. This approach, combined with a Mel-frequency scaled filterbank and a Discrete Cosine Transform, gives rise to the Mel-Frequency Cepstral Coefficients (MFCC), which have been the most common speech features in speech processing for the last few decades. These features were particularly well suited to the previous state of the art in ASR, based on Hidden Markov Models and Gaussian Mixture Models (HMM/GMM). In particular, they are highly uncorrelated and of small dimensionality (typically 13 coefficients plus deltas and double deltas), which was very convenient for diagonal-covariance GMMs, for dealing with the curse of dimensionality and for the limited computing resources of a decade ago. Currently, most ASR systems use Deep Neural Networks (DNN) instead of GMMs to model the acoustic features, which provides more flexibility in the definition of the features. In particular, acoustic features can be highly correlated and much larger in size, because DNNs are very powerful at processing high-dimensional inputs. Also, computing hardware has evolved to a point where computational cost is a less relevant issue in speech processing. In this context we revisit the problem of time-frequency resolution in speech analysis, and in particular we check whether multi-resolution speech analysis (both in time and in frequency) can help improve acoustic modeling with DNNs. Our experiments start from several Kaldi baseline systems for the well-known TIMIT corpus and modify them by adding multi-resolution speech representations, obtained by concatenating spectra computed with different time-frequency resolutions, as well as post-processed and speaker-adapted features computed with different time-frequency resolutions. Our experiments show that a multi-resolution speech representation tends to improve over the baseline single-resolution representation, which seems to confirm our main hypothesis. However, combining multi-resolution with the highly post-processed and speaker-adapted features, which provide the best results in Kaldi for TIMIT, yields only very modest improvements.
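
As an illustration of the multi-resolution idea described in the abstract, the following sketch concatenates log-Mel features computed with several STFT window lengths but a common frame rate, so that the resulting frames can be stacked and fed to a DNN acoustic model. It uses librosa for the analysis; the window lengths, filterbank size and sampling rate are illustrative assumptions, not the exact settings of the Kaldi recipes used in the paper.

```python
# Minimal sketch of a multi-resolution speech representation (assumed parameters,
# not the paper's exact Kaldi configuration): log-Mel features computed with
# several analysis window lengths are concatenated frame by frame.
import numpy as np
import librosa

def multi_resolution_features(wav_path, sr=16000, hop_ms=10,
                              win_ms_list=(10, 25, 50), n_mels=40):
    """Return a (n_frames, n_mels * len(win_ms_list)) feature matrix."""
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(sr * hop_ms / 1000)              # common hop so frames align
    feats = []
    for win_ms in win_ms_list:
        win = int(sr * win_ms / 1000)
        n_fft = int(2 ** np.ceil(np.log2(win)))
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop, win_length=win,
                                             n_mels=n_mels)
        feats.append(np.log(mel + 1e-10))      # log-Mel, shape (n_mels, n_frames)
    n_frames = min(f.shape[1] for f in feats)  # guard against off-by-one frame counts
    return np.concatenate([f[:, :n_frames] for f in feats], axis=0).T

# Example: features = multi_resolution_features("utt.wav")
```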

Highlights

  • Automatic speech recognition (ASR) aims at converting speech signals into textual representations and is an essential part of data analysis applications that process multimedia content, such as keyword spotting and speaker detection, and of applications that use voice in human-machine interfaces, such as intelligent personal assistants, interactive voice response (IVR) systems and voice search, to name a few. For over two decades, the main paradigm in ASR was to use Hidden Markov Models (HMMs) to model the temporal evolution of speech and Gaussian Mixture Models (GMMs) to model the acoustic characteristics of speech at each phonetic state [1], with statistical n-gram language models improving recognition accuracy by modeling the probabilities of different word sequences

  • In particular, deep neural networks (DNNs) have replaced GMMs to model the acoustic characteristics of speech at each phonetic state, but the rest of the architecture is still kept in many practical systems

  • This way, a feedforward DNN used to perform a classification task might have the following general structure: an input layer, which is fed with input vectors representing the data; two or more hidden layers, where a transformation is applied to the output of the previous layer, obtaining a higher-level representation as we move away from the input layer; and an output layer, which computes the output of the DNN (a minimal sketch of such a network is shown below)
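
As a concrete, purely illustrative instance of the feedforward structure described in the last highlight, the PyTorch sketch below builds a classifier with an input layer, two hidden layers with non-linear activations, and an output layer producing class scores. The layer sizes and the choice of framework are assumptions for illustration, not the architecture used in the paper.

```python
# Illustrative feedforward DNN classifier (assumed sizes, not the paper's model):
# input layer -> two hidden layers with ReLU -> output layer with class scores.
import torch
import torch.nn as nn

n_inputs, n_hidden, n_classes = 440, 1024, 48   # e.g. stacked frames -> phone states

model = nn.Sequential(
    nn.Linear(n_inputs, n_hidden),   # hidden layer 1
    nn.ReLU(),
    nn.Linear(n_hidden, n_hidden),   # hidden layer 2
    nn.ReLU(),
    nn.Linear(n_hidden, n_classes),  # output layer (logits)
)

x = torch.randn(8, n_inputs)                 # a batch of 8 input vectors
log_probs = torch.log_softmax(model(x), -1)  # class posteriors in log domain
```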


Summary

Introduction

Automatic speech recognition (ASR) aims at converting speech signals into textual representations and is an essential part of data analysis applications that process multimedia (audio/video) content, such as keyword spotting and speaker detection, and of applications that use voice in human-machine interfaces, such as intelligent personal assistants, interactive voice response (IVR) systems and voice search, to name a few. In recent years, deep neural networks (DNNs) have replaced GMMs to model the acoustic characteristics of speech at each phonetic state, but the rest of the architecture is still kept in many practical systems. Attention models have been introduced in deep learning systems to help DNNs focus on specific parts of the input (for instance, relevant regions of an image [11]), and have been applied to help RNNs focus on specific parts of a speech signal, enabling end-to-end DNN speech recognition with excellent results [12] on TIMIT and on larger corpora such as the Wall Street Journal corpus [13]. We are interested in finding representations of the speech signal that can improve ASR under different deep learning approaches, and we explore the use of a multi-resolution (both in time and in frequency) representation of the speech signal to facilitate DNN learning of the acoustic characteristics of the different phones

Motivation
Materials and methods
Evaluation metrics
Results and discussion
Results with simplified features
Conclusion