HMM Mixtures (HMM2) for Robust Speech Recognition

Katrin Weber

doi:10.5075/epfl-thesis-2790

Abstract

State-of-the-art automatic speech recognition (ASR) techniques are typically based on hidden Markov models (HMMs) for the modeling of temporal sequences of feature vectors extracted from the speech signal. At the level of each HMM state, Gaussian mixture models (GMMs) or artificial neural networks (ANNs) are commonly used in order to model the state emission probabilities. However, both GMMs and ANNs are rather rigid, as they are incapable of adapting to variations inherent in the speech signal, such as inter- and intra-speaker variations. Moreover, performance degradations of these systems are severe in the case of unmatched conditions such as in the presence of environmental noise. A lot of research effort is currently being devoted to overcoming these problems. The principal objective of this thesis is to explore new approaches towards a more robust and adaptive modeling of speech. In this context, different aspects of the modeling of speech data with HMMs and GMMs are investigated. Particular attention is given to the modeling of correlation. While correlation between different feature vectors (corresponding to temporal correlation) is typically modeled by the HMM, correlation between feature vector components (e.g., correlation in frequency) is modeled by the GMM part of the model. This thesis starts with the investigation of two potential ways to improve the modeling of correlation, consisting of (1) a shift of the modeling of temporal correlation towards GMMs, and (2) the modeling of correlation within each feature vector by a particular type of HMM. This leads to the development of a novel approach, referred to as "HMM2", which is a major focus of this thesis. HMM2 is a particular mixture of hidden Markov models, where state emission probabilities of the temporal (primary) HMM are modeled through (secondary) state-dependent frequency-based HMMs. Low-dimensional GMMs are used for modeling the state emission probabilities of the secondary HMM states.
Therefore, HMM2 can be seen as a generalization of conventional HMMs, which it includes as a special case. HMM2 may have several advantages as compared to standard systems. While the primary HMM performs time warping and time integration, the secondary HMM performs warping and integration along the frequency dimension of the speech signal. Frequency correlation is modeled through the secondary HMM topology. Due to the implicit, non-linear, state-dependent spectral warping performed by the secondary HMM, HMM2 may be viewed as a dynamic extension of the multi-band approach. Moreover, this frequency warping property may result in better, more flexible modeling and parameter sharing. After an investigation of theoretical and practical aspects of HMM2, encouraging recognition results for the case of speech degraded by additive noise are given. Due to the spectral warping property of HMM2, this model is able to extract pertinent structural information of the speech signal, which is reflected in the trained model parameters. Consequently, such an HMM2 system can also be used to explicitly extract structures of a speech signal, which can then be converted into a new kind of ASR features, referred to as "HMM2 features". Frequency bands with similar characteristics are expected to be emitted by the same secondary HMM state. The warping along the frequency dimension of speech thus results in an adaptable, data-driven frequency segmentation. Since it can be assumed that different secondary HMM states model spectral regions characterized by high and low energies respectively, this segmentation may be related to formant structures. The application of HMM2 as a feature extractor is investigated, and it is shown that a system combining HMM2 features with conventional noise-robust features yields improved speech recognition robustness. Moreover, a comparison of HMM2 features with formant tracks shows comparable performance on a vowel classification task.
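
The idea of a secondary HMM running along the frequency axis can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes that each feature vector is read as a sequence of scalar frequency coefficients, that each secondary state emits through a single 1-D Gaussian (a stand-in for the low-dimensional GMMs mentioned above), and that the emission likelihood of a primary HMM state is obtained with the forward algorithm over that sequence. All names and parameter values (`secondary_hmm_loglik`, `LOG_INIT`, `LOG_TRANS`, `MEANS`, `VARIANCES`) are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def secondary_hmm_loglik(frame, log_init, log_trans, means, variances):
    """Forward algorithm along the frequency axis of one feature vector.

    frame          : scalar frequency coefficients, one per band (assumed layout)
    log_init[k]    : log probability of starting in secondary state k
    log_trans[j][k]: log probability of moving from state j to state k
    means, variances: per-state 1-D Gaussian parameters (simplifying assumption)
    """
    n_states = len(log_init)
    # Initialise with the first frequency coefficient.
    alpha = [
        log_init[k] + log_gauss(frame[0], means[k], variances[k])
        for k in range(n_states)
    ]
    # Recurse over the remaining coefficients, i.e. along frequency, not time.
    for x in frame[1:]:
        alpha = [
            logsumexp([alpha[j] + log_trans[j][k] for j in range(n_states)])
            + log_gauss(x, means[k], variances[k])
            for k in range(n_states)
        ]
    # Sum over all final states: this value plays the role of the
    # emission log-likelihood of one primary (temporal) HMM state.
    return logsumexp(alpha)


# Hypothetical two-state left-to-right secondary HMM:
# state 0 models a low-energy region, state 1 a high-energy region.
LOG_INIT = [0.0, -math.inf]
LOG_TRANS = [[math.log(0.7), math.log(0.3)], [-math.inf, 0.0]]
MEANS = [0.0, 5.0]
VARIANCES = [1.0, 1.0]

frame = [0.1, -0.2, 4.8, 5.1]  # made-up coefficients: low band, then high band
loglik = secondary_hmm_loglik(frame, LOG_INIT, LOG_TRANS, MEANS, VARIANCES)
```

With a single secondary state, the function reduces to a sum of independent per-coefficient Gaussian log densities, which mirrors the statement that conventional HMMs are included as a special case. The position at which the most likely state sequence switches from state 0 to state 1 would correspond to the data-driven frequency segmentation described in the abstract.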
