Models For Continuous Speech Recognition Research Articles

We propose a novel approach of integrating exemplar-based template matching with statistical modeling to improve continuous speech recognition. We choose the template unit to be context-dependent phone segments (triphone context) and use multiple Gaussian mixture model (GMM) indices to represent each frame of speech templates. We investigate two different local distances, log likelihood ratio (LLR) and Kullback-Leibler (KL) divergence, for dynamic time warping (DTW)-based template matching. To reduce computation and storage complexity, we also propose two methods for template selection: minimum distance template selection (MDTS) and maximum likelihood template selection (MLTS). We further propose to fine-tune the MLTS template representatives using a GMM merging algorithm so that the GMMs better represent the frames of the selected template representatives. Experimental results on the TIMIT phone recognition task and on a large vocabulary continuous speech recognition (LVCSR) task of telehealth captioning demonstrated that integrating template matching with statistical modeling significantly improved recognition accuracy over the hidden Markov model (HMM) baselines for both tasks. The template selection methods also provided significant accuracy gains over the HMM baseline while substantially reducing computation and storage complexity. When all templates or MDTS were used, the LLR local distance performed better than the KL local distance. For MLTS and template compression, the KL local distance performed better than the LLR local distance, and template compression further improved recognition accuracy on top of MLTS at a lower computational cost.
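The DTW-based template matching and template selection described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: each template frame is reduced to a single diagonal Gaussian (the paper represents frames with multiple GMM indices), the local distance is a symmetrized KL divergence between those Gaussians (the LLR variant is not shown), and MDTS is interpreted here as picking the medoid template, the one with the smallest summed DTW distance to the other templates of the same triphone unit. The names `kl_diag`, `sym_kl`, `dtw`, and `select_min_distance_template` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Assumption: one diagonal Gaussian (means, variances) per template frame.
Frame = Tuple[Sequence[float], Sequence[float]]


def kl_diag(p: Frame, q: Frame) -> float:
    """KL(p || q) between two diagonal-covariance Gaussians."""
    (mu_p, var_p), (mu_q, var_q) = p, q
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total


def sym_kl(p: Frame, q: Frame) -> float:
    """Symmetrized KL divergence, used here as the DTW local distance."""
    return kl_diag(p, q) + kl_diag(q, p)


def dtw(a: List[Frame], b: List[Frame],
        local: Callable[[Frame, Frame], float] = sym_kl) -> float:
    """Classic DTW with steps (1,0), (0,1), (1,1); returns the accumulated cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def select_min_distance_template(templates: List[List[Frame]]) -> int:
    """Medoid-style selection (assumed reading of MDTS): return the index of the
    template whose summed DTW distance to all other templates is smallest."""
    best_idx, best_score = 0, float("inf")
    for i, t in enumerate(templates):
        score = sum(dtw(t, u) for j, u in enumerate(templates) if j != i)
        if score < best_score:
            best_idx, best_score = i, score
    return best_idx


if __name__ == "__main__":
    # Toy 1-D templates with unit variance; the symmetrized KL then reduces
    # to the squared difference of the means.
    def frames(means):
        return [([m], [1.0]) for m in means]

    templates = [frames([0.0, 1.0, 2.0]), frames([0.0, 1.1, 2.0]), frames([5.0, 5.0])]
    print(select_min_distance_template(templates))  # prints 1
```

Passing a different `local` function to `dtw` is where an LLR-style distance would be swapped in; the DTW recursion itself is unchanged.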


Currently, most speech recognition architectures model the speech signal as a nonoverlapping sequence of phonetic segments. A set of phonetic models is created to capture the acoustic-phonetic properties of individual phones, but these models do not explicitly model the transitions between phones. It is readily apparent, however, that these transitions contain important information about the identity of neighboring phones. While context-dependent phonetic modeling may capture some of this information, more explicit models of phonetic transitions could likely offer performance improvements. In this talk, the use of phonetic transition models will be discussed within the context of SUMMIT, a segment-based continuous speech recognition system [Zue et al., “Acoustic Segmentation and Phonetic Classification in the SUMMIT Speech Recognition System,” Proc. ICASSP 89, pp. 389–392, Glasgow, Scotland (1989)]. The transition models use a feature vector based on Mel-frequency spectral coefficients (MFSCs). The vector is created by concatenating multiple spectral averages on both sides of a transition; in one configuration, for example, eight averages spanning a total interval of 150 ms were used. To reduce the number of dimensions, a principal component analysis was performed. The transitions are modeled with a set of diagonal Gaussian models. The models were tested by applying them to the N-best sentence hypotheses from the recognition system: each hypothesis is rescored using a linear combination (optimized on training data) of the segment and transition scores. Initial experiments have resulted in 10%–20% reductions in word error rates.
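The transition features and N-best rescoring described above can be sketched as follows. This is an illustration under stated assumptions, not the SUMMIT implementation: frames are plain MFSC vectors indexed by frame number, the window widths are hypothetical, each transition label has a single diagonal Gaussian, and the PCA projection step is omitted for brevity. The parameter `weight` stands in for the linear combination that the abstract says was optimized on training data; all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
# Hypothetical model format: transition label -> (mean vector, variance vector).
DiagGaussian = Tuple[Vector, Vector]


def _mean(block: Sequence[Vector]) -> Vector:
    """Per-dimension average of a block of MFSC frames."""
    n = len(block)
    return [sum(col) / n for col in zip(*block)]


def transition_feature(frames: Sequence[Vector], boundary: int,
                       widths: Sequence[int]) -> Vector:
    """Concatenate spectral averages on both sides of a segment boundary.

    `widths` (in frames) are hypothetical; windows are laid out outward
    from the boundary on each side, nearest window first.
    """
    span = sum(widths)
    if any(w <= 0 for w in widths) or boundary - span < 0 or boundary + span > len(frames):
        raise ValueError("invalid widths or not enough context around the boundary")
    feature: Vector = []
    end = boundary  # left side, moving backward in time
    for w in widths:
        feature.extend(_mean(frames[end - w:end]))
        end -= w
    start = boundary  # right side, moving forward in time
    for w in widths:
        feature.extend(_mean(frames[start:start + w]))
        start += w
    return feature


def diag_gauss_loglik(x: Vector, model: DiagGaussian) -> float:
    """Log-likelihood of x under a diagonal-covariance Gaussian."""
    mean, var = model
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def transition_score(frames: Sequence[Vector], boundaries: Sequence[int],
                     labels: Sequence[str], models: Dict[str, DiagGaussian],
                     widths: Sequence[int]) -> float:
    """Sum of transition log-likelihoods over one hypothesis's boundaries."""
    return sum(diag_gauss_loglik(transition_feature(frames, b, widths), models[lab])
               for b, lab in zip(boundaries, labels))


def rescore_nbest(hyps: List[Tuple[str, float, float]],
                  weight: float) -> List[Tuple[str, float, float]]:
    """Sort (hypothesis, segment_score, transition_score) tuples, best first,
    by (1 - weight) * segment_score + weight * transition_score."""
    return sorted(hyps, key=lambda h: (1.0 - weight) * h[1] + weight * h[2],
                  reverse=True)


if __name__ == "__main__":
    hyps = [("hyp A", -10.0, -5.0), ("hyp B", -9.0, -8.0)]
    print([h[0] for h in rescore_nbest(hyps, weight=0.5)])  # ['hyp A', 'hyp B']
```

With `weight=0.0` only the segment scores count and "hyp B" would rank first; raising the weight lets the transition scores reorder the list.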


Articles published on Models For Continuous Speech Recognition

Building Acoustic and Language Model for Continuous Speech Recognition in Bahasa Indonesia

Arabic phonemes recognition using hybrid LVQ/HMM model for continuous speech recognition

Review on Acoustic Modeling for Continuous Speech Recognition

Integrated exemplar-based template matching and statistical modeling for continuous speech recognition

Syllable modeling in continuous speech recognition for Tamil language

Framework for Choosing a Set of Syllables and Phonemes for Lithuanian Speech Recognition

A hierarchical Bayesian model for continuous speech recognition

Audio-visual speech modeling for continuous speech recognition

A Bayesian approach for building triphone models for continuous speech recognition

Phonetic transition modeling for continuous speech recognition

A Hybrid Continuous Speech Recognition System Using Segmental Neural Nets with Hidden Markov Models

Context modeling with the stochastic segment model

Development of an acoustic-phonetic hidden Markov model for continuous speech recognition

Interword coarticulation modeling for continuous speech recognition

On the use of triphone models for continuous speech recognition
