Integrated exemplar-based template matching and statistical modeling for continuous speech recognition

Xie Sun, Yunxin Zhao

doi:10.1186/1687-4722-2014-4

Abstract

We propose a novel approach that integrates exemplar-based template matching with statistical modeling to improve continuous speech recognition. We choose context-dependent phone segments (triphone context) as the template unit and use multiple Gaussian mixture model (GMM) indices to represent each frame of the speech templates. We investigate two local distances, log likelihood ratio (LLR) and Kullback-Leibler (KL) divergence, for dynamic time warping (DTW)-based template matching. To reduce computation and storage complexity, we also propose two methods for template selection: minimum distance template selection (MDTS) and maximum likelihood template selection (MLTS). We further propose to fine-tune the MLTS template representatives with a GMM merging algorithm so that the GMMs better represent the frames of the selected template representatives. Experimental results on the TIMIT phone recognition task and on a large vocabulary continuous speech recognition (LVCSR) task of telehealth captioning show that integrating template matching with statistical modeling significantly improves recognition accuracy over the hidden Markov modeling (HMM) baselines on both tasks. The template selection methods also provide significant accuracy gains over the HMM baseline while substantially reducing computation and storage complexity. When all templates or MDTS were used, the LLR local distance performed better than the KL local distance. For MLTS and template compression, the KL local distance performed better than the LLR local distance, and template compression further improved recognition accuracy on top of MLTS at a lower computational cost.

Highlights

In speech recognition, hidden Markov modeling (HMM) has been the dominant approach since it provides a principled way of jointly modeling speech spectral variations and time dynamics
To facilitate comparison of the templates labeled by Gaussian mixture model (GMM) indices, we propose the local distances of log likelihood ratio (LLR) and Kullback-Leibler (KL) divergence for dynamic time warping (DTW)-based template matching
On the TIMIT phone recognition task (Section 5.3), we provide a detailed account of the factors in the proposed template matching methods that affect the rescoring performance, including local distances, number of GMMs employed for frame labeling, template selection, compression methods and their interactions with the local distances, and the percentage of selected template representatives
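
To make the DTW-based matching with a pluggable local distance concrete, here is a minimal sketch in Python. It is not the authors' implementation: the page does not give the exact LLR or KL definitions over GMM indices, so the sketch assumes each frame is a small discrete weight vector and uses a symmetric KL divergence between those vectors as a stand-in local distance. The names `sym_kl` and `dtw_distance`, the `eps` floor, and the three allowed warping steps are all illustrative assumptions.

```python
import math


def sym_kl(p, q, eps=1e-10):
    """Symmetric KL divergence between two discrete distributions.

    Assumption: each frame is a vector of non-negative weights summing to 1.
    Uses KL(p||q) + KL(q||p) = sum_i (p_i - q_i) * log(p_i / q_i).
    """
    total = 0.0
    for pi, qi in zip(p, q):
        pi = max(pi, eps)  # floor to avoid log(0)
        qi = max(qi, eps)
        total += (pi - qi) * math.log(pi / qi)
    return total


def dtw_distance(template, test, local_dist):
    """Accumulated DTW cost between two frame sequences.

    Allowed steps (an assumption of this sketch): diagonal, vertical,
    horizontal, with no slope weights or path-length normalization.
    """
    n, m = len(template), len(test)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning template[:i] with test[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_dist(template[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # template advances
                                 cost[i][j - 1],      # test advances
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Hypothetical two-frame template and a time-stretched test segment.
frame_a = [0.9, 0.1]
frame_b = [0.1, 0.9]
stretched = dtw_distance([frame_a, frame_b], [frame_a, frame_a, frame_b], sym_kl)
swapped = dtw_distance([frame_a, frame_b], [frame_b, frame_a], sym_kl)
```

Because the local distance is passed in as a function, an LLR-style distance could be swapped in without touching the alignment code, which is the kind of comparison the highlights describe.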

Summary

Introduction

In speech recognition, hidden Markov modeling (HMM) has been the dominant approach since it provides a principled way of jointly modeling speech spectral variations and time dynamics. With today's rapid advances in computing power and memory capacity, template-based methods have been investigated for large recognition tasks, and promising results have been reported [10,11,13,14,15,16,17,18]. However, these methods are still difficult to use in large vocabulary continuous speech recognition (LVCSR) because of their need for intensive computing time and storage space. The two approaches have complementary strengths and weaknesses: HMM-based statistical models represent the speech spectral distributions of discrete states compactly but capture the fine details of speech dynamics poorly, while template matching captures the temporal evolution of speech well but demands much more computation and memory. It therefore appears plausible to integrate the two approaches so as to exploit their strengths and avoid their weaknesses.

Related work and system overview

Log likelihood ratio local distance

Maximum-likelihood-based template selection

Template compression

Conclusions

Journal: EURASIP Journal on Audio, Speech, and Music Processing
Publication Date: Feb 1, 2014
Citations: 30
License type: CC BY 2.0
