Abstract
We propose a novel approach that integrates exemplar-based template matching with statistical modeling to improve continuous speech recognition. We choose context-dependent phone segments (triphone context) as the template unit and use multiple Gaussian mixture model (GMM) indices to represent each frame of the speech templates. We investigate two local distances, log likelihood ratio (LLR) and Kullback-Leibler (KL) divergence, for dynamic time warping (DTW)-based template matching. To reduce computation and storage complexity, we also propose two methods for template selection: minimum distance template selection (MDTS) and maximum likelihood template selection (MLTS). We further propose to fine-tune the MLTS template representatives with a GMM merging algorithm so that the GMMs better represent the frames of the selected template representatives. Experimental results on the TIMIT phone recognition task and on a large vocabulary continuous speech recognition (LVCSR) task of telehealth captioning demonstrate that integrating template matching with statistical modeling significantly improves recognition accuracy over the hidden Markov modeling (HMM) baselines on both tasks. The template selection methods also provide significant accuracy gains over the HMM baseline while substantially reducing computation and storage complexity. When all templates or MDTS were used, the LLR local distance performed better than the KL local distance. For MLTS and template compression, the KL local distance performed better than the LLR local distance, and template compression further improved recognition accuracy on top of MLTS at a lower computational cost.
Highlights
In speech recognition, hidden Markov modeling (HMM) has been the dominant approach since it provides a principled way of jointly modeling speech spectral variations and time dynamics
To facilitate comparison of the templates labeled by Gaussian mixture model (GMM) indices, we propose the local distances of log likelihood ratio (LLR) and Kullback-Leibler (KL) divergence for dynamic time warping (DTW)-based template matching
On the TIMIT phone recognition task, we provide a detailed account of the factors in the proposed template matching methods that affect the rescoring performance, including local distances, number of GMMs employed for frame labeling, template selection, compression methods and their interactions with the local distances, and the percentage of selected template representatives
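To make the idea of DTW-based template matching with a KL-divergence local distance concrete, the following is a minimal sketch and not the authors' implementation. It assumes each frame is represented by a single diagonal-covariance Gaussian, given as a `(means, variances)` pair, so that the KL divergence has a closed form; the paper instead labels frames with multiple GMM indices, for which the divergence would need a different computation. The names `kl_diag_gauss`, `symmetric_kl`, and `dtw_distance` are hypothetical, and the symmetric form of the divergence and the unconstrained three-way step pattern are assumptions.

```python
import math


def kl_diag_gauss(mean_p, var_p, mean_q, var_q):
    """Closed-form KL(p || q) for two diagonal-covariance Gaussians."""
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def symmetric_kl(p, q):
    """Symmetrised KL between frames p and q, each a (means, variances) pair.

    Assumption: the symmetric sum is used so the local distance does not
    depend on which sequence is treated as the template.
    """
    return kl_diag_gauss(*p, *q) + kl_diag_gauss(*q, *p)


def dtw_distance(template, test, local_distance=symmetric_kl):
    """Accumulated DTW cost between two frame sequences.

    Uses the basic step pattern (insertion, deletion, match) with no
    slope constraints or length normalisation.
    """
    n, m = len(template), len(test)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning template[:i] with test[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_distance(template[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # template frame repeated
                                 cost[i][j - 1],      # test frame repeated
                                 cost[i - 1][j - 1])  # one-to-one match
    return cost[n][m]


if __name__ == "__main__":
    frame_a = ([0.0], [1.0])  # one-dimensional toy frames
    frame_b = ([1.0], [1.0])
    # A time-stretched copy of the template aligns at zero cost.
    print(dtw_distance([frame_a, frame_b], [frame_a, frame_a, frame_b]))
```

In this sketch a test segment that is only a time-stretched version of the template accumulates no cost, which is the property that lets template matching follow temporal evolution; swapping `local_distance` for another function is how a different local distance, such as a likelihood-ratio score, could be plugged in.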
Summary
Hidden Markov modeling (HMM) has been the dominant approach because it provides a principled way of jointly modeling speech spectral variations and time dynamics. With today’s rapid advances in computing power and memory capacity, template-based methods are being investigated for large recognition tasks, and promising results have been reported [10,11,13,14,15,16,17,18]. However, they remain difficult to use in large vocabulary continuous speech recognition (LVCSR) because of their intensive demands on computing time and storage space. The two approaches have complementary strengths and weaknesses: HMM-based statistical models are effective in compactly representing the speech spectral distributions of discrete states but ineffective in representing the fine details of speech dynamics, whereas template matching captures speech temporal evolution well but demands much greater computation and memory. It therefore appears plausible to integrate the two approaches so as to exploit their strengths and avoid their weaknesses.
More From: EURASIP Journal on Audio, Speech, and Music Processing