Abstract

In this dissertation, a novel approach that integrates template matching with statistical modeling is proposed to improve continuous speech recognition. Hidden Markov models (HMMs) have been the dominant approach in statistical speech recognition because they provide a principled way of jointly modeling the spectral variation and time dynamics of speech. However, HMMs assume that observations are conditionally independent within each state, which makes them ineffective at modeling the fine temporal evolution of speech that is important for characterizing nonstationary speech sounds. Template-based methods compare a test pattern against templates derived from training data, and are therefore better able to capture speech dynamics and the time correlation of speech frames than HMM-based methods. However, template matching requires large memory and long computation times, since the feature vectors of the training data must be stored in computer memory for access at recognition time, which is impractical for large vocabulary continuous speech recognition (LVCSR). Our proposed approach takes advantage of both statistical modeling and template matching, overcoming the weaknesses of conventional template-based methods while remaining feasible for LVCSR. We use multiple Gaussian mixture model (GMM) indices to represent each frame of the speech templates, and define the template unit to be a context-dependent phone segment (triphone context). We also use phonetic decision trees, borrowed from those commonly used with HMMs, to tie triphone templates and to predict triphones unseen in the training data. Two local distances, the log likelihood ratio (LLR) and the Kullback-Leibler (KL) divergence, are proposed for dynamic time warping (DTW) based template matching.
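To make the DTW-based matching concrete, the following is a minimal sketch of dynamic time warping with a pluggable frame-level local distance. The dissertation's LLR and KL distances operate on GMM-index representations; here a symmetrized KL divergence between discrete distributions is used only as an illustrative stand-in, and all function names are hypothetical.

```python
import numpy as np

def dtw(template, test, local_dist):
    """Dynamic time warping between two frame sequences.

    `local_dist(a, b)` is a pluggable non-negative frame distance;
    the dissertation proposes LLR- and KL-based variants.
    """
    T, U = len(template), len(test)
    # D[i, j] = minimal accumulated cost aligning the first i template
    # frames with the first j test frames
    D = np.full((T + 1, U + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, U + 1):
            cost = local_dist(template[i - 1], test[j - 1])
            # standard symmetric step pattern: insertion, deletion, match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T, U]

def symmetric_kl(p, q, eps=1e-10):
    """Symmetrized KL divergence between two discrete distributions,
    used here as a simple stand-in local distance."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```

A test pattern is then scored against each candidate triphone template with `dtw`, and the template yielding the smallest accumulated distance wins.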
To reduce computational complexity and storage space, we propose minimum distance template selection (MDTS) and maximum log-likelihood template selection (MLTS), and investigate a template compression method on top of template selection to further improve recognition performance. The template-based methods were used to rescore lattices generated by baseline HMMs on TIMIT continuous phone recognition and on a telehealth LVCSR task, and experimental results demonstrated that the proposed approach of integrating template matching with statistical modeling significantly improved recognition performance over the HMM baselines. The template selection methods also provided significant accuracy improvements over the HMM baseline while greatly reducing computation and storage costs. When all templates or MDTS were used, the LLR local distance gave better recognition performance than the KL divergence local distance; for MLTS and template compression, the KL divergence local distance performed better than LLR, and the template compression method made further improvements over KL-based MLTS. Since the templates were constructed from the GMM indices extracted from the HMM baselines, we also validated the effectiveness of the proposed template methods on enhanced HMM baselines. Experimental results showed that the LLR-based all-template method consistently improved TIMIT phone recognition accuracy over four enhanced HMM baselines. Prosodic features such as duration, energy, and pitch reflect longer-span information than conventional single-frame vectors, but they have commonly been ignored by HMMs. Template-based methods make it convenient to integrate prosodic features into speech recognition, which has not been well studied in the past.
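The abstract does not spell out the MDTS criterion, but one plausible reading is a medoid-style selection: for each triphone class, keep the template with the smallest total distance to the other templates of that class. The sketch below assumes that reading; the function name and the scalar stand-in templates are hypothetical.

```python
def select_min_distance_template(templates, pairwise_dist):
    """Medoid-style selection: return the template whose total distance
    to all other templates in the class is minimal (one plausible
    reading of minimum distance template selection, MDTS).

    `pairwise_dist` would be a DTW distance in practice; any symmetric
    non-negative distance works for illustration.
    """
    totals = [
        sum(pairwise_dist(t, u) for u in templates if u is not t)
        for t in templates
    ]
    return templates[totals.index(min(totals))]
```

Keeping only one (or a few) representative templates per tied triphone class is what cuts both the memory needed to store training frames and the number of DTW alignments at recognition time.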
In this dissertation, we investigate combining template-based methods with the prosodic features of duration, energy, and pitch to further improve speech recognition accuracy. Prosodic scores were computed by a GMM-based method and by a non-parametric method, and were combined with the acoustic scores in triphone template matching. Experimental results on the telehealth task showed that prosodic information had a positive effect on vowel sound recognition.
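As a minimal sketch of the GMM-based prosodic scoring and score combination described above: a scalar prosodic feature (e.g. segment duration) is scored as its log-likelihood under a one-dimensional Gaussian mixture, and that score is interpolated with the acoustic matching score. The interpolation weight `lam` and both function names are assumptions, not the dissertation's actual parameterization.

```python
import math

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of scalar feature x under a 1-D Gaussian mixture,
    sketching the GMM-based prosodic scoring."""
    total = 0.0
    for w, m, v in zip(weights, means, variances):
        total += w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
    return math.log(total)

def combine_scores(acoustic_score, prosodic_score, lam=0.1):
    """Weighted combination of acoustic and prosodic log-scores;
    `lam` is a hypothetical tuning weight."""
    return acoustic_score + lam * prosodic_score
```

In the non-parametric alternative mentioned in the abstract, `gmm_loglik` would be replaced by a score derived directly from the empirical distribution of the prosodic feature.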
