Abstract

Combining multiple acoustic models to improve overall acoustic model quality is a young and promising direction in Automatic Speech Recognition (ASR). Previous work on acoustic modeling of speech signals, such as Random Forests (RFs) of Phonetic Decision Trees (PDTs), has produced significant improvements in recognition accuracy. In this dissertation, several new approaches to constructing an Ensemble of Acoustic Models (EAM) for speech recognition through data sampling are proposed. A straightforward method of data sampling is Cross-Validation (CV) data partition. To improve inter-model diversity within an EAM for speaker-independent speech recognition, we propose Speaker Clustering (SC) based data sampling and develop two algorithms: Likelihood based Speaker Clustering (LSC) and speaker model Distance based Speaker Clustering (DSC). To improve base model quality as well as inter-model diversity, we further investigate the effects on the proposed ensemble acoustic models of several techniques that have proven successful in single-model training, including Cross-Validation Expectation Maximization (CVEM), Discriminative Training (DT), and Multi-Layer Perceptron (MLP) features. We also propose using an ensemble of Multiple models with Different Mixture Sizes (MDMS) to improve EAM quality. We evaluated the proposed methods on the TIMIT speaker-independent phoneme recognition task as well as on a telemedicine automatic captioning task of speaker-dependent continuous speech recognition. The proposed EAMs led to significant improvements in recognition accuracy over conventional Hidden Markov Model (HMM) baseline systems, and integrating ensemble acoustic models with CVEM, DT, and MLP also significantly improved the accuracy of CVEM, DT, and MLP based single-model systems.
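The CV data partition idea above can be sketched as follows. This is a minimal illustration, not the dissertation's exact procedure: the function name and the round-robin fold assignment are assumptions made for the example. Each of the K base models is trained on a different (K-1)/K subset of the training utterances, which is what makes the resulting base models differ from one another.

```python
# Minimal sketch (hypothetical names): K-fold CV data partition for building
# an ensemble of acoustic models. Base model i is trained on every fold
# except fold i, so each base model sees a different subset of the data.
def cv_partitions(utterance_ids, k):
    # Round-robin split of the utterance list into k folds.
    folds = [utterance_ids[i::k] for i in range(k)]
    # Training set for base model i = union of all folds except fold i.
    return [[u for j, fold in enumerate(folds) if j != i for u in fold]
            for i in range(k)]

train_sets = cv_partitions(list(range(10)), 5)
# Each of the 5 base models trains on 8 of the 10 utterances,
# and every utterance is held out of exactly one base model.
```

In a real system each element of `train_sets` would drive one HMM training run, and the resulting models would be combined at recognition time.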
We further investigated the largely unstudied factor of inter-model diversity and proposed several methods to measure it explicitly. We demonstrate a positive relationship between enlarging inter-model diversity and increasing EAM quality.

HMM-based acoustic models built from data-sampling EAMs are generally very large, especially when many base models or full covariance matrices are used for the Gaussian densities. Compacting the acoustic model to a practical size while maintaining reasonable performance is therefore needed. Toward this goal, in this dissertation, we discuss and investigate several distance measures and algorithms for clustering-based model compaction. The distance measures include entropy, Kullback-Leibler (KL), Bhattacharyya, and Chernoff distances, together with their weighted versions. For the clustering algorithms, besides conventional greedy agglomerative clustering, we propose N-Best distance Refinement (NBR), K-step LookAhead (KLA), and Breadth-First Search (BFS). Experiments on the TIMIT task show that, in comparison with the original EAM, the compacted models maintain the model accuracy while the model size is greatly reduced. Experiments in compacting an EAM on a Pashto ASR task show that the proposed clustering methods achieve better quality than conventional clustering methods.

Unlike the implicit PDT based state tying used in most ASR systems, as well as in the recent RF based PDTs, explicit PDT (EPDT) state tying, which allows Phoneme data Sharing (PS), is considered for its potential to capture pronunciation variations. The ensemble approach of combining multiple acoustic models is applied to the EPDT, where a combination of explicit and implicit PDT models is investigated to reduce phone confusions.
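To make the clustering-based compaction concrete, the sketch below computes the Bhattacharyya distance between two diagonal-covariance Gaussians, the kind of pairwise distance that greedy agglomerative clustering would minimize when merging Gaussian components. This is an illustrative implementation under the diagonal-covariance assumption; the function name is hypothetical and the dissertation's weighted variants are not shown.

```python
# Hypothetical sketch: Bhattacharyya distance between N(mu1, diag(var1)) and
# N(mu2, diag(var2)). In agglomerative compaction, the pair of Gaussian
# components with the smallest such distance would be merged first.
import math

def bhattacharyya(mu1, var1, mu2, var2):
    d = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)                            # averaged variance per dim
        d += 0.125 * (m1 - m2) ** 2 / v                # mean-separation term
        d += 0.5 * math.log(v / math.sqrt(v1 * v2))    # variance-mismatch term
    return d

# Identical Gaussians are at distance zero; separated means increase it.
same = bhattacharyya([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])  # 0.0
apart = bhattacharyya([0.0], [1.0], [2.0], [1.0])                      # 0.5
```

The distance is symmetric, so a greedy clustering loop only needs the upper triangle of the pairwise distance matrix.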
