Abstract
Speech recognition converts input sound into a sequence of phonemes and then finds the text for the input using language models. Phoneme classification performance is therefore a critical factor in the successful implementation of a speech recognition system. However, correctly distinguishing phonemes with similar characteristics remains a challenging problem even for state-of-the-art classification methods, and classification errors are difficult to recover from in the subsequent language processing steps. This paper proposes a hierarchical phoneme clustering method that applies recognition models better suited to different phonemes. The phonemes of the TIMIT database are carefully analyzed using a confusion matrix from a baseline speech recognition model. Using the automatic phoneme clustering results, a set of phoneme classification models optimized for the generated phoneme groups is constructed and integrated into a hierarchical phoneme classification method. In a number of phoneme classification experiments, the proposed hierarchical phoneme group models improved performance over the baseline by 3%, 2.1%, 6.0%, and 2.2% for fricative, affricate, stop, and nasal sounds, respectively. The average accuracy was 69.5% for the baseline and 71.7% for the proposed hierarchical models, a 2.2% overall improvement.
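The clustering step described above (grouping phonemes by how often a baseline model confuses them) can be sketched with a simple greedy agglomerative pass over the confusion matrix. This is a minimal illustration, not the paper's exact algorithm; the function name and the merge criterion (total symmetric off-diagonal confusion between groups) are assumptions for the sketch.

```python
import numpy as np

def confusion_groups(conf, n_groups):
    """Greedily merge phonemes into groups using a confusion matrix.

    conf[i, j] counts how often phoneme i was recognized as phoneme j.
    At each step, the two groups with the largest total mutual
    confusion are merged, until n_groups remain.
    """
    n = conf.shape[0]
    # Symmetric confusability: how often i and j are mistaken for each other.
    sim = conf + conf.T
    np.fill_diagonal(sim, 0)
    groups = [{i} for i in range(n)]
    while len(groups) > n_groups:
        best, bi, bj = -1.0, 0, 1
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                s = sum(sim[i, j] for i in groups[a] for j in groups[b])
                if s > best:
                    best, bi, bj = s, a, b
        groups[bi] |= groups.pop(bj)  # merge the most-confused pair
    return groups
```

In the full system, each resulting group would then get its own retrained, group-specific classifier.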
Highlights
These days, automatic speech recognition (ASR) performance has improved greatly by using deep neural networks [1,2,3,4,5]
Phonetic transcripts for all sentences are provided in the TIMIT corpus distribution
We proposed a hierarchical speech recognition model based on phoneme clustering
Summary
These days, automatic speech recognition (ASR) performance has improved greatly through the use of deep neural networks [1,2,3,4,5]. Choosing feature extraction methods and acoustic model types appropriately for confusing phonemes can help improve the final sentence recognition performance. We propose a novel method of applying phoneme-specific acoustic models for automatic speech recognition through a hierarchical phoneme classification framework. The hierarchical phoneme classification is composed of a single baseline phoneme classifier, clustering of phonemes into similar groups, and final result generation using retrained group-specific models. Typical confusable pairs are the 'd' and 't' sounds in the words 'dean' and 'teen', respectively, or the 'b' and 'p' sounds in 'bad' and 'pad'. These consonants can be distinguished by the existence of the glottal pulse, which occurs at periodic time intervals [18,19,20], so we use autocorrelation functions to add a periodicity feature of the phoneme sound when the detected phoneme falls into the consonant categories.
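The periodicity feature mentioned above can be illustrated with a short autocorrelation-based voicing score: the height of the normalized autocorrelation peak in the expected pitch-lag range is high for periodic (voiced) frames driven by glottal pulses and low for noise-like (unvoiced) frames. This is a hedged sketch; the function name, frame length, and lag range (50–400 Hz at a 16 kHz sampling rate) are assumptions, not the paper's exact configuration.

```python
import numpy as np

def periodicity(frame, sr=16000, fmin=50, fmax=400):
    """Return a 0..1 periodicity score for one audio frame.

    Computes the normalized autocorrelation and takes its maximum
    within the lag range corresponding to [fmin, fmax] Hz, where
    glottal-pulse periodicity is expected for voiced speech.
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:  # silent frame: no energy, no periodicity
        return 0.0
    ac = ac / ac[0]  # normalize so lag 0 equals 1
    lo, hi = int(sr / fmax), min(int(sr / fmin), len(ac) - 1)
    return float(ac[lo:hi].max())
```

A voiced frame (e.g. a 100 Hz tone) scores well above a white-noise frame, so the score can be appended to the feature vector for phonemes that fall into consonant categories, helping separate voiced stops like 'b' and 'd' from their unvoiced counterparts 'p' and 't'.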