Abstract

Speech recognition converts an input sound into a sequence of phonemes and then finds the text for the input using language models. Phoneme classification performance is therefore a critical factor for the successful implementation of a speech recognition system. However, correctly distinguishing phonemes with similar characteristics is still a challenging problem even for state-of-the-art classification methods, and classification errors are difficult to recover in the subsequent language processing steps. This paper proposes a hierarchical phoneme clustering method that applies recognition models better suited to different phonemes. The phonemes of the TIMIT database are carefully analyzed using a confusion matrix obtained from a baseline speech recognition model. Using the automatic phoneme clustering results, a set of phoneme classification models optimized for the generated phoneme groups is constructed and integrated into a hierarchical phoneme classification method. In a number of phoneme classification experiments, the proposed hierarchical phoneme group models improved performance over the baseline by 3%, 2.1%, 6.0%, and 2.2% for fricative, affricate, stop, and nasal sounds, respectively. The average accuracy was 69.5% for the baseline and 71.7% for the proposed hierarchical models, a 2.2% overall improvement.
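As a rough illustration of the clustering step described above, the sketch below groups phonemes whose mutual confusion rates are high. The confusion counts, the small phoneme subset, and the use of SciPy's agglomerative clustering are assumptions made for the example, not the paper's exact procedure or data.

```python
# Hypothetical confusion counts for a small phoneme subset (rows: true label,
# columns: predicted label); the paper derives a full confusion matrix from a
# TIMIT baseline recognizer.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

phonemes = ["b", "p", "d", "t", "s", "z"]
confusion = np.array([
    [50,  8,  3,  1,  0,  0],
    [ 9, 47,  1,  4,  0,  0],
    [ 4,  1, 52,  7,  0,  0],
    [ 1,  5,  8, 49,  1,  0],
    [ 0,  0,  0,  1, 55,  6],
    [ 0,  0,  0,  0,  7, 51],
], dtype=float)

# Row-normalize to confusion rates and symmetrize: two phonemes are "similar"
# if either one is frequently recognized as the other.
rates = confusion / confusion.sum(axis=1, keepdims=True)
similarity = (rates + rates.T) / 2.0

# Turn similarity into a distance and cluster hierarchically.
distance = 1.0 - similarity
np.fill_diagonal(distance, 0.0)
tree = linkage(squareform(distance, checks=False), method="average")
groups = fcluster(tree, t=3, criterion="maxclust")   # e.g. request 3 groups

for g in sorted(set(groups)):
    print(f"group {g}:", [p for p, label in zip(phonemes, groups) if label == g])
```

Averaging the row-normalized rates with their transpose makes the similarity symmetric, which is what distance-based hierarchical clustering expects; the resulting groups are then candidates for training group-specific classifiers.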

Highlights

  • These days, automatic speech recognition (ASR) performance has improved greatly by using deep neural networks [1,2,3,4,5]

  • Phonetic transcripts for all sentences are provided in the TIMIT corpus distribution

  • We propose a hierarchical speech recognition model based on phoneme clustering

Summary

Introduction

These days, automatic speech recognition (ASR) performance has improved greatly through the use of deep neural networks [1,2,3,4,5]. Choosing feature extraction methods and acoustic model types appropriately for confusable phonemes can help improve the final sentence recognition performance. We propose a novel method of applying phoneme-specific acoustic models to automatic speech recognition through a hierarchical phoneme classification framework. The hierarchical classification consists of a single baseline phoneme classifier, clustering of phonemes into similar groups, and final result generation using retrained group-specific models. Typical confusions include the ‘d’ and ‘t’ sounds in the words ‘dean’ and ‘teen’, or the ‘b’ and ‘p’ sounds in ‘bad’ and ‘pad’. These consonants can be distinguished by the presence of the glottal pulse, which occurs at periodic time intervals [18,19,20]; we therefore use autocorrelation functions to add a periodicity feature when a detected phoneme falls into one of the consonant categories.
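As a minimal sketch of such a periodicity cue, the function below takes the peak of the normalized autocorrelation over lags corresponding to plausible glottal-pulse periods: a periodic (voiced) frame yields a high score, an aperiodic (unvoiced) frame a low one. The frame length, sampling rate, and pitch range are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def periodicity_feature(frame, sample_rate=16000, f0_min=60.0, f0_max=400.0):
    """Peak of the normalized autocorrelation over plausible pitch lags,
    used here as a rough voiced/unvoiced score in the range 0..1."""
    frame = frame - np.mean(frame)
    if not np.any(frame):
        return 0.0
    # Autocorrelation for non-negative lags, normalized so lag 0 equals 1.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    acf = acf / acf[0]
    lag_min = int(sample_rate / f0_max)            # shortest glottal period
    lag_max = min(int(sample_rate / f0_min), len(acf) - 1)
    return float(np.max(acf[lag_min:lag_max + 1]))

# Toy check: a periodic (voiced-like) frame scores higher than noise.
t = np.arange(0, 0.03, 1 / 16000)                  # 30 ms frame
voiced = np.sin(2 * np.pi * 120 * t)               # 120 Hz periodic signal
unvoiced = np.random.default_rng(0).standard_normal(t.size)
print(periodicity_feature(voiced), periodicity_feature(unvoiced))
```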

Phoneme Clustering
Phonemes
Baseline Phoneme Recognition with TIMIT Dataset
Confusion Matrix
Phoneme Clustering Using Confusion Matrix
Hierarchical Phoneme Classification
Overall Architecture
Vowels and Mixed Phoneme Classification
Varying Analysis Window Sizes for Consonants
Voiced and Unvoiced Consonants Classification
TIMIT Database
Various Window Sizes
Phoneme Group Model Training
Performance of the Hierarchical Classification
Discussion
