Abstract

We propose an adaptive frame length speech analysis scheme that divides the speech signal into stationary and dynamic regions. Long-frame analysis is used for stationary speech and short-frame analysis for dynamic speech. For computational convenience, the feature vector of a short frame is designed to be identical to that of a long frame, and two expressions are derived to represent the feature vector of short frames. Word recognition experiments on the TIMIT and NON-TIMIT databases with discrete hidden Markov models (DHMM) and continuous-density hidden Markov models (CHMM) showed a steady performance improvement in open-set testing. On the TIMIT database, the adaptive frame length (AFL) approach achieves error reduction rates of 4.47% to 11.21% for DHMM and 4.54% to 9.58% for CHMM. On the NON-TIMIT database, AFL achieves error reduction rates of 1.91% to 11.55% for DHMM and 2.63% to 9.5% for CHMM. These results demonstrate the effectiveness of the proposed adaptive frame length feature extraction scheme, especially in open testing, which is a practical measure for evaluating the performance of a speech recognition system.

Highlights

  • To date, the most successful speech recognition systems mainly use hidden Markov models (HMMs) for acoustic modeling

  • On the TIMIT database, the adaptive frame length (AFL) approach gave error reductions of 4.54% to 9.58% for the continuous HMM (CHMM) and 4.47% to 11.21% for the discrete HMM (DHMM)

  • With the DHMM-based recognizer, FS2 of TIMIT gave the highest error reduction in the open test (11.21%) and the lowest in the closed test (−1.02%)


Summary

Introduction

The most successful speech recognition systems mainly use hidden Markov models (HMMs) for acoustic modeling. Frame-based feature analysis of speech signals is a well-established and very successful technique. In this method, the speech samples are blocked into frames of N samples, with adjacent frames separated by M samples. N is usually set to the number of samples in 30–45 ms of signal, and M to N/3 [8]. This procedure is based on the assumption that the speech signal can be considered quasi-stationary when examined over a sufficiently short period of time (between 5 and 100 ms). To reduce the discontinuities associated with windowing, pitch-synchronous speech processing may be used [9, 10]. This technique is mainly used for speech synthesis and rate-reduction speech coding.
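The fixed-length frame blocking described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 16 kHz sampling rate, and the choice to drop trailing samples that do not fill a whole frame are assumptions made for illustration only.

```python
def frame_params(sample_rate_hz, frame_ms=30.0):
    """Return (N, M): frame length in samples and frame shift M = N // 3.

    frame_ms is the frame duration; the text gives 30-45 ms as typical.
    """
    n = int(round(sample_rate_hz * frame_ms / 1000.0))
    m = max(1, n // 3)
    return n, m


def block_into_frames(samples, frame_len, frame_shift):
    """Split a sample sequence into overlapping frames.

    frame_len is N (samples per frame) and frame_shift is M (samples
    between the starts of adjacent frames). Trailing samples that do not
    fill a whole frame are dropped (an assumption of this sketch).
    """
    if frame_len <= 0 or frame_shift <= 0:
        raise ValueError("frame_len and frame_shift must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += frame_shift
    return frames


# Example: assumed 16 kHz sampling rate and 30 ms frames.
N, M = frame_params(16000, 30.0)
signal = list(range(1600))  # 100 ms of dummy samples
frames = block_into_frames(signal, N, M)
```

With these assumed values, N is 480 samples and M is 160 samples, so adjacent frames overlap by two thirds of their length. An adaptive scheme would vary N by region instead of fixing it for the whole utterance.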

