Abstract
We propose an adaptive frame length speech analysis scheme that divides the speech signal into stationary and dynamic regions. Long-frame analysis is used for stationary speech and short-frame analysis for dynamic speech. For computational convenience, the feature vector of a short frame is designed to be identical to that of a long frame. Two expressions are derived to represent the feature vector of short frames. Word recognition experiments on the TIMIT and NON-TIMIT databases with a discrete hidden Markov model (DHMM) and a continuous density HMM (CHMM) showed that a steady performance improvement can be achieved for open-set testing. On the TIMIT database, the adaptive frame length (AFL) approach achieves error reduction rates ranging from 4.47% to 11.21% for DHMM and from 4.54% to 9.58% for CHMM. On the NON-TIMIT database, AFL achieves error reduction rates ranging from 1.91% to 11.55% for DHMM and from 2.63% to 9.5% for CHMM. These results demonstrate the effectiveness of the proposed adaptive frame length feature extraction scheme, especially for open-set testing, which is the practical measure of a speech recognition system's performance.
Highlights
To date, the most successful speech recognition systems mainly use hidden Markov models (HMMs) for acoustic modeling
With the TIMIT database, the adaptive frame length (AFL) approach gave error reductions ranging from 4.54% to 9.58% for continuous HMM (CHMM) and from 4.47% to 11.21% for discrete HMM (DHMM)
With the DHMM-based recognizer, FS2 of TIMIT gave the highest error reduction in the open test (11.21%) and the lowest in the closed test (−1.02%)
Summary
The most successful speech recognition systems mainly use hidden Markov models (HMMs) for acoustic modeling. Frame-based feature analysis of speech signals has been accepted as a very successful technique. In this method, time-domain speech samples are blocked into frames of N samples, with adjacent frames separated by M samples. N is usually set to the number of samples in 30–45 ms of signal, and M to N/3 [8]. This procedure is based on the assumption that the speech signal can be considered quasi-stationary when examined over a sufficiently short period of time (between 5 and 100 ms). To reduce the discontinuities associated with windowing, pitch-synchronous speech processing may be used [9, 10]. This technique is mainly used for speech synthesis and rate-reduction speech coding.
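The frame blocking described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `block_into_frames`, the 16 kHz sampling rate, the 30 ms frame length, and the synthetic placeholder signal are all assumptions made for illustration. Trailing samples that do not fill a whole frame are simply dropped here, which is only one of several possible conventions.

```python
def block_into_frames(samples, frame_len, frame_shift):
    """Split samples into frames of frame_len samples.

    Adjacent frames start frame_shift samples apart, so they overlap
    whenever frame_shift < frame_len. Trailing samples that cannot fill
    a complete frame are discarded (an assumption of this sketch).
    """
    if frame_len <= 0 or frame_shift <= 0:
        raise ValueError("frame_len and frame_shift must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames


# Assumed values for illustration only.
SAMPLE_RATE_HZ = 16000                   # assumed sampling rate
FRAME_MS = 30                            # lower end of the 30-45 ms range
N = SAMPLE_RATE_HZ * FRAME_MS // 1000    # samples per frame
M = N // 3                               # shift between adjacent frames

# One second of placeholder "signal" (sample indices, not real speech).
signal = list(range(SAMPLE_RATE_HZ))
frames = block_into_frames(signal, N, M)

print(N, M, len(frames))  # 480 160 98
```

With M = N/3, each frame shares two thirds of its samples with the previous one. A fixed N is the conventional setting that the adaptive scheme relaxes by switching between a long and a short frame length.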