Abstract

Spectral/temporal features for automatic speech recognition are conventionally designed on the assumption that frequency resolution should be emphasized over temporal resolution. Accordingly, Mel frequency cepstral coefficients are typically computed using an approximately 25‐ms frame length with a 10‐ms frame spacing, with 3–5 frames used to represent temporal derivative information. In phone recognition experiments on the TIMIT database, using discrete cosine transform coefficients for spectral information and discrete cosine series coefficients for their temporal evolution, substantially higher phone accuracies were obtained with much shorter frame lengths (8 ms), much shorter frame spacings (2 ms), and much longer time intervals for capturing spectral tracks (on the order of 500 ms). Experimental results are given for a variety of conditions. These results imply that spectral/temporal evolution features that emphasize temporal aspects are of great importance for automatic speech recognition.
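
To make the two framing regimes concrete, the following sketch counts how many frames each one produces. It is only illustrative arithmetic, not the authors' feature extraction code. The 16 kHz sampling rate, the 1-second signal length, the 50-ms context used for the conventional setup (5 frames at 10-ms spacing), and all function names are assumptions introduced here.

```python
def ms_to_samples(duration_ms, sample_rate_hz):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(duration_ms * sample_rate_hz / 1000))


def frame_starts(num_samples, frame_len, frame_step):
    """Start indices of every full frame that fits inside the signal."""
    if num_samples < frame_len:
        return []
    return list(range(0, num_samples - frame_len + 1, frame_step))


def framing_summary(frame_len_ms, frame_step_ms, context_ms,
                    sample_rate_hz=16000, signal_ms=1000):
    """Summarize one framing setup (hypothetical helper for illustration).

    sample_rate_hz and signal_ms are assumed values, not from the abstract.
    """
    frame_len = ms_to_samples(frame_len_ms, sample_rate_hz)
    frame_step = ms_to_samples(frame_step_ms, sample_rate_hz)
    num_samples = ms_to_samples(signal_ms, sample_rate_hz)
    starts = frame_starts(num_samples, frame_len, frame_step)
    return {
        "frame_len_samples": frame_len,
        "frame_step_samples": frame_step,
        "frames_per_signal": len(starts),
        # Number of frame steps spanned by the temporal context window.
        "frames_per_context": context_ms // frame_step_ms,
    }


# Conventional setup: 25-ms frames, 10-ms spacing, about 5 frames of context.
conventional = framing_summary(25, 10, context_ms=50)
# Setup reported in the abstract: 8-ms frames, 2-ms spacing, ~500-ms context.
short_frames = framing_summary(8, 2, context_ms=500)

print(conventional)
print(short_frames)
```

Under these assumptions, the short-frame setup yields roughly five times as many frames per second of speech, and each 500-ms spectral track spans 250 frame steps rather than 5, which is why a compact series representation of the trajectory is attractive.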
