Abstract

Local spectral distortion measures are commonly used to quantify the similarity (or spectral distance) between two short-time spectra. In this study we compared several spectral distortion measures, including the Itakura-Saito distortion measure, the log likelihood ratio (LLR) distortion measure, the likelihood ratio (LR) distortion measure, the cepstral (CEP) distortion measure, and two proposed perceptually based distortion measures, the weighted likelihood ratio (WLR) and the weighted slope metric (WSM) distortion measures, in terms of their effects on the performance of a standard dynamic time warping (DTW) based, isolated-word speech recognizer. Two modifications of the basic form of each measure were also investigated, namely a Bark-scale frequency warping and the incorporation of suprasegmental energy information. All distortion measures and their modifications were tested on an alpha-digit vocabulary, 4-talker, telephone recording data base. The results can be summarized as follows: (1) All LPC-based distortion measures performed reasonably well. The log likelihood ratio and weighted slope metric distortion measures gave the highest recognition accuracy, while the Itakura-Saito distortion measure gave the lowest score; (2) Whereas the addition of suprasegmental energy information helped the recognition performance, the use of gain and absolute loudness degraded the performance; (3) Bark-scale frequency warping did not, at least for the highly bandlimited telephone data base we tested, perform as well as its unwarped counterpart; (4) The weighted likelihood ratio distortion measure did not perform as well as its unweighted counterpart.
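
To make the role of the local distortion measure concrete, the following is a minimal sketch of a DTW-based isolated-word recognizer with a pluggable frame-level distance. It is not the recognizer evaluated in the study: the abstract does not specify the path constraints, endpoint handling, or normalization used, so the sketch assumes a simple symmetric step pattern and path-length normalization by the sum of the two sequence lengths. The function names (`cepstral_distance`, `dtw_distance`, `recognize`) are hypothetical, and the cepstral measure is taken here to be a squared Euclidean distance between truncated cepstral coefficient vectors.

```python
import math


def cepstral_distance(c1, c2):
    """Squared Euclidean distance between two cepstral coefficient vectors.

    Assumption: both vectors are truncated to the same number of coefficients.
    """
    return sum((a - b) ** 2 for a, b in zip(c1, c2))


def dtw_distance(test, ref, local_dist=cepstral_distance):
    """Accumulated local distortion along the best warping path.

    test, ref: sequences of per-frame feature vectors.
    Assumption: unconstrained symmetric steps (1,0), (0,1), (1,1);
    the result is normalized by len(test) + len(ref).
    """
    n, m = len(test), len(ref)
    # D[i][j] = minimum accumulated cost aligning test[:i] with ref[:j]
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = local_dist(test[i - 1], ref[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m] / (n + m)


def recognize(test, templates, local_dist=cepstral_distance):
    """Return the label of the reference template closest to the test token."""
    return min(
        templates,
        key=lambda word: dtw_distance(test, templates[word], local_dist),
    )


# Toy usage with made-up two-coefficient frames (illustrative data only).
templates = {
    "a": [[0.0, 0.0], [1.0, 1.0]],
    "b": [[5.0, 5.0], [6.0, 6.0]],
}
token = [[0.1, 0.0], [0.9, 1.0], [1.0, 1.0]]
best = recognize(token, templates)
```

Swapping `local_dist` for another frame-level function (for example an LLR or WSM implementation) is how the measures compared in the study would be exchanged within an otherwise fixed recognizer under this sketch.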
