A systematic study of a speaker-independent vowel recognition model has been performed. A Karhunen-Loève Transformation (KLT), also known as Principal Component Analysis, was applied subsequent to a spectral analysis of the speech signal by 18 non-overlapping critical-band filters. Four experiments were conducted using selected segments of 8 isolated Putonghua (Mandarin) vowels, spoken twice in 5 tones by 38 females and 13 males. The first experiment used the same speech sample for training and testing to evaluate the effects of the KLT, speaker normalization, the distance metric, and the number of vowel classes. A modified Mahalanobis distance coupled with a 7-class condition was found to give the best performance. In the next experiment, one sample was used to train the model, and another trial of the same speech, spoken by the same group of speakers, was used to test it. It was found that, in general, a sex-specific and tone-specific procedure could be avoided without significant loss in performance. The third experiment repeatedly trained the model with 50 speakers and tested it with the remaining one until all 51 speakers had been tested. Under this stringent condition, an average recognition rate of 88.2% was achieved using only 4 classificatory dimensions. In the last experiment, all segments of a vowel were labelled under the most stringent conditions. The model was confirmed to perform well for one male and one female speaker selected at random. Also, the vowel that had caused the greatest confusion was found to be well recognized when treated as an allophone of another vowel. Finally, the possibility of extending the present technique to diphthong recognition is discussed together with some preliminary results.
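For orientation, the sketch below shows one plausible form of the pipeline the abstract describes: 18-dimensional critical-band spectra are projected onto a small number of KLT (PCA) dimensions, and each frame is assigned to the vowel class with the smallest Mahalanobis distance. All function and variable names are illustrative, and a plain Mahalanobis distance is used rather than the paper's modified variant, whose details are not given in the abstract.

```python
# Illustrative sketch only: KLT/PCA reduction of 18 critical-band energies
# followed by minimum-Mahalanobis-distance vowel classification.
import numpy as np

def fit_klt(X, n_dims=4):
    """Estimate a KLT (PCA) basis from training spectra X of shape (n_samples, 18)."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_dims]   # keep the leading components
    return mean, eigvecs[:, order]

def project(X, mean, basis):
    """Project spectra onto the retained KLT dimensions."""
    return (X - mean) @ basis

def fit_classes(Z, labels):
    """Per-vowel mean and inverse covariance in the reduced space."""
    stats = {}
    for v in np.unique(labels):
        Zv = Z[labels == v]
        stats[v] = (Zv.mean(axis=0), np.linalg.inv(np.cov(Zv, rowvar=False)))
    return stats

def classify(z, stats):
    """Assign the vowel whose class statistics give the smallest Mahalanobis distance."""
    def dist(v):
        mu, inv_cov = stats[v]
        d = z - mu
        return d @ inv_cov @ d
    return min(stats, key=dist)
```

In such a setup, training would call `fit_klt` and `fit_classes` on labelled vowel segments, and the leave-one-speaker-out evaluation described in the third experiment would simply repeat this fit on 50 speakers and call `classify` on frames from the held-out speaker.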