Abstract

Conclusions about variation and change in vowels based on the first (F1) and second (F2) formant frequencies may be challenged by at least two major drawbacks intrinsic to this simple 2-D representation. First, F1 and F2 measured at one or two representative points can capture only a limited amount of vowel dynamics. Second, the numeric values of F1 and F2 are prone to errors due to aperiodicity in the speech signal. In this study, we explore alternative ways to obtain more informative and robust vowel representations. We examine both LPC- and MFCC-based features using two encoding methods. Frequency-domain features are first estimated at each time step over the entire vowel duration. The resulting t feature vectors are then passed sequentially through either a set of t independent autoencoders or a recurrent neural network (RNN), both with k hidden units. In the case of the autoencoders, each k-by-t vowel matrix is further projected to a lower-dimensional space using PCA. In the case of the RNN, the k-dimensional hidden vectors are used for cross-condition comparisons. Compared to the simple F1-F2 measurement, our methods capture more nuanced within-category variation across dialect varieties in TIMIT. Results from the LPC- and MFCC-based representations are also compared.
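The RNN encoding described above can be illustrated with a minimal numpy sketch: t frame-level feature vectors are consumed one per time step, and the final k-dimensional hidden state serves as the vowel embedding. The tanh Elman recurrence and the random weights below are illustrative assumptions, not the paper's trained model.

```python
import numpy as np

def rnn_encode(features, W_in, W_rec, b):
    """Encode a (t, d) sequence of frame-level feature vectors into a
    single k-dimensional hidden vector via a tanh Elman recurrence."""
    h = np.zeros(W_rec.shape[0])
    for x in features:                      # one feature vector per time step
        h = np.tanh(W_in @ x + W_rec @ h + b)
    return h                                # final hidden state = vowel embedding

# Illustration with random weights (a trained model would supply these).
rng = np.random.default_rng(0)
t, d, k = 20, 13, 8                         # frames, feature dims (e.g. MFCCs), hidden units
X = rng.standard_normal((t, d))             # stand-in for per-frame LPC/MFCC features
h = rnn_encode(X,
               rng.standard_normal((k, d)) * 0.1,
               rng.standard_normal((k, k)) * 0.1,
               np.zeros(k))
print(h.shape)  # (8,)
```

The same interface applies to either feature type: only the per-frame dimensionality d changes between the LPC- and MFCC-based inputs.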
