Abstract

Automatic dialect classification has gained interest in the field of speech research because of its importance in characterizing speaker traits and estimating speaker knowledge, both of which could improve integrated speech technology (e.g., speech recognition, speaker recognition). This study addresses novel advances in unsupervised spontaneous dialect classification in English and Spanish. The problem considers the case where no transcripts are available for training and test data, and speakers are talking spontaneously. The Gaussian mixture model (GMM) is used for unsupervised dialect classification in our study. Techniques are proposed that deal with confused acoustic regions in the GMMs, where the confused regions are identified through data-driven methods. The first technique excludes confused regions by finding dialect dependence in the untranscribed audio and selecting the most discriminative Gaussian mixtures [mixture selection (MS)]. The second technique retains the confused regions in the model but balances them over all classes; it is implemented by identifying discriminative frames and confused frames in the audio data [frame selection (FS)]. The balanced confused regions contribute to model representation but do not impact classification performance. The third technique reduces the confused regions in the original model; minimum classification error (MCE) training is applied to achieve this objective. All three techniques implement discriminative training for GMM-based classification. Both the first technique (MS-GMM, GMM trained with mixture selection) and the second technique (FS-GMM, GMM trained with frame selection) improve dialect classification performance. Further improvement is achieved by applying the third technique (MCE training) before the first or second technique. The system is evaluated on British English dialects and Latin American Spanish dialects, and measurable improvement is achieved on both corpora. Finally, the system is compared with human listener performance and shown to outperform human listeners in terms of classification accuracy.
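The baseline the abstract builds on (one GMM per dialect, classification by summed frame log-likelihood) and the mixture-selection idea can be sketched as follows. This is an illustrative sketch only: the selection criterion used here (comparing a component's average occupancy on own-class versus other-class data) is a hypothetical stand-in for the paper's data-driven method, and the synthetic two-dialect data, scikit-learn `GaussianMixture` models, and all thresholds are assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic "acoustic frames" for two dialects: both share a confusable
# region near the origin, and each has one dialect-specific region.
shared = rng.normal(0.0, 1.0, size=(500, 2))
train = {
    "A": np.vstack([shared, rng.normal([4.0, 0.0], 0.5, size=(500, 2))]),
    "B": np.vstack([shared, rng.normal([-4.0, 0.0], 0.5, size=(500, 2))]),
}

# Baseline: one GMM per dialect class.
gmms = {d: GaussianMixture(n_components=4, random_state=0).fit(X)
        for d, X in train.items()}

def classify(frames, models, keep=None):
    """Label an utterance by the dialect GMM with the highest total
    frame log-likelihood, optionally restricted to selected mixtures."""
    scores = {}
    for d, g in models.items():
        # Per-component weighted log-density via the public API:
        # posterior_k(x) * p(x) = w_k * N(x | mu_k, Sigma_k).
        log_comp = (np.log(g.predict_proba(frames) + 1e-12)
                    + g.score_samples(frames)[:, None])
        cols = keep[d] if keep is not None else slice(None)
        # Sum over kept components (weights not renormalized in this sketch).
        scores[d] = np.logaddexp.reduce(log_comp[:, cols], axis=1).sum()
    return max(scores, key=scores.get)

# Mixture selection (hypothetical criterion): call a component "confused"
# when the other dialect's data occupies it about as much as own-class
# data does, and keep only the discriminative components.
keep = {}
for d, g in gmms.items():
    other = np.vstack([X for k, X in train.items() if k != d])
    occ_own = g.predict_proba(train[d]).mean(axis=0)
    occ_other = g.predict_proba(other).mean(axis=0)
    idx = np.where(occ_own - occ_other > 0.0)[0]
    if idx.size == 0:           # fall back to the full model if nothing survives
        idx = np.arange(g.n_components)
    keep[d] = idx

test = rng.normal([4.0, 0.0], 0.5, size=(200, 2))  # dialect-A-like frames
print(classify(test, gmms))              # baseline decision
print(classify(test, gmms, keep=keep))   # decision with mixture selection
```

Dropping the confused mixtures concentrates each class score on the dialect-dependent acoustic regions, which is the intuition behind MS-GMM; FS-GMM instead filters at the frame level before training, and MCE sharpens the original model's decision boundaries directly.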
