Abstract

BBN's baseline language identification (LID) system tokenizes utterances with an English hidden Markov model (HMM) phone recognizer and uses language-dependent phone-bigram models to discriminate between languages. This is clearly a suboptimal procedure, as English phone models may fail to provide a meaningful tokenization of non-English speech. We address this problem through the use of parametric acoustic segment models derived from untranscribed target-language training data. The paper describes some promising exploratory experiments related to model-segment selection and LID based on nearest-neighbor classification, as well as an LID system in which language-specific HMMs are trained from unsupervised clustering of parametric acoustic segments. In addition, it describes an experiment in which phone HMMs trained on CallHome Mandarin are added to the baseline system, resulting in an error reduction of 30% on pairwise language discrimination on the OGI-TS corpus.
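The baseline architecture described above (phone recognition followed by language-dependent phone-bigram scoring, often called PRLM) can be sketched as follows. This is not the paper's implementation; it is a minimal illustration assuming the recognizer has already emitted a phone-token sequence, with toy smoothing (add-alpha) and a hypothetical `identify` helper:

```python
import math
from collections import Counter

class PhoneBigramLM:
    """Add-alpha smoothed phone-bigram language model.

    Illustrative only: a real LID system would use the tokenizer's
    full phone inventory and tuned smoothing, not these defaults.
    """

    def __init__(self, training_seqs, vocab, alpha=0.5):
        self.vocab = vocab
        self.alpha = alpha
        self.bigrams = Counter()   # counts of (prev_phone, phone)
        self.context = Counter()   # counts of prev_phone as context
        for seq in training_seqs:
            padded = ["<s>"] + seq  # sentence-start context
            for prev, cur in zip(padded, seq):
                self.bigrams[(prev, cur)] += 1
                self.context[prev] += 1

    def logprob(self, seq):
        """Log-likelihood of a phone sequence under this bigram model."""
        lp = 0.0
        padded = ["<s>"] + seq
        for prev, cur in zip(padded, seq):
            num = self.bigrams[(prev, cur)] + self.alpha
            den = self.context[prev] + self.alpha * len(self.vocab)
            lp += math.log(num / den)
        return lp

def identify(phone_seq, models):
    """Pick the language whose bigram model best explains the tokens."""
    return max(models, key=lambda lang: models[lang].logprob(phone_seq))
```

In the baseline system all languages share one (English) tokenizer, so discrimination rests entirely on these per-language bigram scores; the paper's point is that this tokenization may be poor for non-English input.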
