Automatic spoken language identification (LID) is the task of identifying the language of a short utterance spoken by an unknown speaker. The most successful approach to LID uses phone recognizers of several languages in parallel [Zissman, M.A., 1996. Comparison of four approaches to automatic language identification of telephone speech. IEEE Trans. Speech Audio Process. 4 (1), 31–44]. The basic requirement for building a parallel phone recognition (PPR) system is segmented and labeled speech corpora. In this paper, a novel approach to the LID task is proposed which uses parallel syllable-like unit recognizers in a framework similar to the PPR approach in the literature. The difference is that the sub-word unit models for each of the languages to be recognized are generated in an unsupervised manner, without the use of segmented and labeled speech corpora. The training data of each language is first segmented into syllable-like units, and a language-dependent inventory of syllable-like units is created. These syllable-like units are then clustered using an incremental approach, resulting in a set of syllable-like unit models for each language. Using these language-dependent syllable-like unit models, language identification is performed based on accumulated acoustic log-likelihoods. Our initial results on the Oregon Graduate Institute Multi-language Telephone Speech Corpus [Muthusamy, Y.K., Cole, R.A., Oshika, B.T., 1992. The OGI multi-language telephone speech corpus. In: Proceedings of Internat. Conf. Spoken Language Process., October 1992, pp. 895–898] show a performance of 72.3%. We further show that if only a subset of syllable-like unit models that are unique (in some sense) is considered, the performance improves to 75.9%.
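The following is a minimal sketch of the accumulated-log-likelihood decision rule described above, not the paper's implementation: each language's syllable-like unit models are stood in for by diagonal-covariance Gaussians over feature frames, and taking the per-frame maximum over units is an assumed simplification of the HMM-based syllable-like unit recognizers actually used. All names (`identify_language`, `accumulated_score`, the feature shapes) are hypothetical and chosen only for illustration.

```python
import numpy as np


def gaussian_log_likelihood(frames, mean, var):
    """Per-frame log-likelihood under a diagonal-covariance Gaussian.

    frames: (T, D) array of feature vectors; mean, var: (D,) arrays.
    Stands in for the acoustic score of one syllable-like unit model.
    """
    diff = frames - mean
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + diff**2 / var, axis=1)


def accumulated_score(frames, unit_models):
    """Accumulate, over all frames, the score of the best-matching unit."""
    per_unit = np.stack([gaussian_log_likelihood(frames, m, v)
                         for m, v in unit_models])   # shape: (units, T)
    return per_unit.max(axis=0).sum()


def identify_language(frames, models_by_language):
    """Return the language whose unit models give the highest accumulated
    acoustic log-likelihood, together with all per-language scores."""
    scores = {lang: accumulated_score(frames, units)
              for lang, units in models_by_language.items()}
    return max(scores, key=scores.get), scores


# Toy usage with random data in place of real MFCC frames and trained models.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 13))                  # e.g. 200 feature frames
models = {
    "english": [(rng.normal(size=13), np.ones(13)) for _ in range(5)],
    "hindi":   [(rng.normal(size=13), np.ones(13)) for _ in range(5)],
}
best_language, all_scores = identify_language(frames, models)
```

Restricting `models_by_language` to the subset of units considered unique to each language corresponds to the variant reported to improve performance in the abstract.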