Abstract

Research in language identification (LID) requires a corpus of multilingual speech data to capture the distinguishing information within and across languages. Over the past few decades, many statistical approaches to language identification have been developed on two common public-domain corpora, which consist of telephone speech from about 26 languages and dialects. However, China's minority languages have not yet been used as target languages in published work. In our work, we select nine typical minority languages of China, together with Mandarin, to construct a telephone speech corpus. The minority languages are Naxi, Miao, Bai, Dai, Yi, Zhuang, Uygur, Mongolian and Tibetan (7); each represents its minority nationality. The corpus can be used to study, develop, evaluate and compare minority language identification algorithms. Moreover, it should encourage linguistic researchers to pay more attention to the long history and splendid culture of China's national minorities. Based on this speech corpus, which has no phonetic transcription, we study the identification of China's minority languages and design training and recognition algorithms for them. Using the corpus, we will propose LID methods for all of China's minority languages with Chinese loanwords. The work will promote the research and application of language identification technology in China.

The rest of this paper is organized as follows. Section two briefly reviews the common public-domain corpora and language recognition evaluation. Section three describes the speech collection process. Section four describes our minority language corpus. Section five introduces the post-processing applied to the corpus. Finally, we present conclusions and proposals for future work.
