Towards improving the performance of language identification system for Indian languages

Abitha Anto, K T Sreekumar, P C Reghu Raj, C Santhosh Kumar

doi:10.1109/compsc.2014.7032618

Abstract

In this paper, we present the details of a phonotactic language identification (LID) system developed for five Indian languages: English (Indian), Hindi, Malayalam, Tamil and Kannada. Since there are no publicly available speech databases for English, Malayalam and Kannada, we built a database for each of the target languages by downloading audio from YouTube videos and manually removing the non-speech signals. The system was tested on a data set of 40 utterances of 30, 10 and 3 s duration in each of the 5 target languages. Following the NIST benchmarking sessions, performance was evaluated separately for the 30 s, 10 s and 3 s segments. The baseline system, tested with a 3-gram language model, gave an overall equal error rate (EER) of 10.41 %, 19.56 % and 31.45 % for the 30, 10 and 3 s segments, respectively. The use of a 4-gram language model brought the EER to 9.81 %, 19.38 % and 32.77 % for the 30, 10 and 3 s test segments, respectively, an improvement for the 30 s and 10 s segments. Further, by using n-gram smoothing, we were able to improve the EER of the LID system to 9.02 %, 18.70 % and 29.24 % for 3-gram language models and 8.88 %, 16.46 % and 32.03 % for 4-gram language models, for the 30, 10 and 3 s test segments, respectively. The study shows that the use of 4-gram language models can help enhance the performance of LID systems for Indian languages.
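As an illustration of the phonotactic approach, the sketch below scores a decoded phone sequence against one smoothed n-gram model per language and picks the best-scoring language. It is a minimal sketch under stated assumptions, not the system described above: the abstract does not say which smoothing method or phone recognizer was used, so add-k smoothing, the class `PhoneNgramModel`, the function `identify` and the toy phone strings are all hypothetical choices made here.

```python
import math
from collections import Counter


def ngrams(phones, n):
    """Return the n-grams of a phone sequence padded with start/end markers."""
    padded = ["<s>"] * (n - 1) + list(phones) + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


class PhoneNgramModel:
    """Phone n-gram model with add-k smoothing (an assumed smoothing method)."""

    def __init__(self, n, k=1.0):
        self.n = n
        self.k = k
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = {"</s>"}

    def train(self, sequences):
        """Accumulate n-gram and context counts from training phone sequences."""
        for phones in sequences:
            self.vocab.update(phones)
            for gram in ngrams(phones, self.n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def log_prob(self, phones):
        """Average smoothed log-probability per n-gram of a phone sequence."""
        v = len(self.vocab)
        grams = ngrams(phones, self.n)
        total = 0.0
        for gram in grams:
            num = self.ngram_counts[gram] + self.k
            den = self.context_counts[gram[:-1]] + self.k * v
            total += math.log(num / den)
        return total / len(grams)


def identify(phones, models):
    """Return the language whose model gives the highest score."""
    return max(models, key=lambda lang: models[lang].log_prob(phones))


# Toy example: hypothetical phone strings, not data from the paper.
models = {"lang_a": PhoneNgramModel(n=3), "lang_b": PhoneNgramModel(n=3)}
models["lang_a"].train([list("ababab")])
models["lang_b"].train([list("bbaabb")])
print(identify(list("abab"), models))
```

Setting `n=4` gives the 4-gram variant. Higher orders leave more n-grams unseen in training, which is where smoothing matters most, since without it a single unseen n-gram would drive the sequence probability to zero.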

Full Text