Abstract

Spoken language identification is the automatic recognition of the language spoken in an utterance. It is commonly used in speech translation systems, multilingual speech recognition, and speaker diarization. In this paper, spoken language identification based on deep learning (DL) and the i-vector paradigm is presented. Specifically, a comparative study is reported, consisting of language identification experiments using deep neural networks (DNN) and convolutional neural networks (CNN); the integration of the two methods into a complete system is also investigated. Previous studies have demonstrated the effectiveness of DNN in spoken language identification; however, to date, the combination of CNN and i-vectors for language identification has not been investigated. The main advantage of a CNN is that it requires fewer parameters than a DNN and is therefore cheaper in terms of memory and computational power. The proposed methods are evaluated on the NIST 2015 i-vector Machine Learning Challenge task of recognizing 50 in-set languages. The DNN achieved an equal error rate (EER) of 3.55%, the CNN achieved an EER of 3.48%, and the fusion of the DNN and CNN systems yielded an EER of 3.3%. These results are very promising and demonstrate the effectiveness of combining CNN and i-vectors in spoken language identification. The proposed methods were also compared to a baseline based on support vector machines (SVM) and demonstrated significantly superior performance.
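
As a rough illustration of the kind of system the abstract describes, the sketch below defines a small fully connected (DNN) classifier and a 1-D CNN classifier over fixed-length i-vectors and fuses their scores by averaging posteriors. All dimensions, layer sizes, and the fusion rule are illustrative assumptions (e.g., 400-dimensional i-vectors), not the configuration reported in the paper; PyTorch is used only for concreteness.

```python
import torch
import torch.nn as nn

IVEC_DIM = 400      # assumed i-vector dimensionality (illustrative, not from the paper)
NUM_LANGS = 50      # 50 in-set languages, as in the NIST 2015 challenge

class DNNClassifier(nn.Module):
    """Fully connected network over a raw i-vector (illustrative layer sizes)."""
    def __init__(self, in_dim=IVEC_DIM, num_classes=NUM_LANGS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):                    # x: (batch, IVEC_DIM)
        return self.net(x)

class CNNClassifier(nn.Module):
    """1-D convolutions over the i-vector coefficients; fewer weights than the DNN."""
    def __init__(self, in_dim=IVEC_DIM, num_classes=NUM_LANGS):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.fc = nn.Linear(32 * (in_dim // 4), num_classes)

    def forward(self, x):                    # x: (batch, IVEC_DIM)
        h = self.conv(x.unsqueeze(1))        # -> (batch, 32, IVEC_DIM // 4)
        return self.fc(h.flatten(1))

def fuse_scores(dnn_logits, cnn_logits):
    """Simple score-level fusion: average the per-language posteriors."""
    return (torch.softmax(dnn_logits, dim=-1) +
            torch.softmax(cnn_logits, dim=-1)) / 2

# Example: score a batch of 8 i-vectors with both models and fuse the outputs.
if __name__ == "__main__":
    ivectors = torch.randn(8, IVEC_DIM)
    dnn, cnn = DNNClassifier(), CNNClassifier()
    fused = fuse_scores(dnn(ivectors), cnn(ivectors))
    print(fused.argmax(dim=-1))              # predicted language index per utterance
```

In practice the models would be trained separately on labeled i-vectors and the fusion weights could be tuned on a development set; the equal-weight average above is only the simplest possible choice.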
