Abstract

Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) have recently outperformed other state-of-the-art approaches, such as i-vector and Deep Neural Networks (DNNs), in automatic Language Identification (LID), particularly when dealing with very short utterances (∼3s). In this contribution we present an open-source, end-to-end LSTM RNN system running on limited computational resources (a single GPU) that outperforms a reference i-vector system on a subset of the NIST Language Recognition Evaluation (8 target languages, 3s task) by up to 26%. This result is in line with previously published research using proprietary LSTM implementations and huge computational resources, which made those former results hardly reproducible. Further, we extend those previous experiments by modeling unseen languages (out-of-set, OOS, modeling), which is crucial in real applications. Results show that an LSTM RNN with OOS modeling is able to detect these languages and generalizes robustly to unseen OOS languages. Finally, we also analyze the effect of even more limited test data (from 2.25s down to 0.1s), showing that with as little as 0.5s an accuracy of over 50% can be achieved.

Highlights

  • Language identification (LID) aims to automatically determine which language is being spoken in a given segment of a speech utterance [1]

  • In order to gain better insight into the behavior of the Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) system when dealing with out-of-set test segments, we show in Fig. 4 the confusion matrix of the best out-of-set system, oos_lstm_2_layer_512_units, when fed with real out-of-set test utterances

  • We present an analysis of the use of Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) for Automatic Language Identification (LID) of short utterances

Introduction

Language identification (LID) aims to automatically determine which language is being spoken in a given segment of a speech utterance [1]. In a globalized world where the use of voice-operated systems is more common every day, LID typically acts as a pre-processing stage both for human listeners (e.g., call routing to a proper human operator) and for machine systems (e.g., multilingual speech processing systems) [2]. Driven by recent developments in speaker verification, the basic approach of these systems involves using i-vector front-end features followed by a classification stage that compensates for speaker and session variabilities [5,6,7]. An i-vector is a fixed-size representation (typically from 400 to 600 dimensions) of a whole utterance, derived as a point estimate of the latent variables of a total variability model.
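To make the i-vector idea concrete, the sketch below computes the standard point estimate (the posterior mean) of the latent variable from an utterance's Baum-Welch statistics: w = (I + TᵀΣ⁻¹N T)⁻¹ TᵀΣ⁻¹F. This is a minimal illustration, not the paper's pipeline: the dimensions are toy-sized (real i-vectors are 400-600 dimensional, as noted above), and the total variability matrix `T` and the statistics are randomly generated stand-ins for trained/accumulated quantities.

```python
import numpy as np

# Hypothetical toy dimensions for illustration: C Gaussian components,
# F-dimensional acoustic features, D-dimensional i-vector.
C, F, D = 8, 20, 4
rng = np.random.default_rng(0)

T = rng.standard_normal((C * F, D)) * 0.1   # total variability matrix (stand-in for a trained one)
Sigma_inv = np.ones(C * F)                  # inverse diagonal UBM covariances (stand-in)

# Baum-Welch statistics for one utterance (stand-ins for accumulated stats):
N = rng.uniform(1.0, 5.0, C)                # zeroth-order stats (frame occupancies per component)
F_stats = rng.standard_normal(C * F)        # centered first-order stats

# Posterior precision: I + T' Sigma^-1 N T, with N expanded per feature dim.
N_expanded = np.repeat(N, F)
precision = np.eye(D) + T.T @ (N_expanded[:, None] * Sigma_inv[:, None] * T)

# Point estimate (posterior mean) of the latent variable -- the i-vector.
ivector = np.linalg.solve(precision, T.T @ (Sigma_inv * F_stats))
print(ivector.shape)  # (D,): one fixed-size vector per utterance, regardless of its length
```

Whatever the utterance duration, the statistics collapse to fixed-size `N` and `F_stats`, which is why the resulting i-vector has a constant dimension and can feed a standard classifier.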
