NIST Language Recognition Evaluation Research Articles

Bottleneck features (BNFs) generated with a deep neural network (DNN) have proven to significantly boost spoken language recognition accuracy over basic spectral features. However, BNFs are commonly extracted using language-dependent tied-context phone states as learning targets. Moreover, BNFs are less phonetically expressive than the output layer of a DNN, which is usually not used as a speech feature because its very high dimensionality hinders further post-processing. In this article, we put forth a novel deep learning framework to overcome all of the above issues and evaluate it on the 2017 NIST Language Recognition Evaluation (LRE) challenge. We use manner and place of articulation as speech attributes, which lead to low-dimensional “universal” phonetic features that can be defined across all spoken languages. To model the asynchronous nature of the speech attributes while capturing their intrinsic relationships in a given speech segment, we introduce a new training scheme for deep architectures based on a Maximal Figure of Merit (MFoM) objective. MFoM introduces non-differentiable metrics into backpropagation-based training, a difficulty that the proposed framework resolves. The experimental evidence collected on the recent NIST LRE 2017 challenge demonstrates the effectiveness of our solution. In fact, the performance of spoken language recognition (SLR) systems based on spectral features is improved by more than 5% absolute Cavg. Finally, the F1 metric can be brought from 77.6% up to 78.1% by combining the conventional baseline phonetic BNFs with the proposed articulatory attribute features.

A key problem in spoken language identification (LID) is to design effective representations which are specific to language information. For example, in recent years, representations based on both phonotactic and acoustic features have proven their effectiveness for LID. Although advances in machine learning have led to significant improvements, LID performance is still lacking, especially for short-duration speech utterances. With the hypothesis that language information is weak and represented only latently in speech, and is largely dependent on the statistical properties of the speech content, existing representations may be insufficient. Furthermore, they may be susceptible to the variations caused by different speakers, specific content of the speech segments, and background noise. To address this, we propose using Deep Bottleneck Features (DBF) for spoken LID, motivated by the success of Deep Neural Networks (DNN) in speech recognition. We show that DBFs can form a low-dimensional compact representation of the original inputs with a powerful descriptive and discriminative capability. To evaluate this approach, we design two acoustic models, termed DBF-TV and parallel DBF-TV (PDBF-TV), using a DBF-based i-vector representation for each speech utterance. Results on NIST language recognition evaluation 2009 (LRE09) show significant improvements over state-of-the-art systems. By fusing the output of phonotactic and acoustic approaches, we achieve EERs of 1.08%, 1.89% and 7.01% for 30 s, 10 s and 3 s test utterances, respectively. Furthermore, various DBF configurations have been extensively evaluated, and an optimal system is proposed.
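
The abstract above reports results as equal error rates (EER). As a rough illustration of that metric only, and not of the authors' scoring pipeline, the sketch below estimates the EER from two sets of detection scores by sweeping a threshold over every observed score. The function name `equal_error_rate` and the toy scores are hypothetical, and the simple threshold sweep is an assumption; evaluation toolkits typically interpolate along the DET curve instead.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Hypothetical helper for illustration; higher scores are assumed to
    mean "more likely the target language".
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))

    # Miss rate: fraction of target trials scored below the threshold.
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False-alarm rate: fraction of non-target trials scored at or above it.
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest and
    # report their average as the EER estimate.
    idx = np.argmin(np.abs(p_miss - p_fa))
    return float((p_miss[idx] + p_fa[idx]) / 2.0)


if __name__ == "__main__":
    # Toy scores, invented purely for this example.
    targets = [2.0, 1.5, 0.3, 1.0]
    nontargets = [0.5, -1.0, 0.0, -0.5]
    print(f"EER = {equal_error_rate(targets, nontargets):.2%}")
```

With real trial lists, an EER would be computed separately for each test duration (for example 30 s, 10 s and 3 s), which is how the figures in the abstract are broken down.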

Articles published on NIST Language Recognition Evaluation

Maximal Figure-of-Merit Framework to Detect Multi-Label Phonetic Features for Spoken Language Recognition

Supervised I-vector modeling for language and accent recognition

Direct Optimization of the Detection Cost for I-Vector-Based Spoken Language Recognition

On the use of deep feedforward neural networks for automatic language identification

Language Identification in Short Utterances Using Long Short-Term Memory (LSTM) Recurrent Neural Networks

Phonotactic language recognition using dynamic pronunciation and language branch discriminative information

Frame-by-frame language identification in short utterances using deep neural networks

Deep bottleneck features for spoken language identification

Maximum A Posteriori Linear Regression for language recognition

Time-Frequency Cepstral Features and Combining Discriminative Training for Phonotactic Language Recognition

Language Identification: A Tutorial

Extraction and representation of prosodic features for language and speaker recognition

Automatic Language Identification with Discriminative Language Characterization Based on SVM

Spoken Language Recognition Using Ensemble Classifiers

A Vector Space Modeling Approach to Spoken Language Identification
