Bottleneck and Embedding Representation of Speech for DNN-based Language and Speaker Recognition
Automatic speech recognition has experienced breathtaking progress in the last few years, partially thanks to the introduction of deep neural networks. This evolution has spread to related areas such as language and speaker recognition, where deep neural networks have noticeably improved performance. In this PhD thesis, we explore different approaches to speaker and language recognition, focusing on systems where deep neural networks become part of traditional pipelines, replacing some stages or the whole system itself. Specifically, in the first experimental block, we analyze end-to-end language recognition systems based on deep neural networks, where the network is used directly as a classifier, without any other backend: the language recognition task is performed from the scores (posterior probabilities) provided by the network. These works focus on two architectures, convolutional neural networks and long short-term memory (LSTM) recurrent neural networks, which are less demanding in terms of computational resources thanks to their reduced number of free parameters in comparison with other deep neural networks. These systems thus constitute an alternative to classical i-vectors and achieve comparable results, especially when dealing with short utterances. In particular, we conducted experiments comparing a system based on convolutional neural networks with classical Factor Analysis GMM and i-vector reference systems, and evaluated them on two different tasks from the National Institute of Standards and Technology (NIST) Language Recognition Evaluation (LRE) 2009: one focused on language pairs and the other on multi-class language identification. Results showed comparable performance for the convolutional neural network based approaches, and some improvements were achieved when fusing the classical and neural network approaches.
We also present experiments performed with LSTM recurrent neural networks, which have proven their ability to model time-dependent sequences. We evaluate our LSTM-based language recognition systems on different subsets of NIST LRE 2009 and 2015, where the LSTM systems outperform the reference i-vector system with a model that has fewer parameters, although one more prone to overfitting and less able to generalize on mismatched datasets. In the second experimental block of this Dissertation, we explore one of the most prominent applications of deep neural networks in speech processing: their use as feature extractors. In these systems, a deep neural network is used to obtain a frame-by-frame representation of the speech signal, the so-called bottleneck feature vector, which is learned directly by the network and then used instead of traditional acoustic features as input to language and speaker recognition systems based on i-vectors. This approach revolutionized both fields, since it highly outperformed the classical systems that had been state-of-the-art for many years (i-vectors based on acoustic features). Our analysis focuses on how different configurations of the neural network used as bottleneck feature extractor, trained for automatic speech recognition, influence the performance of the resulting features for language and speaker recognition. For language recognition, we compare bottleneck features from networks that vary in depth (number of hidden layers), in the position of the bottleneck layer where the information is compressed, and in the number of units (size) of this layer, all of which influence the representation obtained by the network.
With the set of experiments performed on bottleneck features for speaker recognition, we analyze the influence of the type of features used to feed the network, their pre-processing and, in general, the optimization of the network for the task of feature extraction for speaker recognition, which may not coincide with the optimal configuration for ASR. Finally, the third experimental block of this Thesis proposes a novel approach for language recognition, in which the neural network extracts a fixed-length utterance-level representation of speech segments known as an embedding, able to replace the classical i-vector and overcoming the variable-length sequence of features provided by bottleneck extractors. This embedding-based approach has recently shown promising results for speaker verification tasks, and our proposed system outperformed a strong state-of-the-art i-vector reference system on the most recent and challenging language recognition evaluations organized by NIST in 2015 and 2017. We analyze language recognition systems based on embeddings, and explore different deep neural network architectures and data augmentation techniques to improve our system's results. In general, these embeddings are a fair competitor to the well-established i-vector pipeline, allowing the whole i-vector model to be replaced by a deep neural network. Furthermore, the network extracts information complementary to that contained in the i-vectors, even from the same input features. All this makes us consider this contribution an interesting research line to explore in other fields.
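The key step that turns a variable-length utterance into a fixed-length embedding, as described above, is a temporal pooling layer. A minimal numpy sketch of statistics pooling (mean and standard deviation over frames, as used in embedding extractors of the x-vector family) illustrates why two utterances of different durations map to vectors of the same size; dimensions here are illustrative, not taken from the thesis.

```python
import numpy as np

def stats_pooling(frame_feats):
    """Collapse a variable-length sequence of frame-level features
    (num_frames x feat_dim) into one fixed-length vector by
    concatenating the per-dimension mean and standard deviation."""
    mu = frame_feats.mean(axis=0)
    sigma = frame_feats.std(axis=0)
    return np.concatenate([mu, sigma])

rng = np.random.default_rng(0)
utt_a = rng.standard_normal((300, 40))    # ~3 s utterance, 40-dim features
utt_b = rng.standard_normal((1200, 40))   # ~12 s utterance, same feat_dim
emb_a, emb_b = stats_pooling(utt_a), stats_pooling(utt_b)

# Both utterances yield an 80-dim vector regardless of duration.
assert emb_a.shape == emb_b.shape == (80,)
```

In a full embedding extractor, this pooling sits between frame-level hidden layers and utterance-level layers, and the embedding is read from a layer after pooling rather than from the raw statistics.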
- Research Article
55
- 10.1371/journal.pone.0182580
- Aug 10, 2017
- PLoS ONE
Language recognition systems based on bottleneck features have recently become the state of the art in this research field, as shown by their success in the last Language Recognition Evaluation (LRE 2015) organized by NIST (U.S. National Institute of Standards and Technology). This type of system is based on a deep neural network (DNN) trained to discriminate between phonetic units, i.e. trained for the task of automatic speech recognition (ASR). This DNN compresses information in one of its layers, known as the bottleneck (BN) layer, which is used to obtain a new frame-level representation of the audio signal. This representation has proven useful for the task of language identification (LID). Thus, bottleneck features are used as input to the language recognition system, instead of a classical parameterization of the signal based on cepstral feature vectors such as MFCCs (Mel Frequency Cepstral Coefficients). Despite the success of this approach in language recognition, there is a lack of studies analyzing in a systematic way how the topology of the DNN influences the performance of bottleneck feature-based language recognition systems. In this work, we try to fill this gap, analyzing language recognition results with different topologies of the DNN used to extract the bottleneck features, comparing them with each other and against a reference system based on a more classical cepstral representation of the input signal with a total variability model. In this way, we obtain useful knowledge about how the DNN configuration influences the performance of bottleneck feature-based language recognition systems.
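The extraction mechanism described in this abstract can be sketched in a few lines: run each frame through an ASR-trained DNN and keep the activations of the narrow bottleneck layer as the new feature vector. The toy network below uses random weights and illustrative layer sizes (in a real system the weights come from senone-classification training); it only demonstrates the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy DNN: input -> hidden -> bottleneck -> hidden -> senone posteriors.
# Weights are random stand-ins; sizes are illustrative, not from the paper.
dims = [40, 512, 64, 512, 1000]   # 64-unit bottleneck in layer 2
weights = [rng.standard_normal((i, o)) * 0.01 for i, o in zip(dims[:-1], dims[1:])]

def bottleneck_features(frames, bn_layer=2):
    """Forward frames through the network and return the activations
    of the bottleneck layer as the new frame-level representation."""
    h = frames
    for layer, w in enumerate(weights, start=1):
        h = np.maximum(h @ w, 0.0)    # ReLU hidden layers
        if layer == bn_layer:
            return h                  # stop at the BN layer
    return h

mfcc = rng.standard_normal((500, 40))   # 500 frames of 40-dim cepstral features
bnf = bottleneck_features(mfcc)
assert bnf.shape == (500, 64)           # one 64-dim BN vector per frame
```

These per-frame BN vectors then replace MFCCs as input to the i-vector pipeline; varying `dims` and `bn_layer` corresponds to the topology choices the study analyzes.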
- Research Article
397
- 10.1109/lsp.2015.2420092
- Oct 1, 2015
- IEEE Signal Processing Letters
The impressive gains in performance obtained using deep neural networks (DNNs) for automatic speech recognition (ASR) have motivated the application of DNNs to other speech technologies such as speaker recognition (SR) and language recognition (LR). Prior work has shown performance gains for separate SR and LR tasks using DNNs for direct classification or for feature extraction. In this work we present the application of a single DNN to both SR and LR using the 2013 Domain Adaptation Challenge speaker recognition (DAC13) and the NIST 2011 language recognition evaluation (LRE11) benchmarks. Using a single DNN trained for ASR on Switchboard data we demonstrate large gains in performance on both benchmarks: a 55% reduction in EER for the DAC13 out-of-domain condition and a 48% reduction in $C_{avg}$ on the LRE11 30 s test condition. It is also shown that further gains are possible using score or feature fusion, leading to the possibility of a single i-vector extractor producing state-of-the-art SR and LR performance.
- Conference Article
19
- 10.21437/interspeech.2016-624
- Sep 8, 2016
The series of language recognition evaluations (LREs) conducted by the National Institute of Standards and Technology (NIST) has been one of the driving forces in advancing spoken language recognition technology. This paper presents a shared view of five institutions resulting from our collaboration toward LRE 2015 submissions under the names of I2R, Fantastic4, and SingaMS. Among others, LRE'15 emphasizes language detection in the context of closely related languages, which differs from previous LREs. From the perspective of language recognition system design, we have witnessed a major paradigm shift toward adopting deep neural networks (DNNs) for both feature extraction and classification. In particular, deep bottleneck features (DBFs) have a significant advantage in replacing the shifted delta cepstra (SDC) that had been the only option in the past. We foresee that deep learning will serve as a major driving force in advancing spoken language recognition systems in the coming years.
- Research Article
1
- 10.1186/s13636-014-0042-5
- Dec 1, 2014
- EURASIP Journal on Audio, Speech, and Music Processing
Currently, acoustic spoken language recognition (SLR) and phonotactic SLR systems are the most widely used language recognition systems. To achieve better performance, researchers combine multiple subsystems, with results often much better than those of a single SLR system. Phonotactic SLR subsystems may vary in the acoustic feature vectors or include multiple language-specific phone recognizers and different acoustic models. These methods achieve good performance but usually at high computational cost. In this paper, a new diversification for phonotactic language recognition systems is proposed using vector space models obtained by support vector machine (SVM) supervector reconstruction (SSR). In this architecture, the subsystems share the same feature extraction, decoding, and N-gram counting preprocessing steps, but model in different vector spaces by using the SSR algorithm without significant additional computation. We term this a homogeneous ensemble phonotactic language recognition (HEPLR) system. The system integrates three different SVM supervector reconstruction algorithms: relative, functional, and perturbing SVM supervector reconstruction. All of the algorithms are incorporated using a linear discriminant analysis-maximum mutual information (LDA-MMI) backend to improve language recognition evaluation (LRE) accuracy. Evaluated on the National Institute of Standards and Technology (NIST) LRE 2009 task, the proposed HEPLR system achieves better performance than a baseline phone recognition-vector space modeling (PR-VSM) system with minimal extra computational cost. The HEPLR system yields 1.39%, 3.63%, and 14.79% equal error rate (EER) for the 30 s, 10 s, and 3 s test conditions, representing 6.06%, 10.15%, and 10.53% relative improvements over the baseline system, respectively.
- Research Article
11
- 10.1109/taslp.2020.2964953
- Jan 1, 2020
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
Bottleneck features (BNFs) generated with a deep neural network (DNN) have proven to boost spoken language recognition accuracy significantly over basic spectral features. However, BNFs are commonly extracted using language-dependent tied-context phone states as learning targets. Moreover, BNFs are less phonetically expressive than the output layer in a DNN, which is usually not used as a speech feature because its very high dimensionality hinders further post-processing. In this article, we put forth a novel deep learning framework to overcome all of the above issues and evaluate it on the 2017 NIST Language Recognition Evaluation (LRE) challenge. We use manner and place of articulation as speech attributes, which lead to low-dimensional “universal” phonetic features that can be defined across all spoken languages. To model the asynchronous nature of the speech attributes while capturing their intrinsic relationships in a given speech segment, we introduce a new training scheme for deep architectures based on a Maximal Figure of Merit (MFoM) objective. MFoM introduces non-differentiable metrics into the backpropagation-based approach, which is elegantly solved in the proposed framework. The experimental evidence collected on the recent NIST LRE 2017 challenge demonstrates the effectiveness of our solution. In fact, the performance of spoken language recognition (SLR) systems based on spectral features is improved by more than 5% absolute Cavg. Finally, the F1 metric can be brought from 77.6% up to 78.1% by combining the conventional baseline phonetic BNFs with the proposed articulatory attribute features.
- Research Article
1
- 10.1186/1687-6180-2012-47
- Feb 27, 2012
- EURASIP Journal on Advances in Signal Processing
In this article, we propose a new feature for the framework of SVM-based language recognition, by introducing the idea of total variability, used in speaker recognition, into language recognition. We consider the new feature a low-dimensional representation of the Gaussian mixture model supervector. On this basis, we propose a multiple total variability (MTV) language recognition system built on the total variability (TV) language recognition system. Our experiments show that the total factor vector includes language-dependent information; moreover, the multiple total factor vector contains more language-dependent information. Experimental results on the 2007 National Institute of Standards and Technology (NIST) Language Recognition Evaluation (LRE) databases show that MTV outperforms TV in 30 s tasks, and that both TV and MTV systems achieve performance similar to that obtained by state-of-the-art approaches. The best performance of our acoustic language recognition systems can be further improved by combining these two new systems.
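The total variability idea recurring throughout these abstracts models an utterance's GMM mean supervector as M ≈ m + Tw, where m is the UBM mean supervector, T a low-rank matrix, and w the total factor (i-vector). The sketch below uses toy dimensions and a plain least-squares solve in place of the usual posterior-based estimation, purely to make the decomposition concrete.

```python
import numpy as np

rng = np.random.default_rng(1)

C, F, R = 8, 40, 10          # GMM components, feature dim, TV rank (toy sizes)
m = rng.standard_normal(C * F)         # UBM mean supervector (C*F = 320-dim)
T = rng.standard_normal((C * F, R))    # total variability matrix (320 x 10)

def total_factor(supervector):
    """Project a GMM mean supervector onto the low-rank total
    variability subspace, M ~ m + T w, solved here by least squares
    (a simplification of the standard posterior-based i-vector estimate)."""
    w, *_ = np.linalg.lstsq(T, supervector - m, rcond=None)
    return w

# Synthetic utterance supervector lying exactly in the TV subspace.
M = m + T @ rng.standard_normal(R)
w = total_factor(M)
assert w.shape == (R,)                    # 10-dim total factor vector
assert np.allclose(T @ w, M - m)          # exact recovery in this noiseless toy
```

The MTV system described above trains several such T matrices and concatenates or fuses the resulting total factor vectors to capture more language-dependent information.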
- Conference Article
2
- 10.1109/ijcnn.2017.7965951
- May 1, 2017
Recently, bottleneck features have been successfully used as effective representations in Speaker Recognition (SR) and Language Recognition (LR), but little work has focused on bottleneck features for Bird Species Verification (BSV). In SR, LR and BSV tasks, short-time spectral features alone may be insufficient, so more abstract and discriminative representations are needed to complement conventional spectral features. Work in SR and LR shows that bottleneck features can form a low-dimensional representation of the original inputs with powerful descriptive and discriminative capability. Since the general audio representation principles for speakers, languages and birds are similar, we propose a hypothesis: bottleneck features are also useful for BSV. Therefore, in this paper, we use the bottleneck feature framework on top of the standard i-vector framework to deal with crucial problems in conventional BSV methods, such as session variability and insufficient features. Moreover, we make no distinction between bird calls and bird songs in the evaluation phase. Experimental results show that the standard i-vector system and the bottleneck feature system obtain 3.39% and 0.85% Equal Error Rate (EER), respectively. The bottleneck feature system thus achieves a 75% relative improvement over the standard i-vector system, meaning that bottleneck features, as a complement to spectral features, are significantly useful for BSV. The deep feature system, another state-of-the-art framework based on deep features used in SR, however, results in 18.64% EER, much worse than the other two systems; a brief explanation is provided in this paper.
- Conference Article
40
- 10.21437/interspeech.2015-299
- Sep 6, 2015
Significant performance gains have been reported separately for speaker recognition (SR) and language recognition (LR) tasks using either DNN posteriors of sub-phonetic units or DNN feature representations, but the two techniques have not been compared on the same SR or LR task, or across SR and LR tasks using the same DNN. In this work we present the application of a single DNN to both tasks using the 2013 Domain Adaptation Challenge speaker recognition (DAC13) and the NIST 2011 language recognition evaluation (LRE11) benchmarks. Using a single DNN trained on Switchboard data we demonstrate large gains in performance on both benchmarks: a 55% reduction in EER for the DAC13 out-of-domain condition and a 48% reduction in Cavg on the LRE11 30 s test condition. Score fusion and feature fusion are also investigated, as is the performance of the DNN technologies at short durations for SR. Index Terms: i-vector, DNN, bottleneck features, speaker recognition, language recognition
- Research Article
- 10.1007/s11265-015-1017-1
- May 28, 2015
- Journal of Signal Processing Systems
Currently, phonotactic spoken language recognition (SLR) and acoustic SLR systems are the most widely used language recognition systems. Parallel phone recognition followed by vector space modeling (PPRVSM) is one typical phonotactic system for spoken language recognition. To achieve better performance, researchers have sought to extract more complementary information from the training data using multiple language-specific phone recognizers, different acoustic models and different acoustic features. These methods achieve good performance but usually at high computational cost, and they exploit complementary information from the training data only. In this paper, we explore a novel approach to discriminative vector space model (VSM) training that uses a boosting framework to exploit the discriminative information of test data effectively, in which an ensemble of VSMs is trained sequentially. The effectiveness of our boosting variation comes from the emphasis on working with high-confidence test data to obtain discriminatively trained models. Our variant of boosting also utilizes the original training data in VSM training. The discriminative boosting algorithm (DBA) is applied to the National Institute of Standards and Technology (NIST) language recognition evaluation (LRE) 2009 task and shows performance improvements. The experimental results demonstrate that the proposed DBA achieves 1.8%, 11.72% and 15.35% relative reductions in equal error rate (EER) over the baseline system for 30 s, 10 s and 3 s test utterances, respectively.
- Research Article
3
- 10.1016/j.csl.2014.09.003
- Sep 29, 2014
- Computer Speech & Language
Relevance factor of maximum a posteriori adaptation for GMM–NAP–SVM in speaker and language recognition
- Book Chapter
1
- 10.1007/978-981-99-1027-4_119
- Jan 1, 2023
Efficient and accurate prediction of battery remaining capacity can guarantee the safety and reliability of electric vehicles (EVs). However, battery capacity is difficult to measure directly due to complex application scenarios and sophisticated internal physicochemical reactions. This study develops a hybrid deep learning approach for accurate remaining capacity estimation based on the differential temperature (DT) curve. First, the cycle life data are acquired and analyzed. Then, DT curves are derived from the charging data and smoothed via a Kalman filter (KF). Next, health features (HFs) that characterize battery degradation are extracted from the DT curves. Finally, a hybrid deep learning model fusing a convolutional neural network (CNN) and a gated recurrent unit (GRU) recurrent neural network (RNN) is established to predict battery remaining capacity. Each deep neural network (NN) in the model executes a particular part of the forecasting task to maximize its corresponding merits. The superiority of the proposed method in terms of accuracy is justified via comparison with other modern methods, including long short-term memory (LSTM) RNN, GRU RNN and a hybrid model integrating CNN and LSTM RNN. Experimental results demonstrate the effectiveness and applicability of the proposed method in enabling battery remaining capacity estimation.
- Research Article
- 10.1121/1.4708077
- Apr 1, 2012
- The Journal of the Acoustical Society of America
In this paper, we integrate the concept of large margin Gaussian mixture models (large margin GMMs) into discriminative training for language recognition. We propose a new language recognition system (the SVM-LM-ModelPushing system) which combines model pushing by large margin GMMs (LM-ModelPushing) with the original model pushing by SVM (ModelPushing). Our experiments show that LM-ModelPushing captures language-dependent information and, moreover, that it contains language-dependent information different from that of ModelPushing. Experimental results on the 2007 National Institute of Standards and Technology (NIST) Language Recognition Evaluation (LRE) databases show that the SVM-LM-ModelPushing system gains relative improvements in EER of 9.1% and in minDCF of 8.8% compared to the original ModelPushing system in 30-second tasks.
- Conference Article
39
- 10.1109/icassp.2016.7472744
- Mar 1, 2016
Using bottleneck features extracted from a deep neural network (DNN) trained to predict senone posteriors has resulted in new, state-of-the-art technology for language and speaker identification. For language identification, the features' dense phonetic information is believed to enable improved performance by better representing language-dependent phone distributions. For speaker recognition, the role of these features is less clear, given that a bottleneck layer near the DNN output layer is thought to contain limited speaker information. In this article, we analyze the role of bottleneck features in these identification tasks by varying the DNN layer from which they are extracted, under the hypothesis that speaker information is traded for dense phonetic information as the layer moves toward the DNN output layer. Experiments support this hypothesis under certain conditions, and highlight the benefit of using a bottleneck layer close to the DNN output layer when DNN training data is matched to the evaluation conditions, and a layer more central to the DNN otherwise.
- Book Chapter
- 10.1007/978-3-642-37835-5_39
- Sep 22, 2013
In this paper, we introduce a new subspace learning algorithm for language recognition called locality preserving discriminant projection (LPDP). The total variability approach has been the state of the art in language recognition, and it preserves most of the discriminant information of languages. Locality preserving projection (LPP) has proved effective in language recognition, but it can only preserve the local structure of languages. The LPDP method, used in the total variability subspace, can preserve both the local structure and the global discriminant information of the languages. Experiments are carried out on the NIST 2011 Language Recognition Evaluation (LRE) database. The results indicate that the LPDP language recognition system performs better than the LPP and total variability language recognition systems in 30 s tasks. In addition, we also give results for the total variability and LPDP language recognition systems on the NIST 2007 LRE 30 s database.
- Research Article
10
- 10.1186/s13636-015-0066-5
- Aug 13, 2015
- EURASIP Journal on Audio, Speech, and Music Processing
Support vector machines (SVMs) have played an important role in state-of-the-art language recognition systems. The recently developed extreme learning machine (ELM) tends to have better scalability and to achieve similar or much better generalization performance at much faster learning speed than the traditional SVM. Inspired by these properties of ELM, in this paper we propose a novel method called regularized minimum class variance extreme learning machine (RMCVELM) for language recognition. The RMCVELM aims at minimizing the empirical risk, the structural risk, and the intra-class variance of the training data in the decision space simultaneously. The proposed method, which is computationally inexpensive compared to SVM, provides a new classifier for language recognition and is evaluated on the 2009 National Institute of Standards and Technology (NIST) language recognition evaluation (LRE). Experimental results show that the proposed RMCVELM obtains much better performance than SVM. In addition, the RMCVELM can also be applied to the popular i-vector space and obtains results comparable to existing scoring methods.