Abstract
Language recognition systems based on bottleneck features have recently become the state-of-the-art in this research field, showing its success in the last Language Recognition Evaluation (LRE 2015) organized by NIST (U.S. National Institute of Standards and Technology). This type of system is based on a deep neural network (DNN) trained to discriminate between phonetic units, i.e. trained for the task of automatic speech recognition (ASR). This DNN aims to compress information in one of its layers, known as bottleneck (BN) layer, which is used to obtain a new frame representation of the audio signal. This representation has been proven to be useful for the task of language identification (LID). Thus, bottleneck features are used as input to the language recognition system, instead of a classical parameterization of the signal based on cepstral feature vectors such as MFCCs (Mel Frequency Cepstral Coefficients). Despite the success of this approach in language recognition, there is a lack of studies analyzing in a systematic way how the topology of the DNN influences the performance of bottleneck feature-based language recognition systems. In this work, we try to fill-in this gap, analyzing language recognition results with different topologies for the DNN used to extract the bottleneck features, comparing them and against a reference system based on a more classical cepstral representation of the input signal with a total variability model. This way, we obtain useful knowledge about how the DNN configuration influences bottleneck feature-based language recognition systems performance.
Highlights
The task of Language Recognition or Language Identification (LID) is defined as the task of identifying the language spoken in a given audio segment [1]
We evaluate the language recognition system in the test-development dataset described in Section Test Datasets, where we explore the influence of variations in the topology of the deep neural network (DNN), and, we show the results in the evaluation dataset of the Language Recognition Evaluation (LRE) 2015
It is very interesting to see that the system which gives the best performance in terms of phoneme frame accuracy does not lead to a better bottleneck feature extractor for LID
Summary
The task of Language Recognition or Language Identification (LID) is defined as the task of identifying the language spoken in a given audio segment [1]. Automatic systems for LID aim to perform this task automatically, learning from a given dataset the necessary parameters to identify new spoken data. There are multiple applications of this technology as, for example, call centers that need to classify a call according to the language spoken, speech processing systems that deal with multilingual inputs, multimedia content indexing, or security applications such as tracking people depending on their language or accent. An analysis of DNN topology in bottleneck based language recognition partir de la Voz (TEC2015-68172-C2-1-P). Both projects are funded by Ministerio de Economıa y Competitividad, Spain. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.