Automatic speech recognition (ASR) systems that facilitate voice-based search and information retrieval have gained importance recently. While the performance of ASR systems for Indian languages has improved in the recent past, they have yet to gain as wide an acceptance as ASR systems for English spoken by Indians. Almost all Indians learn English as a second or third language, so the phoneme set and prosody of a speaker's native language influence the acoustic characteristics of their spoken English. Since Indians speak a wide variety of languages, the acoustic characteristics of English spoken by Indians vary considerably. Thus, the recognition accuracy of Indian English could be improved by employing native-language-dependent English ASR systems. This approach requires automatic identification of the native language of the speaker. Here, we report the performance of an automatic Native Language Identification (NLI) system that recognises the native language of a speaker as Assamese, Bengali or Bodo by analysing an English sentence spoken by that speaker.

Training and performance evaluation of an NLI system require appropriate linguistic resources: (a) speech data in each of the three languages from several speakers, (b) corresponding word-level transcriptions and (c) a pronunciation dictionary. While pronunciation dictionaries for English are freely available, English speech from speakers of the three languages mentioned above and the corresponding transcriptions are not publicly available. We therefore created a relevant speech database by recording English spoken by male and female native speakers of these three scheduled languages. Each speaker read 100 sentences out of a set of 700 English sentences, which were either proverbs or digit sequences; each sentence contained 5 to 8 words. The speech was recorded under ambient conditions using a laptop and digitised at 16000 Hz, 16 bit, mono. The database contains spoken English from 35 native Assamese, 33 Bengali and 30 Bodo speakers. To carry out a threefold evaluation of the system, the speakers of each language were grouped into 3 subsets of nearly equal size. In each fold, one subset was designated as test data and the remaining two subsets were used to train the system.

We used Kaldi, an open-source ASR toolkit, to implement the NLI system. As a first step, we built three English ASR systems, each trained on data from one of the three languages: Assamese, Bengali and Bodo. Each phone was represented by a three-state Hidden Markov Model (HMM), and each HMM state was associated with a Gaussian mixture model. We used Mel-frequency cepstral coefficients and their temporal derivatives as features, and a bigram model as the language model. To identify the native language of a speaker, the test speech file was fed to each of the three ASR systems. An ASR system generates not only the decoded word sequence but also the corresponding log likelihood. The NLI system follows the maximum likelihood criterion: the language corresponding to the ASR system that yielded the highest likelihood for the test speech was declared the native language of the speaker. The overall accuracy of the NLI system was computed as the unweighted average recall derived from the confusion matrix.
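The decision rule itself is a simple maximum-likelihood comparison across the three per-language ASR systems. The following is a minimal sketch of that rule and of the unweighted average recall computation; the decode() method, the dictionary of systems and the sample confusion matrix are hypothetical stand-ins for illustration, not the actual Kaldi pipeline or results reported in the study.

import numpy as np

def identify_native_language(wav_path, asr_systems):
    # Maximum-likelihood NLI: pick the language whose English ASR system
    # assigns the highest log likelihood to the test utterance.
    # `asr_systems` maps a language name to an object with a hypothetical
    # decode(wav_path) method returning the best-path log likelihood of the
    # decoded word sequence; this interface is an assumption for the sketch.
    log_likelihoods = {lang: asr.decode(wav_path) for lang, asr in asr_systems.items()}
    return max(log_likelihoods, key=log_likelihoods.get)

def unweighted_average_recall(confusion):
    # Overall accuracy as the mean of per-class recalls, computed from a
    # (true language x predicted language) confusion matrix.
    confusion = np.asarray(confusion, dtype=float)
    per_class_recall = confusion.diagonal() / confusion.sum(axis=1)
    return per_class_recall.mean()

# Illustrative 3x3 confusion matrix (rows/columns: Assamese, Bengali, Bodo);
# the numbers are made up and do not correspond to the reported experiments.
print(unweighted_average_recall([[20, 10, 5], [9, 19, 5], [3, 4, 23]]))

Using the unweighted average of per-class recalls, rather than raw accuracy, keeps the overall figure from being biased toward the language with the most test speakers.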
The NLI accuracy of the system, averaged over the threefold cross-validation, was 59% for test speech of just 3 seconds. The confusion was largest between Assamese and Bengali, as both are close members of the Indo-Aryan language family, whereas Bodo belongs to the Sino-Tibetan language family. We also discuss the performance of the NLI system with different models, such as context-dependent and context-independent HMMs, and with either a Gaussian mixture model or a deep neural network used to estimate the likelihood of a feature vector being emitted from an HMM state.

Keywords: Automatic identification, automatic speech recognition, native language identification, voice-based search, information retrieval