Word Error Rate Reduction Research Articles

This paper proposes a new two-step joint optimization approach, based on the asynchronous subregion optimization method, for training a pipeline model composed of two different models. The first step of the proposed approach trains the front-end model only, and the second step trains all the parameters of the combined model together. In the asynchronous subregion optimization method, the first step supports only the goal of the front-end model. In the proposed approach, by contrast, the first step uses a new loss function so that the front-end model also supports the goal of the back-end model. The proposed optimization approach was applied here to a pipeline composed of a deep complex convolutional recurrent network (DCCRN)-based speech enhancement model as the front-end and a conformer-transducer-based automatic speech recognition (ASR) model as the back-end. The performance of the proposed approach was then evaluated on the LibriSpeech ASR corpus in noisy environments by measuring the character error rate (CER) and word error rate (WER). In addition, an ablation study was carried out to examine the effectiveness of the proposed optimization approach on each of the processing blocks in the conformer-transducer ASR model. The ablation study showed that the conformer-transducer-based ASR model with the joint network trained only by the proposed optimization approach achieved the lowest average CER and WER. Moreover, the proposed optimization approach reduced the average CER and WER on the Test-Noisy dataset under matched noise conditions by 0.30% and 0.48%, respectively, compared to separate optimization of speech enhancement and ASR. Compared to the conventional two-step joint optimization approach, the proposed approach provided average CER and WER reductions of 0.22% and 0.31%, respectively. Under mismatched noise conditions, the proposed approach achieved a lower average CER and WER than the conventional optimization approach, by 0.32% and 0.43%, respectively.
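As a reading aid for the CER and WER figures quoted above, the following is a minimal sketch of how both metrics are commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length. It is not taken from the paper; the function names are hypothetical, and the choices to tokenize on whitespace and to count spaces as characters in CER are assumptions that real evaluation toolkits may handle differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    # Assumption: spaces are counted as ordinary characters.
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER: {wer(ref, hyp):.3f}")
    print(f"CER: {cer(ref, hyp):.3f}")
```

Because insertions are counted, both rates can exceed 1.0 when the hypothesis is much longer than the reference.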


Background: In India, thousands of languages and dialects are in use, and most of them are low-resource. Well-performing Automatic Speech Recognition (ASR) systems for Indian languages are unavailable due to this lack of resources. Hindi is one such language, as large-vocabulary Hindi speech datasets are not freely available. Only a few hours of transcribed Hindi speech are available to us, and creating a well-transcribed speech dataset takes considerable time and money. Developing a real-time ASR system from a few hours of training data is therefore a highly challenging task. Techniques such as data augmentation, semi-supervised training, multilingual architectures, and transfer learning have been reported in the past to tackle the shortage of speech data. In this paper, we examine the effect of multilingual acoustic modeling in ASR systems for the Hindi language. Objective: This article's objective is to develop a high-accuracy Hindi ASR system with a reasonable computational load using a few hours of training data. Method: To achieve this goal, we used multilingual training with Time Delay Neural Network-Bidirectional Long Short-Term Memory (TDNN-BLSTM) acoustic modeling. Multilingual acoustic modeling has significantly improved ASR performance for low- and limited-resource languages. The common practice is to train the acoustic model by merging data from similar languages. In this work, we use three Indian languages, namely Hindi, Marathi, and Bengali. The proposed model was trained on 2.5 hours of Hindi training data, 5.5 hours of Marathi training data, and 28.5 hours of transcribed Bengali data. Results: The Kaldi toolkit was used to perform all the experiments. The investigation covers three main points. First, we present the monolingual ASR system using various Neural Network (NN)-based acoustic models. Second, we show that Recurrent Neural Network (RNN) language modeling helps to improve the ASR performance further. Finally, we show that a multilingual ASR system significantly reduces the Word Error Rate (WER) (an absolute WER reduction of 2% for Hindi and 3% for Marathi). In all three languages, the proposed TDNN-BLSTM-A multilingual acoustic models achieve the lowest WER. Conclusion: The multilingual hybrid TDNN-BLSTM-A architecture shows a 13.67% relative improvement over the monolingual Hindi ASR system. The best WER of 8.65% was recorded for Hindi ASR. For Marathi and Bengali, the proposed TDNN-BLSTM-A acoustic model reports best WERs of 30.40% and 10.85%, respectively.
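The abstract above quotes both absolute and relative WER changes, which are easy to confuse. The sketch below shows the usual arithmetic for each, assuming WER values are expressed in percent. The function names and the example numbers are hypothetical and are not results from the paper.

```python
def absolute_reduction(baseline_wer, new_wer):
    """Absolute reduction in percentage points: baseline minus new."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer, new_wer):
    """Relative reduction in percent of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


if __name__ == "__main__":
    # Hypothetical numbers chosen only to illustrate the difference.
    baseline, improved = 20.0, 18.0
    print(f"absolute: {absolute_reduction(baseline, improved):.2f} points")
    print(f"relative: {relative_reduction(baseline, improved):.2f} %")
```

With these illustrative values, a drop of 2 percentage points corresponds to a 10% relative reduction, so the same result can look quite different depending on which convention a paper reports.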


Articles published on Word Error Rate Reduction

Data Augmentation Using Spectral Warping for Low Resource Children ASR

Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition.

Two-Step Joint Optimization with Auxiliary Loss Function for Noise-Robust Speech Recognition.

Handwritten text generation and strikethrough characters augmentation

Low-resource automatic speech recognition and error analyses of oral cancer speech

An OCR Post-Correction Approach Using Deep Learning for Processing Medical Reports

Automatic Speech Recognition Performance Improvement for Mandarin Based on Optimizing Gain Control Strategy.

Multilingual speech recognition for GlobalPhone languages

Writer adaptation for E2E Arabic online handwriting recognition via adversarial multi task learning

Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition

Combining Frame-Synchronous and Label-Synchronous Systems for Speech Recognition

End-to-End Dereverberation, Beamforming, and Speech Recognition in a Cocktail Party

Towards Contextual Spelling Correction for Customization of End-to-End Speech Recognition Systems

Neural Architecture Search for LF-MMI Trained Time Delay Neural Networks

Optimizing Data Usage for Low-Resource Speech Recognition

An Investigation of Multilingual TDNN-BLSTM Acoustic Modeling for Hindi Speech Recognition

Non-diacritized Arabic speech recognition based on CNN-LSTM and attention-based models

Low‐latency transformer model for streaming automatic speech recognition

Out Domain Data Augmentation on Punjabi Children Speech Recognition using Tacotron

Data Augmentation for Arabic Speech Recognition Based on End-to-End Deep Learning
