Character Error Rate Research Articles

This paper aims to improve end-to-end (E2E) speech recognition for children's speech under low-resource conditions. For the majority of languages, there is hardly any speech data from child speakers, and even the available children's speech corpora are limited in the number of hours of data. In contrast, large amounts of adults' speech data are freely available for research as well as commercial purposes. As a consequence, developing an effective E2E automatic speech recognition (ASR) system for children is very challenging. One may develop an ASR system using adults' speech and then use it to transcribe children's data, but this leads to very poor recognition rates because of the stark differences in the acoustic attributes of adults' and children's speech. To overcome these hurdles and develop a robust children's ASR system with an E2E architecture, we resort to several out-of-domain and in-domain data augmentation techniques. For out-of-domain data augmentation, we explicitly modify adults' speech to render it acoustically similar to children's speech before pooling it into training. For in-domain data augmentation, we slightly modify the pitch and duration of children's speech to create more data capturing greater diversity. These data augmentation approaches help mitigate, to a certain extent, the ill effects of data scarcity in the child domain, which in turn reduces the error rates by a large margin. In addition to data augmentation, we study the efficacy of Gammatone frequency cepstral coefficients (GFCC) and the frequency domain linear prediction (FDLP) technique alongside the most commonly used Mel-frequency cepstral coefficients (MFCC) for front-end speech parameterization. Both MFCC and GFCC capture and model the spectral envelope of speech. In contrast, applying linear prediction to the frequency-domain representation of the speech signal captures the temporal envelope during front-end feature extraction. The temporal envelope modeled by FDLP features provides important cues for the perception and understanding of stop bursts and, at times, complete phonemes. This motivated a comparative experimental study of the three front-end acoustic features. In our experiments, the proposed data augmentation combined with FDLP features yields a relative improvement in character error rate of 67.6% over the baseline system. Combining data augmentation with MFCC or GFCC features is observed to result in lower recognition performance.
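For readers unfamiliar with the metric reported above, character error rate is commonly computed as the character-level edit (Levenshtein) distance between a reference transcript and the system output, divided by the reference length. The sketch below illustrates that common definition only; the function names `edit_distance` and `character_error_rate` are hypothetical, and the abstract does not state which scoring tool or text normalization the authors used.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn `ref` into `hyp`."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER as a fraction: edit distance divided by reference length.
    Assumption: no text normalization is applied before scoring."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substitution in a five-character reference.
    print(character_error_rate("hello", "hallo"))  # 0.2
```

Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis contains many inserted characters.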

This study proposes a novel technique for recognizing printed text in images for Urdu, a low-resource language with a scarcity of benchmark datasets. The technique, called Efficient CRNN, uses depthwise separable convolutional layers and bidirectional gated recurrent unit layers, followed by a connectionist temporal classification loss. It is computationally more efficient than existing text recognition techniques, requiring fewer parameters and computations. A multi-font corpus of printed Urdu text lines is also presented, consisting of 245,000 text line images rendered using 7 different fonts. The corpus, called MMU-Extension-22, is used to train and evaluate existing state-of-the-art end-to-end text recognition techniques as well as Efficient CRNN. The proposed technique is trained on 196,000 text line images and tested on 49,000 images. Efficient CRNN achieved minimum character and word error rates of 0.91% and 1.49%, respectively, for Urdu text line recognition under different settings, outperforming existing, computationally more complex techniques. The simplicity of the proposed technique makes it not only more efficient but also more robust for Urdu text line recognition: it achieves a character error rate lower by 2.23%, a 71% decrease, compared to the best-performing existing recurrent neural network-based technique (percentage decrease = 100 × (baseline value − changed value) / baseline value). It also outperforms a Vision Transformer-based network, with a character error rate lower by 0.79%, a 41% decrease in error. In addition, Efficient CRNN has 49.16% fewer parameters than the baseline Vision Transformer technique.
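The percentage-decrease formula given in the abstract's footnote can be written out as a small helper. This is a minimal sketch: the function name `percentage_decrease` is hypothetical, and the numbers in the usage example are illustrative placeholders, not results from the paper.

```python
def percentage_decrease(baseline_value: float, changed_value: float) -> float:
    """Percentage decrease as defined in the abstract's footnote:
    100 * (baseline value - changed value) / baseline value."""
    if baseline_value == 0:
        raise ValueError("baseline value must be non-zero")
    return 100.0 * (baseline_value - changed_value) / baseline_value


if __name__ == "__main__":
    # Illustrative values only (not taken from the paper):
    # an error rate that falls from 10.0% to 4.0%.
    absolute_drop = 10.0 - 4.0
    relative_drop = percentage_decrease(10.0, 4.0)
    print(absolute_drop, relative_drop)
```

The distinction matters when reading results like those above: the same change can be stated as a difference between two error rates or as a percentage of the baseline error rate, and the two figures are not interchangeable.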

Articles published on Character Error Rate

Deep learning-based recognition system for Pashto handwritten text: benchmark on PHTI.

Sampleformer: An efficient conformer-based Neural Network for Automatic Speech Recognition

Text Recognition for Library Collection in Different Light Conditions

KMSAV: Korean multi‐speaker spontaneous audiovisual dataset

Developing children's ASR system under low-resource conditions using end-to-end architecture

Oral Voice Recognition System Based on Deep Neural Network Posteriori Probability Algorithm

Deep Learning-Based Analysis of Ancient Greek Literary Texts in English Version: A Statistical Model Based on Word Frequency and Noise Probability for the Classification of Texts

Deep Learning for Accurate Recognition of Arabic Handwritten Words in Historical Documents.

ARABIC SOFT SPELLING CORRECTION WITH T5

A hybrid model for Arabic Script Recognition based on CNN-CBAM and BLSTM

NEURAL NETWORK ARCHITECTURE FOR TEXT DECODING BASED ON SPEAKER'S LIP MOVEMENTS

Offline Handwritten Text Extraction and Recognition Using CNN-BLSTM-CTC Network

Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

Gated Convolution and Stacked Self-Attention Encoder–Decoder-Based Model for Offline Handwritten Ethiopic Text Recognition

The Challenges of HTR Model Training: Feedback from the Project Donner le goût de l'archive à l'ère numérique

Task-based Meta Focal Loss for Multilingual Low-resource Speech Recognition

Efficient CRNN: Towards end-to-end low resource Urdu text recognition using depthwise separable convolutions and gated recurrent units

Online Mongolian Handwriting Recognition Based on Encoder–Decoder Structure with Language Model

Automatic optical inspection for detecting keycaps misplacement using Tesseract optical character recognition

Radio2Text
