Relative Character Error Rate Research Articles

Speech is a natural and effective means of communication and is widely used in information communication and human–machine interaction. In recent years, various algorithms have been used to make this communication efficient. Automatic speech recognition (ASR), one of the key technologies in this field, converts the analog signal of input speech into the corresponding text. ASR approaches fall into two categories: those based on the hidden Markov model (HMM) and those based on end-to-end (E2E) models. Compared with the former, E2E models have a simpler modeling process and are easier to train, so research has moved toward developing E2E models for ASR. HMM-based speech recognition technologies also have disadvantages in prediction error rate, generalization ability, and convergence speed. This study therefore starts from the recurrent neural network–transducer (RNN–T), a typical E2E acoustic model that can model the dependencies between outputs and can be optimized jointly with a language model (LM). A new acoustic model, DL–T, based on DenseNet (dense convolutional network)–LSTM (long short-term memory)–Transducer, is proposed to address the high prediction error rate and slow convergence of the RNN–T. First, the RNN–T is briefly introduced. Then, combining the merits of DenseNet and LSTM, the DL–T acoustic model is presented. The DL–T can extract high-dimensional speech features and alleviate gradient problems, and it offers a low character error rate (CER) and fast convergence. A transfer learning method suited to the DL–T is also proposed. Finally, the DL–T is evaluated on speech recognition with the Aishell–1 dataset to validate these methods. The experimental results show that the DL–T reduces CER by a relative 12.52% compared with the RNN–T, reaching a final CER of 10.34%, which demonstrates the lower CER and faster convergence of the DL–T.
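The relation between a relative reduction and a final CER can be sketched as below. This is a minimal illustration, assuming the usual definition of relative reduction, (baseline - new) / baseline; the function names are hypothetical and not taken from the paper, and the baseline value is back-computed from the two figures in the abstract rather than quoted from it.

```python
def relative_cer_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative reduction as a fraction of the baseline CER (assumed definition)."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - new_cer) / baseline_cer


def implied_baseline(new_cer: float, relative_reduction: float) -> float:
    """Baseline CER implied by a final CER and a relative reduction."""
    if not 0 <= relative_reduction < 1:
        raise ValueError("relative reduction must be in [0, 1)")
    return new_cer / (1.0 - relative_reduction)


# Figures from the abstract: final CER 10.34%, relative reduction 12.52%.
baseline = implied_baseline(10.34, 0.1252)
print(f"implied baseline CER: {baseline:.2f}%")          # about 11.82%
print(f"absolute reduction: {baseline - 10.34:.2f} points")  # about 1.48 points
```

Under that assumption, a 12.52% relative reduction corresponds to an absolute drop of roughly 1.5 percentage points, which is why relative and absolute figures should not be compared directly.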

Recently, the hybrid convolutional neural network hidden Markov model (CNN-HMM) has been introduced for offline handwritten Chinese text recognition (HCTR) and has achieved state-of-the-art performance. However, modeling each character in the large Chinese vocabulary with a uniform, fixed number of hidden states incurs high memory and computational costs and makes the tens of thousands of HMM state classes easy to confuse. Another key issue of CNN-HMM for HCTR is the diversity of writing styles, which strains the model and causes a significant performance decline for specific writers. To address these issues, we propose a writer-aware CNN based on parsimonious HMM (WCNN-PHMM). First, PHMM is designed using a data-driven state-tying algorithm to greatly reduce the total number of HMM states, which not only yields a compact CNN by sharing states among the same or similar radicals of different Chinese characters but also improves recognition accuracy, owing to the more accurate modeling of tied states and the lower confusion among them. Second, WCNN integrates each convolutional layer with one adaptive layer fed by a writer-dependent vector, namely, the writer code, to extract the irrelevant variability in writer information and so improve recognition performance. The parameters of the writer-adaptive layers are jointly optimized with the other network parameters in the training stage, while a multiple-pass decoding strategy is adopted to learn the writer code and generate recognition results. Validated on the ICDAR 2013 competition set of the CASIA-HWDB database, the more compact WCNN-PHMM with a 7360-class vocabulary achieves a relative character error rate (CER) reduction of 16.6% over the conventional CNN-HMM without considering language modeling. By adopting a powerful hybrid language model (N-gram language model and recurrent neural network language model), the CER of WCNN-PHMM is reduced to 3.17%. Moreover, the state-tying results of PHMM explicitly show the information sharing among similar characters and the confusion reduction of tied state classes. Finally, we visualize the learned writer codes and demonstrate their strong relationship with the writing styles of different writers. To the best of our knowledge, WCNN-PHMM yields the best results on the ICDAR 2013 competition set, demonstrating its power when enlarging the size of the character vocabulary.
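The abstract does not state how CER is computed. A common definition, assumed here, is the character-level edit distance (substitutions, insertions, deletions) between the recognized text and the reference, divided by the reference length. The sketch below uses hypothetical function names and a made-up example string; it is not the authors' evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative strings only: one substituted character out of six.
print(f"{cer('语音识别技术', '语音识别技巧'):.3f}")
```

Because insertions count as errors, a CER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.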

Articles published on Relative Character Error Rate

Grammar-Supervised End-to-End Speech Recognition with Part-of-Speech Tagging and Dependency Parsing

A Phonetic-Semantic Pre-Training Model for Robust Speech Recognition

Optimizing Data Usage for Low-Resource Speech Recognition

An investigation of neural uncertainty estimation for target speaker extraction equipped RNN transducer

Low‐latency transformer model for streaming automatic speech recognition

Deep neural network-based generalized sidelobe canceller for dual-channel far-field speech recognition

Research on automatic speech recognition based on a DL–T and transfer learning

Gated Recurrent Fusion With Joint Training Framework for Robust End-to-End Speech Recognition

Improving Low-Resource Speech Recognition Based on Improved NN-HMM Structures

Cross-Domain Deep Visual Feature Generation for Mandarin Audio–Visual Speech Recognition

Writer-aware CNN for parsimonious HMM-based offline handwritten Chinese text recognition

Adversarial Regularization for Attention Based End-to-End Robust Speech Recognition

Multi-domain adversarial training of neural network acoustic models for distant speech recognition

Bone-conducted speech enhancement using deep denoising autoencoder

Deep neural network acoustic models for spoken assessment applications

Improved syllable-based acoustic modeling for continuous Chinese speech recognition

Cross-Lingual Language Modeling for Low-Resource Speech Recognition

Improved acoustic models for spontaneous speech recognition

An Innovative Prosody Modeling Method for Chinese Speech Recognition

Tone Modeling for Continuous Mandarin Speech Recognition
