Recurrent Neural Network Transducer Research Articles

PurposeIn the COVID-19 era, sign language (SL) translation has gained attention in online learning, which evaluates the physical gestures of each student and bridges the communication gap between dysphonia and hearing people. The purpose of this paper is to devote the alignment between SL sequence and nature language sequence with high translation performance.Design/methodology/approachSL can be characterized as joint/bone location information in two-dimensional space over time, forming skeleton sequences. To encode joint, bone and their motion information, we propose a multistream hierarchy network (MHN) along with a vocab prediction network (VPN) and a joint network (JN) with the recurrent neural network transducer. The JN is used to concatenate the sequences encoded by the MHN and VPN and learn their sequence alignments.FindingsWe verify the effectiveness of the proposed approach and provide experimental results on three large-scale datasets, which show that translation accuracy is 94.96, 54.52, and 92.88 per cent, and the inference time is 18 and 1.7 times faster than listen-attend-spell network (LAS) and visual hierarchy to lexical sequence network (H2SNet) , respectively.Originality/valueIn this paper, we propose a novel framework that can fuse multimodal input (i.e. joint, bone and their motion stream) and align input streams with nature language. Moreover, the provided framework is improved by the different properties of MHN, VPN and JN. Experimental results on the three datasets demonstrate that our approaches outperform the state-of-the-art methods in terms of translation accuracy and speed.

Read full abstract

Speech has been a natural and effective way of communication, widely used in the field of information-communication and human–machine interaction. In recent years, various algorithms have been used for achieving efficient communication. The main purpose of automatic speech recognition (ASR), one of the key technologies in this field, is to convert the analog signals of input speech into corresponding text digital signals. Further, ASR can be divided into two categories: one based on hidden Markov model (HMM) and the other based on end to end (E2E) models. Compared with the former, E2E models have a simple modeling process and an easy training model and thus, research is carried out in the direction of developing E2E models for effectively using in ASR. However, HMM-based speech recognition technologies have some disadvantages in terms of prediction error rate, generalization ability, and convergence speed. Therefore, recurrent neural network–transducer (RNN–T), a typical E2E acoustic model that can model the dependencies between the outputs and can be optimized jointly with a Language Model (LM), was proposed in this study. Further, a new acoustic model of DL–T based on DenseNet (dense convolutional network)–LSTM (long short-term memory)–Transducer, was proposed to solve the problems of a high prediction error rate and slow convergence speed in a RNN–T. First, a RNN–T was briefly introduced. Then, combining the merits of both DenseNet and LSTM, a novel acoustic model of DL–T, was proposed in this study. A DL–T can extract high-dimensional speech features and alleviate gradient problems and it has the advantages of low character error rate (CER) and fast convergence speed. Apart from that, a transfer learning method suitable for a DL–T was also proposed. Finally, a DL–T was researched in speech recognition based on the Aishell–1 dataset for validating the abovementioned methods. The experimental results show that the relative CER of DL–T is reduced by 12.52% compared with RNN–T, and the final CER is 10.34%, which also demonstrates a low CER and better convergence speed of the DL–T.

Read full abstract

Recurrent Neural Network Transducer Research Articles

Related Topics

Articles published on Recurrent Neural Network Transducer

ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition.

An analog-AI chip for energy-efficient speech recognition and transcription

Savitar: an intelligent sign language translation approach for deafness and dysphonia in the COVID-19 era

ASR - VLSP 2021: Automatic Speech Recognition with Blank Label Re-weighting

An investigation of neural uncertainty estimation for target speaker extraction equipped RNN transducer

A 2D Convolutional Gating Mechanism for Mandarin Streaming Speech Recognition

Research on automatic speech recognition based on a DL–T and transfer learning

Streaming End-to-End Multi-Talker Speech Recognition

Accelerating RNN Transducer Inference via Adaptive Expansion Search

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Recurrent Neural Network Transducer Research Articles

Related Topics

Articles published on Recurrent Neural Network Transducer

ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition.

An analog-AI chip for energy-efficient speech recognition and transcription

Savitar: an intelligent sign language translation approach for deafness and dysphonia in the COVID-19 era

ASR - VLSP 2021: Automatic Speech Recognition with Blank Label Re-weighting

An investigation of neural uncertainty estimation for target speaker extraction equipped RNN transducer

A 2D Convolutional Gating Mechanism for Mandarin Streaming Speech Recognition

Research on automatic speech recognition based on a DL–T and transfer learning

Streaming End-to-End Multi-Talker Speech Recognition

Accelerating RNN Transducer Inference via Adaptive Expansion Search