An investigation of neural uncertainty estimation for target speaker extraction equipped RNN transducer

Jiatong Shi, Chunlei Zhang, Chao Weng, Shinji Watanabe, Meng Yu, Dong Yu

doi:10.1016/j.csl.2021.101327

Abstract

Target-speaker speech recognition aims to recognize the speech of an enrolled speaker in an environment with background noise and interfering speakers. This study presents a joint framework that combines time-domain target speaker extraction with a recurrent neural network transducer (RNN-T) for speech recognition. To alleviate the adverse effects of the residual noise and artifacts that the target speaker extraction module introduces into the speech recognition back-end, we explore training the target speaker extraction module and the RNN-T jointly. We find that a multi-stage training strategy, which pre-trains and fine-tunes each module before joint training, is crucial for stabilizing the training process. In addition, we propose a novel neural uncertainty estimation that leverages useful information from the target speaker extraction module (i.e., speaker identity uncertainty and speech enhancement uncertainty) to further improve the back-end speech recognizer. Compared to a recognizer with a target speech extraction front-end, our experiments show that joint training and the neural uncertainty module reduce the character error rate (CER) by 7% and 17% relative, respectively, on multi-talker simulation data. The multi-condition experiments indicate that our method can reduce CER by 9% relative in the noisy condition without losing performance in the clean condition. We also observe consistent improvements in a further evaluation on real-world data based on vehicular speech.
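As a reading aid for the relative CER figures quoted in the abstract, the following is a minimal sketch of how a character error rate and a relative reduction are conventionally computed. It is not taken from the paper: the function names (`edit_distance`, `cer`, `relative_reduction`) are hypothetical, and the numeric inputs in the usage lines are made-up illustrations, not reported results.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error rate with respect to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical values for illustration only (not figures from the paper).
example_cer = cer("abcd", "abxd")                    # one substitution in four characters
example_gain = relative_reduction(0.20, 0.166)       # baseline CER 20.0%, improved CER 16.6%
```

Under this convention, a "17% relative" reduction means the improved CER is 83% of the baseline CER, regardless of the absolute values involved.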

Journal: Computer Speech & Language
Publication Date: Dec 2, 2021
Citations: 1

Similar Papers

Improving RNN Transducer with Target Speaker Extraction and Neural Uncertainty Estimation
Jiatong Shi ... Meng Yu
06 Jun 2021

Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers
Juntae Kim ... Jeehye Lee
18 Sep 2022

Self-attention Aligner: A Latency-control End-to-end Model for ASR Using Self-attention Network and Chunk-hopping
Linhao Dong ... Feng Wang
01 May 2019

Research on automatic speech recognition based on a DL–T and transfer learning
...
工程科学学报 | VOL. 43
26 Mar 2021
