Abstract

Speaker adaptation and speaker change detection have both been studied extensively to improve automatic speech recognition (ASR). In many cases, these two problems are investigated separately: speaker change detection is performed first to obtain single-speaker regions, and speaker adaptation is then carried out on the derived speaker segments for improved ASR. However, in an online setting, we want to achieve both goals in a single pass. In this study, we propose a neural network architecture that learns a speaker embedding from which it can perform both speaker adaptation for ASR and speaker change detection. The proposed speaker embedding is computed using self-attention over an auxiliary network attached to a main ASR network. ASR adaptation is then performed by subtracting, from the main network activations, a segment-dependent affine transformation of the learned speaker embedding. In experiments on a broadcast news dataset and the Switchboard conversational dataset, we test our system on utterances containing a speaker change point and show that the proposed method achieves significantly better performance compared with the unadapted main network (10–14% relative reduction in word error rate (WER)). The proposed architecture also outperforms three different speaker segmentation methods followed by ASR (around 10% relative reduction in WER).
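The adaptation step described above can be written compactly: if s is a speaker embedding pooled by self-attention over the auxiliary network's outputs, the adapted activations are h_t - (Ws + b), applied per segment. The sketch below illustrates this step only under stated assumptions; the module name, tensor shapes, and the use of PyTorch are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SubtractiveAdaptation(nn.Module):
    """Hypothetical module sketching the adaptation step from the abstract:
    subtract an affine transform of a self-attentive speaker embedding
    from the main ASR network's hidden activations."""

    def __init__(self, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(emb_dim, 1)            # attention scorer over auxiliary outputs
        self.affine = nn.Linear(emb_dim, hidden_dim)  # segment-dependent affine transform W s + b

    def forward(self, hidden: torch.Tensor, aux_out: torch.Tensor) -> torch.Tensor:
        # hidden:  (batch, time, hidden_dim) -- main-network activations
        # aux_out: (batch, time, emb_dim)    -- auxiliary-network outputs
        weights = torch.softmax(self.score(aux_out), dim=1)  # attention over time, (batch, time, 1)
        s = (weights * aux_out).sum(dim=1)                   # pooled speaker embedding, (batch, emb_dim)
        return hidden - self.affine(s).unsqueeze(1)          # subtract the transformed embedding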

Highlights

  • As in many machine learning applications, automatic speech recognition (ASR) performance degrades on unseen data, especially on inputs from unseen speakers.

  • Another study [17] that performs ASR on recordings with speaker changes uses an i-vector memory learned from the training data, together with the read mechanism of a neural Turing machine, to extract i-vector-like embeddings from the memory. This is an unsupervised adaptation strategy that allows frame-level adaptation and can handle speaker changes. It differs from the proposal in this paper primarily in that (1) we propose training the speaker embedding with an ASR loss function, rather than a loss function that incorporates information about speaker identity, and (2) despite the lack of speaker identity information during training, the proposed method makes it easy to obtain speaker change points at test time by comparing the similarities between segment embeddings at the output of the auxiliary network (see the sketch after this list).

  • We have presented a principled approach to designing an auxiliary network architecture that can detect speaker changes and adapt to different speakers, instead of using a speaker segmentation system followed by separate speaker adaptation.
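As a concrete illustration of the change-detection idea in the second highlight, the sketch below hypothesizes a speaker change wherever embeddings of consecutive segments are dissimilar. The function name, the choice of cosine similarity, and the threshold value are assumptions for illustration; the highlight states only that similarities between segments at the auxiliary network's output are compared.

```python
import torch
import torch.nn.functional as F

def detect_speaker_changes(seg_emb: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Hypothetical test-time change detector. seg_emb holds one speaker
    embedding per segment, shape (num_segments, emb_dim); the 0.5
    threshold is an illustrative assumption, not a value from the paper."""
    sims = F.cosine_similarity(seg_emb[:-1], seg_emb[1:], dim=1)  # (num_segments - 1,)
    return (sims < threshold).nonzero(as_tuple=True)[0] + 1       # indices of segments starting a new speaker
```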

Summary

Introduction

As in many machine learning applications, automatic speech recognition (ASR) performance degrades on unseen data, especially on inputs from unseen speakers. This is largely due to the significant acoustic variation found in speech signals produced by different individuals, even when they speak the same words. Physical differences between individuals, such as vocal tract length, and idiolectal differences, such as region and social grouping, affect the way we speak. These factors contribute to changes in prosody and segmental articulation, along with other variations.

