Abstract
Speaker adaptation and speaker change detection have both been studied extensively to improve automatic speech recognition (ASR). In many cases, these two problems are investigated separately: speaker change detection is implemented first to obtain single-speaker regions, and speaker adaptation is then performed on the derived speaker segments for improved ASR. However, in an online setting, we want to achieve both goals in a single pass. In this study, we propose a neural network architecture that learns a speaker embedding from which it can perform both speaker adaptation for ASR and speaker change detection. The proposed speaker embedding is computed using self-attention based on an auxiliary network attached to a main ASR network. ASR adaptation is then performed by subtracting, from the main network activations, a segment-dependent affine transformation of the learned speaker embedding. In experiments on a broadcast news dataset and the Switchboard conversational dataset, we test our system on utterances containing a change point and show that the proposed method achieves significantly better performance than the unadapted main network (10–14% relative reduction in word error rate (WER)). The proposed architecture also outperforms three different speaker segmentation methods followed by ASR (around 10% relative reduction in WER).
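To make the adaptation mechanism described above concrete, here is a minimal PyTorch sketch. All module names, dimensions, and layer choices are illustrative assumptions rather than the authors' exact architecture: an auxiliary network pools frame-level features into a speaker embedding via self-attention weights, and a main-network layer is adapted by subtracting an affine transformation of that embedding from its activations.

```python
# Minimal sketch of the adaptation idea; names and dimensions are assumptions.
import torch
import torch.nn as nn


class AuxiliarySpeakerEmbedder(nn.Module):
    """Pools frame-level features into one embedding using self-attention weights."""

    def __init__(self, feat_dim: int, emb_dim: int):
        super().__init__()
        self.attn_score = nn.Linear(feat_dim, 1)   # one attention score per frame
        self.project = nn.Linear(feat_dim, emb_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim)
        weights = torch.softmax(self.attn_score(frames), dim=1)  # (B, T, 1)
        pooled = (weights * frames).sum(dim=1)                   # (B, feat_dim)
        return self.project(pooled)                              # (B, emb_dim)


class AdaptedAcousticLayer(nn.Module):
    """Subtracts an affine transform of the speaker embedding from the
    main network's hidden activations (the adaptation step in the abstract)."""

    def __init__(self, hidden_dim: int, emb_dim: int):
        super().__init__()
        self.affine = nn.Linear(emb_dim, hidden_dim)  # segment-dependent shift

    def forward(self, hidden: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, hidden_dim); spk_emb: (B, emb_dim)
        shift = self.affine(spk_emb).unsqueeze(1)     # (B, 1, hidden_dim)
        return hidden - shift                         # speaker-normalized activations


# Toy usage with made-up dimensions.
embedder = AuxiliarySpeakerEmbedder(feat_dim=80, emb_dim=64)
adapter = AdaptedAcousticLayer(hidden_dim=512, emb_dim=64)
feats = torch.randn(2, 200, 80)      # two segments of acoustic features
hidden = torch.randn(2, 200, 512)    # activations from some main ASR layer
adapted = adapter(hidden, embedder(feats))
```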
Highlights
As in many machine learning applications, automatic speech recognition (ASR) performance degrades on unseen data, especially on inputs from unseen speakers.
Another study [17] that performs ASR on recordings with speaker change uses an i-vector memory learned from the training data, together with the read mechanism of a neural Turing machine, to extract i-vector-like embeddings from the memory. This is an unsupervised adaptation strategy that allows frame-level adaptation and can handle speaker changes. It differs from the proposal in this paper primarily in that (1) we propose training the speaker embedding using an ASR loss function, rather than a loss function incorporating any information about speaker identity, and (2) despite the lack of speaker identity information during training, the proposed method makes it easy to obtain speaker change points at test time by examining the similarities between segment embeddings at the end of the auxiliary network (a toy illustration follows these highlights).
We have presented a principled approach to designing an auxiliary network architecture that can detect speaker changes and adapt to different speakers, rather than requiring a separate speaker segmentation system followed by speaker adaptation.
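As a rough illustration of the change-detection idea mentioned above, the sketch below flags a speaker change whenever consecutive segment embeddings produced by an auxiliary network are dissimilar. The cosine-similarity measure and the 0.5 threshold are assumptions for illustration, not the paper's exact criterion.

```python
# Illustrative sketch (not the paper's exact procedure): hypothesize a speaker
# change between segments whose auxiliary-network embeddings are dissimilar.
import torch
import torch.nn.functional as F


def detect_changes(segment_embeddings: torch.Tensor, threshold: float = 0.5):
    """segment_embeddings: (num_segments, emb_dim) from the auxiliary network.
    Returns indices i where a change is hypothesized between segments i and i+1."""
    sims = F.cosine_similarity(segment_embeddings[:-1], segment_embeddings[1:], dim=-1)
    return (sims < threshold).nonzero(as_tuple=True)[0].tolist()


# Toy usage: four segment embeddings, with a deliberate shift after segment 1.
embs = torch.cat([torch.ones(2, 64), -torch.ones(2, 64)], dim=0)
print(detect_changes(embs))  # [1]
```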
Summary
As in many machine learning applications, automatic speech recognition (ASR) performance degrades on unseen data, especially on inputs from unseen speakers. This is largely due to the significant acoustic variation found in speech signals produced by different individuals, even when they speak the same words. Physical differences between individuals, such as vocal tract length, and idiolectal differences, such as region and social grouping, affect the way we speak. These factors contribute to changes in prosody and segmental articulation, along with other variations.