Target Speaker Verification With Selective Auditory Attention for Single and Multi-Talker Speech

Chenglin Xu, Jibin Wu, Wei Rao, Haizhou Li

doi:10.1109/taslp.2021.3100682

Abstract

Speaker verification has been studied mostly under the single-talker condition, and its performance degrades in the presence of interference speakers. Inspired by work on target speaker extraction, e.g., SpEx, we propose a unified speaker verification framework for both single- and multi-talker speech that is able to pay selective auditory attention to the target speaker. This target speaker verification (tSV) framework jointly optimizes a speaker attention module and a speaker representation module via multi-task learning. We study four different target speaker embedding schemes under the tSV framework. The experimental results show that all four target speaker embedding schemes significantly outperform other competitive solutions for multi-talker speech. Notably, the best tSV speaker embedding scheme achieves 76.0% and 55.3% relative improvements in equal error rate (EER) over the baseline system for 2-talker speech on the WSJ0-2mix-extr and Libri2Mix corpora, respectively, while the performance of tSV for single-talker speech is on par with that of a traditional speaker verification system trained and evaluated under the same single-talker condition.
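The abstract reports results as equal error rate (EER) and as relative improvement over a baseline. As a minimal sketch of how these two quantities are commonly computed (not the authors' evaluation code), the Python below sweeps a decision threshold over trial scores and takes the point where the false-acceptance and false-rejection rates are closest. The function names `equal_error_rate` and `relative_improvement`, and all scores and EER values in the usage lines, are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold.
    FRR = fraction of target trials rejected,
    FAR = fraction of non-target trials accepted.
    The EER is taken at the threshold where |FAR - FRR| is smallest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_improvement(baseline_eer, system_eer):
    """Relative EER reduction over a baseline, in percent."""
    return 100.0 * (baseline_eer - system_eer) / baseline_eer


# Hypothetical trial scores (illustration only, not data from the paper).
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)

# Hypothetical baseline and system EERs, in percent.
gain = relative_improvement(10.0, 2.4)
```

With these toy scores, one target trial falls below and one non-target trial lies at or above the threshold 0.6, so both error rates equal 0.25 there. Evaluation toolkits usually interpolate between thresholds rather than picking the nearest one, so values on real trial lists may differ slightly from this sketch.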

Highlights

Traditional speaker verification (SV) methods, such as i-vector [1]–[3] with probabilistic linear discriminant analysis (PLDA) [4] and x-vector PLDA [5]–[7], assume that input speech is uttered by a single speaker
Under the same single-talker evaluation condition, the tSVFA method achieves the best performance among the four proposed approaches: 2.63% (EER), 0.325 (DCF08) and 0.505 (DCF10)
We present a unified target speaker verification framework tSV for both single- and multi-talker speech

Summary

Introduction

Traditional speaker verification (SV) methods, such as i-vector [1]–[3] with probabilistic linear discriminant analysis (PLDA) [4] and x-vector PLDA [5]–[7], assume that input speech is uttered by a single speaker. These methods degrade significantly in the presence of interference speakers. Speaker diarization seeks to answer the question 'who spoke when?' It segments multi-talker speech temporally into speaker turns and identifies speaker-overlapping segments [8]–[13]. Speaker diarization is helpful only if the speakers overlap sporadically; it fails when the speakers are heavily overlapped in time.

Methods

Results

Conclusion