Abstract

A speaker extraction algorithm seeks to extract the target speaker's speech from a multi-talker speech mixture. Prior studies have focused mostly on speaker extraction from highly overlapped multi-talker speech mixtures. However, in natural speech communication the target-interference speaker overlapping ratio can vary over a wide range, from 0% to 100%, and the target speaker may even be absent from the mixture; we describe speech mixtures in such universal multi-talker scenarios as "general speech mixtures". A speaker extraction algorithm requires an auxiliary reference, such as a video recording or a pre-recorded speech sample, to form top-down auditory attention on the target speaker. We advocate that a visual cue, i.e., lip movement, is more informative than an audio cue, i.e., pre-recorded speech, as the auxiliary reference for disentangling the target speaker from a general speech mixture. In this paper, we propose a universal speaker extraction network with a visual cue that works for all multi-talker scenarios. In addition, we propose a scenario-aware differentiated loss function for network training, which balances the network's performance over different target-interference speaker pairing scenarios. Experimental results show that our proposed method outperforms various competitive baselines on general speech mixtures in terms of signal fidelity.
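The abstract does not spell out the scenario-aware differentiated loss. The sketch below illustrates one plausible realization, assuming PyTorch; the names si_sdr and scenario_aware_loss and the weight alpha are hypothetical, not taken from the paper. The idea it demonstrates: use a scale-invariant SDR objective when the target speaker is present, and switch to a residual-energy penalty when the target is absent, since SI-SDR is undefined against a silent reference.

```python
import torch

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB (higher is better)."""
    target = target - target.mean(dim=-1, keepdim=True)
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target signal.
    scale = (estimate * target).sum(dim=-1, keepdim=True) / (
        (target ** 2).sum(dim=-1, keepdim=True) + eps)
    projection = scale * target
    noise = estimate - projection
    return 10 * torch.log10(
        (projection ** 2).sum(dim=-1) / ((noise ** 2).sum(dim=-1) + eps))

def scenario_aware_loss(estimate, target, target_active, alpha=0.5):
    """Hypothetical differentiated loss (an assumption, not the paper's
    exact formulation):
    - target present: maximize SI-SDR of the extracted speech;
    - target absent: suppress residual output energy instead."""
    if target_active:
        return -si_sdr(estimate, target).mean()
    # Log-energy penalty; the +1.0 keeps the loss finite (0 at zero energy).
    return alpha * 10 * torch.log10(
        (estimate ** 2).sum(dim=-1) + 1.0).mean()
```

Weighting the two terms differently per scenario is one way to balance performance across target-interference pairing conditions, which is the stated goal of the differentiated loss.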
