Abstract

Speaker-conditioned target speaker extraction aims to estimate the speech of a target speaker from a mixture of speakers using auxiliary information about that speaker. In this paper, we consider a single-channel target speaker extraction system consisting of a speaker embedder network and a speaker separator network. Instead of using standard long short-term memory (LSTM) cells in the separator network, we propose two variants of LSTM cells that are customized for speaker-conditioned target speaker extraction. The first variant customizes both the forget gate and the input gate of the LSTM cell. By simultaneously resetting and updating the cell state using the speaker embedding, it aims to retain only the features relevant to the target speaker and to disregard the interfering speakers. For the second variant, we introduce a new gate within the LSTM cell, referred to as the auxiliary-modulation gate. This gate modulates the information processing during cell state reset, aiming to learn the long-term and short-term discriminative features of the target speaker. In both unidirectional and bidirectional modes, experimental results on 2-speaker mixtures, 3-speaker mixtures, and noisy mixtures (containing 1, 2, or 3 speakers) show that both proposed variants of LSTM cells outperform standard LSTM cells for target speaker extraction, with the best performance obtained using the auxiliary-gated LSTM cells.
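To make the idea behind the first variant concrete, the following is a minimal sketch of one time step of an LSTM cell whose forget and input gates also receive the speaker embedding. The abstract does not give the gate equations, so the conditioning used here (concatenating the embedding to the inputs of those two gates) is only an assumption, and all names (`make_params`, `cell_step`, `e`) are hypothetical. The candidate and output gate follow the standard LSTM formulation.

```python
import math
import random

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def tanh(v):
    return [math.tanh(a) for a in v]

def affine(W, b, v):
    """Return W @ v + b for a list-of-rows matrix W."""
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]

def init(rows, cols, rng):
    """Small random weight matrix and zero bias."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return W, [0.0] * rows

def make_params(n_in, n_hid, n_emb, seed=0):
    rng = random.Random(seed)
    return {
        # Assumption: forget and input gates see [x, h, e] (embedding included).
        "f": init(n_hid, n_in + n_hid + n_emb, rng),
        "i": init(n_hid, n_in + n_hid + n_emb, rng),
        # Candidate and output gate see only [x, h], as in a standard LSTM.
        "g": init(n_hid, n_in + n_hid, rng),
        "o": init(n_hid, n_in + n_hid, rng),
    }

def cell_step(params, x, h, c, e):
    """One time step; x: input, h: hidden state, c: cell state, e: speaker embedding."""
    xh = x + h        # list concatenation
    xhe = xh + e
    f = sigmoid(affine(*params["f"], xhe))  # embedding-conditioned reset
    i = sigmoid(affine(*params["i"], xhe))  # embedding-conditioned update
    g = tanh(affine(*params["g"], xh))
    o = sigmoid(affine(*params["o"], xh))
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

params = make_params(n_in=3, n_hid=4, n_emb=2)
h, c = cell_step(params, [0.1, 0.2, 0.3], [0.0] * 4, [0.0] * 4, [1.0, -1.0])
```

In this sketch, the same mixture frame yields different cell states for different embeddings, which is the property the customized gates are meant to exploit. The auxiliary-modulation gate of the second variant is not sketched, since the abstract does not describe its form.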
