Target speaker extraction (TSE), which directly extracts the desired speech given enrollment utterances of the target speaker, has attracted increasing attention for its potential to address the cocktail-party problem. Time-domain methods have become the dominant approach for TSE and have made considerable progress, but their performance often degrades significantly under more realistic conditions. This paper proposes a time–frequency (T–F) domain approach, X-TF-GridNet, which uses complex spectrum mapping to estimate the real and imaginary (RI) components of the target speech. Specifically, the TF-GridNet block serves as the primary speaker extractor module. The proposed method has two key extensions. First, a U2-Net style network extracts robust fixed speaker embeddings that efficiently capture and represent target speaker information. Second, an adaptive embedding fusion (AEF) mechanism ensures that the target speaker information is used effectively, so that the backbone extractor focuses on the speech of interest. We also introduce a multi-task learning framework with two distinct loss functions, which explicitly improves both the discriminability of the speaker embeddings for the reference speech and the overall quality of the target speech. We conducted extensive ablation studies and quantitative comparisons against previous TSE methods on WSJ0-2mix and its noisy and reverberant counterparts. The proposed method achieved an SI-SDR of 19.7 dB with a moderate model size on the WSJ0-2mix dataset, and the SI-SDR improved to 20.7 dB with a larger model.
Experimental results demonstrated that, compared with existing time-domain approaches, the proposed method not only achieved competitive performance across multiple objective metrics but also reduced speaker confusion errors under more challenging conditions that include interference such as noise and reverberation.
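
For readers unfamiliar with the SI-SDR metric reported above, the following is a minimal sketch of the commonly used scale-invariant signal-to-distortion ratio, written with NumPy. It illustrates the standard definition only and is not the authors' evaluation code; the function name `si_sdr`, the `eps` stabilizer, and the mean removal step are assumptions made for this illustration.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals of equal length.

    Hypothetical helper for illustration; `eps` only guards against
    division by zero and log of zero.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Project the estimate onto the reference to get the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference

    # Everything not explained by the scaled reference counts as distortion.
    distortion = estimate - target

    ratio = (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)
```

Because the reference is rescaled by projection before the ratio is taken, multiplying the estimate by a constant gain leaves the score unchanged, which is the property that distinguishes SI-SDR from plain SDR.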