Abstract

We formulate a generic framework for blind source separation (BSS), which allows integrating data-driven spectro-temporal methods, such as deep clustering and deep attractor networks, with physically motivated probabilistic spatial methods, such as complex angular central Gaussian mixture models. The integrated model exploits the complementary strengths of the two approaches to BSS: the strong modeling power of neural networks, which, however, is based on supervised learning, and the ease of unsupervised learning of the spatial mixture models whose few parameters can be estimated on as little as a single segment of a real mixture of speech. Experiments are carried out on both artificially mixed speech and true recordings of speech mixtures. The experiments verify that the integrated models consistently outperform the individual components. We further extend the models to cope with noisy, reverberant speech and introduce a cross-domain teacher–student training where the mixture model serves as the teacher to provide training targets for the student neural network.

Highlights

  • Blind source separation addresses the problem of separating signal components that originate from different sources when only the mixture signal can be observed

  • In [177] we presented a different take on unsupervised mask estimation: the likelihood of the data under a complex angular central Gaussian mixture model (cACGMM) serves as the maximization criterion for training a neural network, and that network only provides the initialization for a single expectation maximization (EM) step of the cACGMM parameter estimation

  • Since this section covers a wide range of beamformers, different relative transfer functions (RTFs) or covariance matrix approximations and other variants, we aim at highlighting key findings concerning source separation and eliminating variants early on


Summary

Introduction

Blind source separation addresses the problem of separating signal components that originate from different sources when only the mixture signal can be observed. Work initially concentrated on instantaneous mixtures and was later extended to convolutive mixtures, i.e., acoustic conditions in which a room impulse response, caused by multi-path transmission in an acoustic enclosure, smears the source signals in time. For most of that time, blind source separation (BSS) systems were analyzed on their own. More recently, mainly due to improved performance of separation methods and improved robustness of acoustic models, researchers have started to address the more challenging problem of multi-speaker automatic speech recognition (ASR). BSS for human listeners poses its own challenges, such as the demand for low latency, the avoidance of audible artifacts, and the naturalness of the separation result.
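The distinction between instantaneous and convolutive mixtures can be illustrated with a minimal NumPy sketch. It is not taken from the paper: the function names `mix_instantaneous` and `mix_convolutive`, the array shapes, and the random, exponentially decaying "room impulse responses" are all assumptions made for illustration only.

```python
import numpy as np


def mix_instantaneous(sources, mixing_matrix):
    """Instantaneous mixture: each microphone sees a weighted sum of sources.

    sources:       (num_sources, num_samples)
    mixing_matrix: (num_mics, num_sources)
    """
    return mixing_matrix @ sources


def mix_convolutive(sources, impulse_responses):
    """Convolutive mixture: each source is filtered by one impulse response
    per microphone before the contributions are summed.

    sources:           (num_sources, num_samples)
    impulse_responses: (num_mics, num_sources, filter_length)
    """
    num_mics, num_sources, filter_length = impulse_responses.shape
    num_samples = sources.shape[1]
    mixture = np.zeros((num_mics, num_samples + filter_length - 1))
    for mic in range(num_mics):
        for src in range(num_sources):
            mixture[mic] += np.convolve(sources[src], impulse_responses[mic, src])
    return mixture


# Toy data (assumption): white-noise sources and random decaying filters,
# standing in for speech signals and measured room impulse responses.
rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 1000))
mixing_matrix = rng.standard_normal((3, 2))
decay = np.exp(-np.arange(64) / 16.0)
impulse_responses = rng.standard_normal((3, 2, 64)) * decay

instantaneous = mix_instantaneous(sources, mixing_matrix)
convolutive = mix_convolutive(sources, impulse_responses)
```

With a filter length of one tap, the convolutive model reduces to the instantaneous one, which is why the latter can be seen as a special case of the former.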
