Abstract
This letter describes a time-varying extension of independent vector analysis (IVA) based on the normalizing flow (NF), called NF-IVA, for determined blind source separation of multichannel audio signals. As in IVA, NF-IVA estimates demixing matrices that transform mixture spectra to source spectra in the complex-valued spatial domain such that the likelihood of those matrices for the mixture spectra is maximized under some non-Gaussian source model. While IVA performs a time-invariant bijective linear transformation, NF-IVA performs a series of time-varying bijective linear transformations (flow blocks) adaptively predicted by neural networks. To regularize such transformations, we introduce a soft volume-preserving (VP) constraint. Given mixture spectra, the parameters of NF-IVA are optimized by gradient descent with backpropagation in an unsupervised manner. Experimental results show that NF-IVA successfully performs speech separation in reverberant environments with different numbers of speakers and microphones and that NF-IVA with the VP constraint outperforms NF-IVA without it, standard IVA with iterative projection, and improved IVA with gradient descent.
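To make the demixing formulation in the abstract concrete, here is a rough sketch in generic IVA notation; the symbols, the exact scaling constants, and the form of the source prior are illustrative and may differ from the paper's own notation.

```latex
% Illustrative sketch in generic IVA notation (symbols may differ from the paper's).
% Standard IVA: one time-invariant demixing matrix W_f per frequency bin f,
% estimated by maximizing the log-likelihood of the mixture under a
% non-Gaussian source prior (constants and scaling conventions omitted).
\[
  \mathbf{y}_{ft} = \mathbf{W}_f \mathbf{x}_{ft}, \qquad
  \mathcal{L}\bigl(\{\mathbf{W}_f\}\bigr)
    = \sum_{t}\sum_{n} \log p(\mathbf{y}_{nt})
    + T \sum_{f} \log \bigl|\det \mathbf{W}_f\bigr|^{2},
\]
% where x_{ft} is the M-channel mixture at frequency f and frame t, and
% y_{nt} collects the n-th separated source across all frequencies at frame t.
% NF-IVA, as described in the abstract, replaces the single time-invariant
% transform with a series of K time-varying transforms (flow blocks), each
% predicted by a neural network:
\[
  \mathbf{y}_{ft}
    = \mathbf{W}^{(K)}_{ft} \cdots \mathbf{W}^{(1)}_{ft}\, \mathbf{x}_{ft},
\]
% with a soft volume-preserving penalty encouraging |det W^{(k)}_{ft}| ~ 1.
```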
Highlights
The widespread use of devices equipped with many microphones, e.g., smart speakers and smartphones, demands audio source separation methods that can effectively exploit the spatial information captured in multichannel recordings [1], [2].
We show that standard independent vector analysis (IVA) can be interpreted as a simple normalizing flow (NF) with a single flow step and extended to a general NF-IVA based on a more expressive NF with a series of flow steps.
As a generalization of IVA, we propose NF-IVA, which uses more than one flow step grouped into flow blocks performing time-varying transformations (see the sketch below).
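The following minimal PyTorch sketch illustrates the flow-block structure described in the highlights. It is a structural illustration, not the authors' implementation: the names FlowBlock and nf_iva_loss, the network architecture, the log-magnitude input features, the input layout (frequencies x frames x microphones), and the exact form of the volume-preserving penalty are all assumptions made for this example.

```python
# Minimal structural sketch of NF-IVA-style flow blocks (illustrative only).
# Assumes a complex STFT tensor x of shape (F, T, M): frequencies x frames x mics.
import torch
import torch.nn as nn


class FlowBlock(nn.Module):
    """Predicts a per-frame, per-frequency demixing matrix from its input signal."""

    def __init__(self, n_freq: int, n_mic: int, hidden: int = 128):
        super().__init__()
        self.n_mic = n_mic
        # Small network: log-magnitude features -> real/imag entries of the matrix.
        self.net = nn.Sequential(
            nn.Linear(n_freq * n_mic, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq * n_mic * n_mic * 2),
        )

    def forward(self, y: torch.Tensor):
        # y: complex tensor of shape (F, T, M)
        F, T, M = y.shape
        feats = torch.log1p(y.abs()).permute(1, 0, 2).reshape(T, F * M)   # (T, F*M)
        w = self.net(feats).reshape(T, F, M, M, 2)
        W = torch.complex(w[..., 0], w[..., 1]).permute(1, 0, 2, 3)       # (F, T, M, M)
        # Start near the identity so the transform is close to bijective early on.
        W = W * 0.1 + torch.eye(M, device=W.device).to(W.dtype)
        y_out = torch.einsum("ftmn,ftn->ftm", W, y)
        logdet = torch.log(torch.linalg.det(W).abs() + 1e-8)              # (F, T)
        return y_out, logdet


def nf_iva_loss(blocks, x: torch.Tensor, vp_weight: float = 1.0) -> torch.Tensor:
    """Negative log-likelihood sketch: non-Gaussian source contrast, log-det of
    the stacked transforms, and a soft volume-preserving (VP) penalty."""
    y, total_logdet = x, torch.zeros(x.shape[0], x.shape[1], device=x.device)
    for block in blocks:
        y, logdet = block(y)
        total_logdet = total_logdet + logdet
    # Spherical (Laplace-like) contrast: per-frame L2 norm of each output channel
    # across frequencies, a common non-Gaussian prior choice in IVA.
    contrast = y.abs().pow(2).sum(dim=0).sqrt().sum()
    nll = contrast - 2.0 * total_logdet.sum()          # factor 2: complex-valued Jacobian
    vp_penalty = vp_weight * total_logdet.pow(2).mean()  # softly encourage |det| ~ 1
    return nll + vp_penalty
```

In the spirit of the paper's unsupervised setting, a training loop would simply evaluate nf_iva_loss on the mixture itself and update the block parameters by gradient descent with backpropagation; no clean source references are required.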
Summary
The widespread use of devices equipped with many microphones, e.g., smart speakers and smartphones, demands audio source separation methods that can effectively exploit the spatial information captured in multichannel recordings [1], [2]. Such methods are useful for downstream applications, e.g., automatic speech recognition (ASR) and human listening. While supervised approaches based on deep neural networks (DNNs) [3]–[5] have been shown to work well, unsupervised separation techniques, a.k.a. blind source separation (BSS), are potentially better suited to handling unseen, unknown environments.