Switching Independent Vector Analysis and its Extension to Blind and Spatially Guided Convolutional Beamforming Algorithms

Tomohiro Nakatani, Rintaro Ikeshita, Shoko Araki, Hiroshi Sawada, Keisuke Kinoshita, Naoyuki Kamo

doi:10.1109/taslp.2022.3155271

Abstract

This paper develops a framework that can accurately perform denoising, dereverberation, and source separation using a relatively small number of microphones. It has been empirically confirmed that Independent Vector Analysis (IVA) can blindly separate $N$ sources from their sound mixture even with diffuse noise when a sufficiently large number ($=M$) of microphones are available (i.e., $M \gg N$). However, the estimation accuracy is seriously degraded when the number of microphones, or more specifically $M-N$ $(\geq 0)$, decreases. To overcome this IVA limitation, we propose switching IVA (swIVA) in this paper. With swIVA, the time frames of an observed signal with time-varying characteristics are clustered into several groups, each of which can be well handled by IVA with a small number of microphones, and thus accurate estimation can be achieved by individually applying IVA to each group. Conventionally, a switching mechanism was introduced into a Minimum-Variance Distortionless Response (MVDR) beamformer, and this paper extends the mechanism to work with a blind source separation algorithm. 
To incorporate dereverberation capability, we further extend swIVA to a blind Convolutional beamforming algorithm (swCIVA) that integrates swIVA and switching Weighted Prediction Error-based dereverberation (swWPE) in a jointly optimal way. With swCIVA, two different time-varying characteristics of an observed signal are captured for dereverberation and source separation to achieve effective estimation. We show that both swIVA and swCIVA can be optimized effectively based on blind signal processing, and their performance can be further improved using a spatial guide for initialization. Experiments demonstrate that both the proposed methods largely outperformed conventional IVA and its convolutional beamforming extension (CIVA) in terms of objective signal quality and automatic speech recognition scores when using relatively few microphones.
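The switching idea described above, where each time frame belongs to a group and each group has its own separation filter, can be illustrated with a minimal sketch. It covers only the final application step for a single frequency bin and assumes that the frame-to-group assignments and the per-group demixing matrices have already been estimated; how swIVA estimates them is not shown here. The function name `apply_switching_demixing`, the hard (one group per frame) assignment, and the toy values are illustrative assumptions, not the authors' implementation.

```python
def apply_switching_demixing(frames, states, demixers):
    """Apply a group-dependent demixing matrix to each frame of one frequency bin.

    frames:   list of T observation vectors, each a list of M complex values
    states:   list of T integers; states[t] is the assumed group index of frame t
    demixers: list of K matrices (N rows x M columns), one per group
    """
    outputs = []
    for x, k in zip(frames, states):
        W = demixers[k]  # pick the filter that belongs to this frame's group
        # Matrix-vector product: one output value per row of W.
        outputs.append([sum(w * xm for w, xm in zip(row, x)) for row in W])
    return outputs


# Toy example (hypothetical values): 2 microphones, 2 outputs, 2 groups.
frames = [[1 + 0j, 2 + 0j], [3 + 0j, 4 + 0j], [5 + 0j, 6 + 0j]]
states = [0, 1, 0]
demixers = [
    [[1, 0], [0, 1]],  # group 0: identity, channels pass through
    [[0, 1], [1, 0]],  # group 1: swaps the two channels
]
separated = apply_switching_demixing(frames, states, demixers)
```

In this toy case the second frame is processed with a different matrix from the first and third, which is the sense in which the filter "switches" over time. A real system would repeat this for every frequency bin.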

Full Text