Abstract

This study tightly integrates complementary spectral and spatial features for deep learning-based multi-channel speaker separation in reverberant environments. The key idea is to localize individual speakers so that an enhancement network can be trained on spatial as well as spectral features to extract the speaker from an estimated direction and with specific spectral structures. The spatial and spectral features are designed such that the trained models are blind to the number of microphones and the microphone geometry. To determine the direction of the speaker of interest, we identify time-frequency (T-F) units dominated by that speaker and use only those units for direction estimation. The T-F unit-level speaker dominance is determined by a two-channel chimera++ network, which combines deep clustering and permutation invariant training at the objective-function level, and integrates spectral and interchannel phase patterns at the input-feature level. In addition, T-F masking-based beamforming is tightly integrated into the system by leveraging the magnitudes and phases produced by beamforming. Strong separation performance has been observed on reverberant talker-independent speaker separation, where reverberant speaker mixtures are separated using a random number of microphones arranged in an arbitrary linear-array geometry.
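To make the objective-level combination concrete, below is a minimal numpy sketch of a chimera++-style loss: a deep clustering term on T-F embeddings plus a permutation-invariant magnitude loss on estimated masks, mixed by a weight alpha. The function names, array shapes, the plain (unwhitened) form of the deep clustering term, and the value of alpha are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a chimera++-style combined objective (assumed
# shapes and helper names; not the authors' exact implementation).
import itertools
import numpy as np

def dc_loss(V, Y):
    """Deep clustering loss ||V V^T - Y Y^T||_F^2, expanded so the
    (TF x TF) affinity matrices are never formed explicitly.
    V: (TF, D) unit-norm embeddings; Y: (TF, C) one-hot speaker labels."""
    return (np.linalg.norm(V.T @ V, 'fro') ** 2
            - 2 * np.linalg.norm(V.T @ Y, 'fro') ** 2
            + np.linalg.norm(Y.T @ Y, 'fro') ** 2)

def pit_loss(est_masks, ref_mags, mix_mag):
    """Permutation invariant magnitude-approximation loss: score every
    speaker permutation and keep the one with the smallest total MSE.
    est_masks: (C, T, F) masks; ref_mags: (C, T, F); mix_mag: (T, F)."""
    C = est_masks.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(C)):
        err = sum(np.mean((est_masks[i] * mix_mag - ref_mags[j]) ** 2)
                  for i, j in enumerate(perm))
        best = min(best, err)
    return best

def chimera_loss(V, Y, est_masks, ref_mags, mix_mag, alpha=0.975):
    # Weighted mix of the two objectives; alpha is an assumed value.
    return alpha * dc_loss(V, Y) + (1 - alpha) * pit_loss(est_masks, ref_mags, mix_mag)
```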
