Abstract

This work presents a two-stage speech source separation algorithm based on combined models of interaural cues and spatial covariance that exploit knowledge of source locations estimated from video. In the first, pre-processing stage, late reverberant speech components are suppressed with a spectral subtraction rule to dereverberate the observed mixture. In the second stage, the binaural spatial parameters, namely the interaural phase difference (IPD) and the interaural level difference (ILD), together with the spatial covariance, are modeled in the short-time Fourier transform (STFT) domain to assign individual time-frequency (TF) units to each source. The parameters of these probabilistic models and the TF regions assigned to each source are updated with the expectation-maximization (EM) algorithm. The algorithm generates TF masks that are used to reconstruct the individual speech sources. Objective results, in terms of the signal-to-distortion ratio (SDR) and the perceptual evaluation of speech quality (PESQ), confirm that the proposed multimodal method with pre-processing is a promising approach to source separation in highly reverberant rooms.
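
For illustration, below is a minimal single-channel Python sketch of the two-stage structure described above: spectral-subtraction dereverberation followed by TF-mask reconstruction. The late-reverberation estimate (a delayed, scaled copy of the observed power spectrum), the parameter names (alpha, delay_frames), and the function names are assumptions made for exposition; the paper's exact subtraction rule and the EM-derived interaural-cue/spatial-covariance masks are not reproduced here.

    import numpy as np
    from scipy.signal import stft, istft

    def dereverberate(x, fs, nperseg=512, alpha=0.9, delay_frames=8):
        """Stage 1: suppress late reverberation by spectral subtraction.

        The late-reverberant power is approximated by a delayed, scaled
        copy of the observed power spectrum (a common simplification,
        assumed here rather than taken from the paper).
        """
        _, _, X = stft(x, fs, nperseg=nperseg)
        power = np.abs(X) ** 2
        # Delayed, scaled power spectrum as the late-reverberation estimate.
        late = np.zeros_like(power)
        late[:, delay_frames:] = alpha * power[:, :-delay_frames]
        # Spectral subtraction with a floor to avoid negative power.
        gain = np.sqrt(np.maximum(power - late, 1e-3 * power) / (power + 1e-12))
        _, x_derev = istft(gain * X, fs, nperseg=nperseg)
        return x_derev

    def reconstruct_sources(x, fs, masks, nperseg=512):
        """Stage 2 (output side): reconstruct each source from its TF mask.

        `masks` is a list of arrays, one per source, matching the STFT
        shape; in the full method these would be the per-source posteriors
        produced by the EM algorithm over the IPD/ILD and spatial
        covariance models.
        """
        _, _, X = stft(x, fs, nperseg=nperseg)
        return [istft(m * X, fs, nperseg=nperseg)[1] for m in masks]

In the actual method the masks are estimated from the binaural (two-channel) STFT, with the video-derived source locations initializing the model parameters; the single-channel simplification above only shows how dereverberation and masking chain together.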
