The recent introduction of neural networks to speech separation has dramatically improved separation performance. This paper presents a novel psychoacoustic approach to speech source separation in anechoic conditions, based on semantic segmentation of the interaural spectrograms of the audio mixtures. We train two separate U-Nets (a convolutional neural network architecture specialized for semantic segmentation) on the interaural level difference (ILD) and interaural phase difference (IPD) spectrograms of a single source. After training, these U-Nets predict the class of each time-frequency (TF) unit of the interaural spectrogram of the audio mixture. The ILD and IPD soft masks obtained from the two U-Nets are combined by a novel scheme that exploits the relative strength of the interaural cues in different frequency bands. The results show improved separation over two state-of-the-art machine learning source separation systems that use the same interaural cues: an average improvement of 7.32 dB in signal-to-distortion ratio (SDR) and 0.3 points in short-time objective intelligibility (STOI) over the degenerate unmixing estimation technique (DUET) algorithm, and a 2.51 dB improvement in SDR with comparable intelligibility over the model-based expectation-maximization source separation and localization (MESSL) algorithm.
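To make the frequency-band-weighted mask combination concrete, the following is a minimal sketch in NumPy. It is not the paper's actual scheme, which the abstract does not specify; it assumes a duplex-theory weighting in which IPD cues dominate below a crossover frequency (where phase is unambiguous) and ILD cues dominate above it. The function name, the 1.5 kHz crossover, and the transition width are all illustrative assumptions.

```python
import numpy as np

def combine_interaural_masks(ild_mask, ipd_mask, freqs, crossover_hz=1500.0):
    """Combine ILD and IPD soft masks with frequency-dependent weights.

    Hypothetical sketch only: assumes a duplex-theory weighting in which
    IPD is trusted below crossover_hz and ILD above it.

    ild_mask, ipd_mask : (n_freq, n_frames) soft masks with values in [0, 1]
    freqs              : (n_freq,) center frequency of each STFT bin in Hz
    """
    # Smooth logistic transition from IPD-weighted to ILD-weighted bins;
    # the 200 Hz scale controls how sharp the crossover is.
    ild_weight = 1.0 / (1.0 + np.exp(-(freqs - crossover_hz) / 200.0))
    ild_weight = ild_weight[:, None]  # broadcast weight over time frames
    return ild_weight * ild_mask + (1.0 - ild_weight) * ipd_mask
```

Under these assumptions, the combined soft mask would be applied elementwise to the mixture's short-time Fourier transform, and the masked spectrogram inverse-transformed to recover the target source.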