Abstract

The recent introduction of neural networks to speech separation has dramatically improved separation performance. This paper presents a novel psychoacoustic approach to speech source separation in anechoic conditions, using semantic segmentation of the interaural spectrograms of the audio mixtures. We train two separate U-Nets (a neural network architecture specialized for semantic segmentation) on the interaural level difference (ILD) and interaural phase difference (IPD) spectrograms of a single source. After training, these U-Nets predict the class of each time-frequency (TF) unit of the interaural spectrogram of the audio mixture. The ILD and IPD soft masks obtained from these U-Nets are combined by a novel scheme that exploits the relative strength of the interaural cues in different frequency bands. The results show improved separation over two state-of-the-art machine learning source separation systems that use the same interaural cues: an average improvement of 7.32 dB in signal-to-distortion ratio (SDR) and 0.3 points in short-time objective intelligibility (STOI) over the degenerate unmixing estimation technique (DUET), and a 2.51 dB improvement in SDR, with comparable intelligibility, over the model-based expectation-maximization source separation and localization (MESSL) algorithm.
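
As a rough illustration of the feature extraction and mask-combination steps described above, the Python sketch below computes ILD and IPD spectrograms from a stereo mixture and blends two soft masks with a frequency-dependent weight. The helper names (interaural_spectrograms, combine_masks), the sigmoid crossover near 1.5 kHz motivated by duplex theory, and the random stand-in masks are illustrative assumptions, not the paper's exact formulation.

    import numpy as np
    from scipy.signal import stft

    def interaural_spectrograms(left, right, fs, nfft=1024):
        # Short-time Fourier transforms of the left and right channels.
        f, _, L = stft(left, fs, nperseg=nfft)
        _, _, R = stft(right, fs, nperseg=nfft)
        eps = 1e-8  # avoid division by zero and log of zero
        # ILD: level difference in dB for each time-frequency (TF) unit.
        ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
        # IPD: phase difference in radians for each TF unit.
        ipd = np.angle(L * np.conj(R))
        return f, ild, ipd

    def combine_masks(mask_ild, mask_ipd, freqs, crossover_hz=1500.0):
        # Duplex theory: phase cues dominate at low frequencies, level
        # cues at high frequencies. A smooth sigmoid crossover is an
        # assumption here; the paper's actual weighting may differ.
        w = 1.0 / (1.0 + np.exp(-(freqs - crossover_hz) / 200.0))
        w = w[:, None]  # broadcast per-frequency weight over time frames
        return w * mask_ild + (1.0 - w) * mask_ipd

    # Usage with placeholder signals and stand-in U-Net soft masks:
    fs = 16000
    left, right = np.random.randn(fs), np.random.randn(fs)
    f, ild, ipd = interaural_spectrograms(left, right, fs)
    mask_ild = np.random.rand(*ild.shape)  # would come from the ILD U-Net
    mask_ipd = np.random.rand(*ipd.shape)  # would come from the IPD U-Net
    mask = combine_masks(mask_ild, mask_ipd, f)

The combined soft mask would then be applied to the mixture spectrogram before inverse transformation to recover the target source.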
