Time-frequency (T-F) masking is an effective method for stereo speech source separation. However, reliable estimation of the T-F mask from sound mixtures is a challenging task, especially when room reverberation is present in the mixtures. In this paper, we propose a new stereo speech separation system in which deep neural networks are used to generate a soft T-F mask for separation. More specifically, a deep neural network, composed of two sparse autoencoders and a softmax regression layer, is used to estimate the orientation of the dominant source at each T-F unit, based on low-level features such as the mixing vector (MV) and the interaural phase and level differences (IPD/ILD). The dataset for training the networks was generated by convolving binaural room impulse responses (RIRs) with clean speech signals positioned at different angles with respect to the sensors. With this training dataset, we use unsupervised learning to extract high-level features from the low-level features, and supervised learning to find the nonlinear mapping between the high-level features and the orientation of the dominant source. Using the trained networks, the probability that each T-F unit belongs to each source (target or interferer) can be estimated from the localization cues; these probabilities are then used to generate the soft mask for source separation. Experiments based on real binaural RIRs and the TIMIT dataset are provided to show the performance of the proposed system on reverberant speech mixtures, as compared with a recently proposed model-based T-F masking technique.
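To make the pipeline summarized above concrete, the following Python sketch illustrates the feature-extraction and soft-masking stages only. It is a minimal illustration under stated assumptions, not the authors' implementation: it computes ILD/IPD cues per T-F unit and weights each unit by the posterior probability that the target dominates it. The mixing-vector (MV) feature and the actual network (two stacked sparse autoencoders followed by softmax regression) are omitted; the classifier argument is a hypothetical stand-in for the trained DNN, assumed to expose a scikit-learn-style predict_proba.

import numpy as np
from scipy.signal import stft, istft

def ild_ipd_features(x_left, x_right, fs=16000, nperseg=1024):
    # STFT of each channel of the binaural mixture.
    _, _, Xl = stft(x_left, fs=fs, nperseg=nperseg)
    _, _, Xr = stft(x_right, fs=fs, nperseg=nperseg)
    eps = 1e-12
    # Interaural level difference (dB) and phase difference (rad)
    # at every T-F unit; two of the low-level localization cues.
    ild = 20.0 * np.log10((np.abs(Xl) + eps) / (np.abs(Xr) + eps))
    ipd = np.angle(Xl * np.conj(Xr))
    feats = np.stack([ild.ravel(), ipd.ravel()], axis=1)
    return feats, Xl, Xr

def separate(x_left, x_right, classifier, target_class,
             fs=16000, nperseg=1024):
    # 'classifier' is a hypothetical stand-in for the trained DNN;
    # it is assumed to return per-class (per-orientation) posteriors
    # for each T-F unit via a scikit-learn-style predict_proba.
    feats, Xl, Xr = ild_ipd_features(x_left, x_right, fs, nperseg)
    posteriors = classifier.predict_proba(feats)   # (units, classes)
    mask = posteriors[:, target_class].reshape(Xl.shape)
    # Soft mask: weight each T-F unit by the probability that the
    # target source dominates it, then resynthesize both channels.
    _, y_left = istft(mask * Xl, fs=fs, nperseg=nperseg)
    _, y_right = istft(mask * Xr, fs=fs, nperseg=nperseg)
    return y_left, y_right

In the system described above, the classifier would be the stacked sparse autoencoders with a softmax output, trained on features extracted from RIR-convolved TIMIT speech; any probabilistic classifier exposing the same interface fits this sketch.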