Abstract

Speech separation is the task of extracting a target speech signal from a mixture. Speaker-independent multi-talker speech separation is particularly challenging because both the target and the interfering speech in the mixture are unpredictable. Conventionally, speech separation has been treated as a signal processing problem, but it has recently been formulated as a deep learning problem in which discriminative patterns of speech are learned from training data. In this paper, we treat the ideal binary mask (IBM) as a supervised binary classification training target and estimate it with fully connected deep neural networks (DNNs) for single-channel speaker-independent multi-talker speech separation. The trained DNNs estimate the IBM training target, with the mean square error (MSE) serving as the objective cost function. Standard backpropagation and Monte-Carlo dropout regularization are used to improve generalization and reduce overfitting during training. The estimated mask is applied to the mixture to obtain the separated target speech. We also address the over-smoothing problem by equalizing the spectral variances of the estimated and clean speech features. Experimental results under various evaluation conditions show that the proposed method outperforms competing methods in terms of the Perceptual Evaluation of Speech Quality (PESQ), segmental SNR (SNRSeg), short-time objective intelligibility (STOI), normalized frequency-weighted SNRSeg (nFwSNRSeg), and HIT-FA rate.
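For reference, the IBM training target is conventionally defined over time-frequency units from the premixed target and interference signals. The sketch below gives the standard formulation with a local criterion LC (commonly 0 dB); the abstract does not state the exact threshold used in this work.

% Standard IBM definition (assumed formulation; LC is the local
% SNR criterion, and S(t,f), N(t,f) are the STFT representations
% of the target and interference signals, respectively).
\[
\mathrm{IBM}(t,f) =
\begin{cases}
1, & 10\log_{10}\!\left(\dfrac{|S(t,f)|^{2}}{|N(t,f)|^{2}}\right) > \mathrm{LC},\\[6pt]
0, & \text{otherwise}.
\end{cases}
\]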
