Abstract

Speaker extraction aims to extract the speech signal of a target speaker from a mixture of two or more speakers. We propose a set of novel psychoacoustics-based loss functions, each of which can be used to optimize a stacked network of Bidirectional Long Short-Term Memory (BLSTM) layers to imitate the human auditory system, which has an extraordinary ability to perceive and separate speech signals. To this end, we incorporate Mel and Gammatone filter banks, as well as perceptual loudness and the power law of hearing, into the loss functions of the BLSTMs. Evaluation results on the Speech Separation Corpus (SSC) show that the proposed approach outperforms the baseline methods in terms of Short-Time Objective Intelligibility (STOI), Perceptual Evaluation of Speech Quality (PESQ), and Signal-to-Distortion Ratio (SDR), with an improvement of up to 0.276 in PESQ over the baselines.
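The exact loss definitions are given in the paper; as one illustration of the idea, a spectral loss can be compressed with the power law of hearing so that errors are weighted closer to perceived loudness. The following is a minimal NumPy sketch, assuming a magnitude-STFT front end and a compression exponent of 0.3 (both choices here are illustrative, not taken from the paper):

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=128):
    # Magnitude spectrogram via a Hann-windowed STFT (illustrative front end).
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def power_law_loss(est, ref, p=0.3):
    # Power-law-compressed spectral MSE: |S|^p approximates perceived
    # loudness (p ~ 0.3), so low-energy regions are not drowned out.
    E, R = stft_mag(est), stft_mag(ref)
    return float(np.mean((E ** p - R ** p) ** 2))
```

In a training loop, such a loss would be computed between the extracted signal and the clean reference of the target speaker; Mel or Gammatone filter banks could likewise be applied to the spectrogram before the compression step.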
