Abstract

Single-channel multi-speaker mixed speech separation poses several challenges, notably modeling the temporal continuity of the speech signal while also improving separation performance on each frame. This paper proposes a separation method based on Deep Clustering, with local optimization by an improved Non-Negative Matrix Factorization (NMF) combined with Factorial Conditional Random Fields (FCRF). First, the separated voices are obtained from a Deep Clustering model, which is trained with a Bi-directional Long Short-Term Memory (BLSTM) network and groups similar features by clustering. Then, the separated voices are locally optimized, iteratively, by the improved NMF with K-means++ and the FCRF. The results show that the algorithm improves separation performance, satisfying both the local optimum of the speech signal on each frame and the continuity of the whole speech signal.
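
The clustering stage described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the BLSTM has already produced one embedding vector per time-frequency bin (the usual Deep Clustering setup, but an assumption here), and it applies plain k-means with k-means++ seeding to those embeddings to obtain one binary mask per speaker. The abstract mentions K-means++ in connection with the improved NMF; using it to seed the embedding clustering is a substitution made only for illustration. The function names and array shapes are hypothetical, and the NMF and FCRF refinement steps are not shown.

```python
import numpy as np


def kmeans_pp_init(X, k, rng):
    """Pick k initial centers from the rows of X using k-means++ seeding."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen center.
        diff = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diff ** 2).sum(axis=-1).min(axis=1)
        total = d2.sum()
        # Fall back to uniform sampling if all points coincide with a center.
        probs = d2 / total if total > 0 else np.full(n, 1.0 / n)
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)


def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm with k-means++ seeding. Returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = kmeans_pp_init(X, k, rng)
    labels = np.zeros(X.shape[0], dtype=int)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Keep the old center if a cluster ends up empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


def masks_from_embeddings(emb, n_frames, n_bins, n_speakers, seed=0):
    """Cluster per-bin embeddings and return binary masks.

    emb is assumed to have shape (n_frames * n_bins, embedding_dim).
    The result has shape (n_speakers, n_frames, n_bins).
    """
    labels, _ = kmeans(emb, n_speakers, seed=seed)
    masks = np.stack([
        (labels == j).reshape(n_frames, n_bins) for j in range(n_speakers)
    ])
    return masks.astype(float)
```

Multiplying each mask element-wise with the mixture magnitude spectrogram would give a first estimate of each speaker, which a refinement stage could then optimize further. Because every bin is assigned to exactly one cluster, the masks sum to one at each time-frequency bin.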
