Single-channel speech separation based on deep clustering with local optimization

Taotao Fu, Lili Guo, Ji Liang, Yan Wang, Ge Yu

doi:10.1109/icfsp.2017.8097058

Abstract

Single-channel separation of mixed speech from multiple speakers poses many challenges, such as modeling the temporal continuity of the speech signals while also improving per-frame separation performance. In this paper, we propose a separation method based on Deep Clustering with local optimization by an improved Non-Negative Matrix Factorization (NMF) combined with Factorial Conditional Random Fields (FCRF). First, the separated voices are obtained with a Deep Clustering model, which is trained with a Bi-directional Long Short-Term Memory (BLSTM) network and clusters similar features. Then, the separated voices are locally optimized, iteratively, by the improved NMF with K-means++ and the FCRF. The results show that the algorithm improves separation performance, satisfying both the local optimum of the speech signal on each frame and the continuity of the whole speech signal.
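The clustering stage described above can be illustrated with a minimal sketch. It assumes that a trained network has already produced one embedding vector per time-frequency bin, and it covers only the grouping of those embeddings into one binary mask per speaker. The BLSTM, the improved NMF, and the FCRF refinement are not reproduced. The function names (`kmeans_pp_init`, `kmeans`, `masks_from_embeddings`) and the array shapes are hypothetical, and applying K-means++ seeding to the embedding clustering is an assumption made for illustration, since the abstract mentions K-means++ only in connection with the NMF step.

```python
import numpy as np


def kmeans_pp_init(X, k, rng):
    """K-means++ seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest chosen center."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=-1).min(axis=1)
        total = d2.sum()
        probs = d2 / total if total > 0 else np.full(n, 1.0 / n)
        centers.append(X[rng.choice(n, p=probs)])
    return np.asarray(centers)


def assign(X, centers):
    """Index of the nearest center for every row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd iterations starting from K-means++ seeds."""
    rng = np.random.default_rng(seed)
    centers = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        labels = assign(X, centers)
        # Keep the old center if a cluster happens to become empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(X, centers), centers


def masks_from_embeddings(embeddings, n_frames, n_bins, n_speakers=2, seed=0):
    """Cluster per-bin embeddings of shape (n_frames * n_bins, D) and
    return binary masks of shape (n_speakers, n_frames, n_bins)."""
    labels, _ = kmeans(embeddings, n_speakers, seed=seed)
    grid = labels.reshape(n_frames, n_bins)
    return np.stack([(grid == s).astype(float) for s in range(n_speakers)])
```

Under these assumptions, multiplying each mask element-wise with the mixture magnitude spectrogram would give a first estimate of each speaker's spectrogram, which a later stage could then refine.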
