Graph Convolution-Based Deep Clustering for Speech Separation

Shan Qin, Ting Jiang, Sheng Wu, Xinran Zhao, Ning Wang

doi:10.1109/access.2020.2989833

Abstract

Deep clustering is a promising technique for speech separation, which is crucial to speech communication, acoustic target detection, acoustic enhancement, and speech recognition. In monophonic speech separation, the separation and generalization performance of a model degrades when the variety of the training data set is reduced. In this paper, we propose a deep clustering framework, named graph deep clustering (GDC), that constructs structured speech data with a graph convolutional network (GCN) to further improve separation performance. In particular, embedding features are transformed into graph-structured data, and the speech separation mask is obtained by clustering these graph-structured data. Graph structural information aggregates nodes within a class, which makes the feature representations more conducive to clustering. Experimental results demonstrate that the proposed scheme improves clustering performance: the SDR of the separated speech is improved by about 1.2 dB, and the clustering accuracy is improved by 15%. We also use perceptually motivated objective measures for the evaluation of audio source separation to score speech quality. The target speech quality and the overall perceptual score are improved by 10.7% compared with other speech separation algorithms.

Highlights

In the coming intelligent era, human-computer voice interaction technology has attracted wide attention
Speech separation is essential for human-computer voice interaction, as its performance has a significant impact on many intelligent speech applications such as acoustic target detection, acoustic enhancement, and speech recognition [1], [2]
We propose a graph deep clustering (GDC) based model to effectively shorten the in-class distance, which can improve speech quality and intelligibility

Summary

INTRODUCTION

In the coming intelligent era, human-computer voice interaction technology has attracted wide attention. Although speech separation can be treated as a classification problem, existing algorithms do not consider shortening the in-class distance of features to strengthen the feature representation, which can improve the generalization performance of the speech separation model. We propose a graph deep clustering (GDC) based model that effectively shortens the in-class distance, which can improve speech quality and intelligibility. In this model, structural information is fused into the embedding features through the graph convolution operation, which shortens the in-class distance and improves the clustering result. The method exploits the connection dependency graph of the features: by establishing long-term correlations between them, it shortens the in-class distance and improves the quality of the separated speech. Our focus throughout is on shortening the in-class distance to improve the generalization performance of the model.
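To make the idea of fusing structure into the embeddings concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each time-frequency bin already has an embedding vector, builds a k-nearest-neighbour graph over those vectors, applies a parameter-free row-normalized graph filter (a simplification of a learned GCN layer), and clusters the filtered features with a small k-means to obtain binary masks. The function names, the kNN graph construction, the choice of normalization, and the synthetic two-source data are all assumptions made for illustration.

```python
import numpy as np


def knn_adjacency(x, k):
    """Symmetric 0/1 adjacency linking each row of x to its k nearest rows."""
    sq = np.sum(x * x, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)  # squared distances
    np.fill_diagonal(d2, np.inf)                      # exclude self-matches
    idx = np.argsort(d2, axis=1)[:, :k]
    n = x.shape[0]
    a = np.zeros((n, n))
    a[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(a, a.T)                         # make the graph undirected


def graph_filter(x, a, hops=1):
    """Average each node with its neighbours, `hops` times (assumed filter)."""
    a_hat = a + np.eye(a.shape[0])                    # add self-loops
    p = a_hat / a_hat.sum(axis=1, keepdims=True)      # row-normalize
    for _ in range(hops):
        x = p @ x
    return x


def kmeans_masks(x, n_src=2, iters=20):
    """Plain Lloyd k-means; returns one-hot masks of shape (n_bins, n_src)."""
    centers = [x[0]]                                  # farthest-point init
    for _ in range(1, n_src):
        d = np.min([np.sum((x - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(d)])
    centers = np.stack(centers)
    for _ in range(iters):
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(n_src):
            if np.any(labels == j):                   # keep old center if empty
                centers[j] = x[labels == j].mean(axis=0)
    return np.eye(n_src)[labels]


def in_class_distance(x, labels):
    """Mean distance of each point to the centroid of its own class."""
    per_class = [
        np.linalg.norm(x[labels == c] - x[labels == c].mean(axis=0), axis=1).mean()
        for c in np.unique(labels)
    ]
    return float(np.mean(per_class))


# Synthetic stand-in for embeddings of 200 time-frequency bins, two sources.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 100)
emb = rng.normal(size=(200, 8)) + np.where(truth[:, None] == 0, 1.0, -1.0)

smooth = graph_filter(emb, knn_adjacency(emb, k=10), hops=2)
masks = kmeans_masks(smooth)                          # one column per source
print(in_class_distance(emb, truth), in_class_distance(smooth, truth))
```

In a real separation pipeline the rows would correspond to the bins of a spectrogram, so `masks` would be reshaped to the time-frequency grid and multiplied with the mixture spectrogram. The parameter-free averaging filter is chosen here only because it shows the in-class shrinking effect without any training; a learned GCN layer would add a weight matrix and a nonlinearity.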

PROPOSED MODEL

GRAPH CONVOLUTIONAL FILTER-BASED CLUSTERING

EXPERIMENTS