Speech Separation Using Augmented-Discrimination Learning on Squash-Norm Embedding Vector and Node Encoder

Ha Minh Tan, Jia-Ching Wang, Chung-Ting Li, Yung-Hui Li, Yuan-Shan Lee, Kai-Wen Liang

doi:10.1109/access.2022.3188712

Open Access

https://doi.org/10.1109/access.2022.3188712

Abstract

Speech separation is used in important applications such as automatic speech, paralinguistics, speech recognition, hearing aids, and human-machine interaction. In recent years, deep neural networks have been widely used for speech and music separation, and several breakthrough models based on embedding vectors, such as deep clustering (DC), have been proposed. In this paper, we propose node encoder Squash-norm deep clustering (ESDC), an enhanced discriminative learning framework that combines a node encoder, Squash-norm, and DC. First, a node encoder, developed through a matrix factorization-based learning method for graph representations, creates distinguishable intermediate features that play an important role in improving performance. These discriminative intermediate features are then used as input features for the separation block. Finally, the decoder block constructs the estimated mask through clustering and reconstructs the estimated signal for each source. In particular, we apply a normalization function, Squash-norm, to the input and output vectors to enhance the distinction between high-dimensional embedding vectors. This nonlinear function amplifies the differences in the input vectors, resulting in highly unique features, which are scalar products of the vectors. Squash-norm likewise enhances the discrimination of the output vectors, which improves the estimated mask constructed by clustering them. Overall, across genders on the TSP and TIMIT datasets, the proposed ESDC achieves gains of 1.27–2.09 dB SDR, 1.28–2.21 dB SDRi, and 1.3–2.44 dB SI-SNRi over the DC baseline. For same-gender mixtures on the TIMIT dataset, the proposed ESDC achieves gains of 1.14–2.71 dB SDR, 0.99–2.74 dB SDRi, and 0.62–2.86 dB SI-SNRi over the DC baseline. In all cases, the proposed ESDC model consistently maintains higher STOI and PESQ scores than the DC baselines on the TSP and TIMIT datasets.
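As a rough illustration of two ideas named in the abstract, the sketch below shows a squash-style normalization and the SI-SNRi metric. The abstract does not define Squash-norm, so `squash` here assumes the squash nonlinearity known from capsule networks (direction preserved, norm n mapped to n^2 / (1 + n^2)); the paper's actual function may differ. The function names, the `eps` parameter, and the toy signals are hypothetical and chosen only for this example.

```python
import math


def squash(vec, eps=1e-8):
    """Capsule-style squash (an ASSUMPTION, not the paper's confirmed definition).

    Keeps the direction of `vec` and maps its norm n to n^2 / (1 + n^2),
    so short vectors shrink toward 0 and long vectors approach unit length.
    """
    norm_sq = sum(x * x for x in vec)
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)
    return [scale * x for x in vec]


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB for equal-length signals."""
    n = len(target)
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    # Remove the mean so the measure ignores constant offsets.
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    # Project the estimate onto the target to isolate the "signal" part.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    proj = [dot / tgt_energy * t for t in tgt]
    # Whatever is left after the projection counts as "noise".
    noise = [e - p for e, p in zip(est, proj)]
    proj_energy = sum(p * p for p in proj)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((proj_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


if __name__ == "__main__":
    # Toy example: an interferer orthogonal to the target, at equal energy.
    target = [1.0, -1.0, 1.0, -1.0]
    interferer = [1.0, 1.0, -1.0, -1.0]
    mixture = [t + i for t, i in zip(target, interferer)]
    # A hypothetical separator output that keeps 10% of the interferer.
    estimate = [t + 0.1 * i for t, i in zip(target, interferer)]
    print(round(si_snr_improvement(estimate, mixture, target), 2))
    print([round(x, 4) for x in squash([3.0, 4.0])])
```

In this toy case the mixture sits near 0 dB SI-SNR and the estimate near 20 dB, so the improvement is about 20 dB. The gains reported in the abstract are differences of this kind between ESDC and the DC baseline, measured on real speech mixtures.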
