Abstract

Correlation of speech data is ubiquitous in real-world environments. Heterogeneous data drawn from different distributions significantly hinder federated learning (FL) and can lead to drifted global models whose convergence is intractable. Based on the differences between speech and noise distributions, this paper proposes a self-adaptive noise distribution network for speech enhancement (SASE), a complex-domain denoising model with FL, to tackle the problem of data heterogeneity. We propose a complex-valued time–frequency gate attention mechanism (TF-GA), which is optimized in both the time and frequency domains to extract rich speech and noise distribution information. Further, we construct a self-adaptive Gaussian unitary ensemble attention (SA-GUEA) block in the SASE network to make it adaptable to the noise distribution. To address data heterogeneity with the SASE model, we develop the CommonVoice (CV) dataset with noise; this large heterogeneous dataset contains several speakers, acoustic environments, and noise obtained from different devices. In a realistic FL experimental environment, we also develop Loss-based and PESQ-based optimization weighting strategies that intelligently update the server model on a large-scale heterogeneous dataset, leading to better generalization on non-independent and identically distributed (non-IID) data. Empirical studies based on this theoretical framework demonstrate that the proposed SASE model not only exhibits high applicability to unknown noise on the independent and identically distributed (IID) VoiceBank + DEMAND dataset, but also achieves successful denoising of real environmental noise on the non-IID CV Chinese + Noise92 and CV English + TUT datasets.
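
The abstract names Loss-based and PESQ-based weighting for updating the server model but does not give the formulas. The following is only a minimal sketch of how such score-weighted aggregation could look, not the paper's method. The rules used here are assumptions: a softmax over negative client losses, and weights proportional to non-negative PESQ scores. All function names are hypothetical, and model parameters are simplified to flat lists of floats.

```python
import math


def normalize(scores):
    """Scale non-negative scores so that they sum to 1."""
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]


def loss_based_weights(losses):
    """Assumed rule: softmax over negative losses (lower loss, larger weight)."""
    lowest = min(losses)  # shift by the minimum for numerical stability
    return normalize([math.exp(-(loss - lowest)) for loss in losses])


def pesq_based_weights(pesq_scores):
    """Assumed rule: weight proportional to PESQ (higher score, larger weight)."""
    return normalize(pesq_scores)


def aggregate(client_params, weights):
    """Weighted average of the clients' parameter vectors."""
    size = len(client_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, client_params))
        for i in range(size)
    ]


if __name__ == "__main__":
    # Two hypothetical clients, each holding a two-parameter model.
    clients = [[1.0, 2.0], [3.0, 4.0]]
    print(aggregate(clients, loss_based_weights([0.2, 1.0])))
    print(aggregate(clients, pesq_based_weights([1.0, 3.0])))
```

Under either assumed rule, a client whose local model scores better pulls the server model closer to its own parameters, which is one way to limit drift when client data are non-IID.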
