The generalization of speech enhancement models to real-world far-field speech faces significant challenges, including low signal-to-noise ratio, strong reverberation, and variable latency between far-field and near-field recordings. Moreover, using non-ideal near-field recordings as the labeled training targets further reduces the effectiveness of commonly used predictive models. To tackle these challenges, we propose the Far-field to Near-field Speech Enhancement through Supervised Adversarial Training (FNSE-SAT) strategy. This approach employs supervised adversarial learning with a Multi-Resolution Discriminator, exploiting speech characteristics at different frequency resolutions. A temporal frame shift operation is also incorporated to mitigate alignment discrepancies observed in real-world data, and its effectiveness is confirmed by measuring Voice Activity Detection accuracy. Experimental validation in both causal and non-causal configurations demonstrates that FNSE-SAT significantly outperforms the state-of-the-art predictive model on real-world datasets. Furthermore, adopting a transfer learning strategy, in which the model is initialized on a simulated dataset before fine-tuning on real-world data, strengthens the efficacy of FNSE-SAT and leads to superior outcomes. Character error rate results show that FNSE-SAT generates fewer components that deviate from the textual content than the generative diffusion method. Reducing the Discriminator to a single resolution decreases DNSMOS but has only a slight effect on the character error rate.
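The abstract does not specify the discriminator architecture; the following is a minimal sketch of how a multi-resolution discriminator front end could be organized, assuming one sub-discriminator per STFT resolution. The FFT sizes, layer widths, and the SubDiscriminator design are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a multi-resolution discriminator (hypothetical architecture:
# FFT sizes, hop lengths, and conv widths are assumptions for illustration).
import torch
import torch.nn as nn


class SubDiscriminator(nn.Module):
    """Scores a single magnitude spectrogram (illustrative design)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),  # patch-wise logits
        )

    def forward(self, spec):
        # spec: (batch, freq, frames) -> add channel dim for Conv2d
        return self.net(spec.unsqueeze(1))


class MultiResolutionDiscriminator(nn.Module):
    """Applies one sub-discriminator per STFT frequency resolution."""

    def __init__(self, fft_sizes=(512, 1024, 2048)):
        super().__init__()
        self.fft_sizes = fft_sizes
        self.subs = nn.ModuleList(SubDiscriminator() for _ in fft_sizes)

    def forward(self, wav):
        scores = []
        for n_fft, sub in zip(self.fft_sizes, self.subs):
            spec = torch.stft(
                wav, n_fft=n_fft, hop_length=n_fft // 4,
                window=torch.hann_window(n_fft, device=wav.device),
                return_complex=True,
            ).abs()
            scores.append(sub(spec))
        return scores


# Usage: score a batch of enhanced waveforms at three frequency resolutions.
disc = MultiResolutionDiscriminator()
logits = disc(torch.randn(2, 16000))  # list of three patch-logit maps
```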