Abstract

Generative Adversarial Networks (GANs) have demonstrated promising results as end-to-end models for whispered-to-voiced speech conversion. Non-autoregressive systems such as GANs, which can perform conditional waveform generation, eliminate the need for separate models to estimate voiced speech features and offer faster inference than autoregressive methods. This study aims to identify the optimal GAN architecture for the whispered-to-voiced speech conversion task by comparing six state-of-the-art models. Furthermore, we present a method for evaluating the preservation of speaker identity and local accent, using embeddings obtained from speaker and language identification systems. Our experimental results show that building the speech conversion system on the HiFi-GAN architecture yields the best objective evaluation scores, outperforming the baseline by ∼9% relative in frequency-weighted Signal-to-Noise Ratio and Log Likelihood Ratio, and by ∼29% relative in Root Mean Squared Error. In subjective tests, HiFi-GAN achieved a mean opinion score of 2.9, significantly outperforming the baseline, which scored 1.4. HiFi-GAN also improved ASR performance and preserved speaker identity and accent, with correct language detection rates of up to ∼98%.