Synthesizing Lithuanian voice replacement for laryngeal cancer patients with Pareto-optimized flow-based generative synthesis network

R Maskeliunas, R Damasevicius, A Kulikajevas, K Pribuisis, N Ulozaite-Staniene, V Uloza

doi:10.1016/j.apacoust.2024.110097

Abstract

This study presents P-GLOW, a Pareto-optimized flow-based generative network for Lithuanian speech synthesis, intended to substitute original voices affected by cancer-related pathologies. By comparing this native Lithuanian model with an English model trained on Lithuanian phonemes, the study highlights the impact of language-specific characteristics on performance and illustrates the advantages of tailored, language-specific models for speech substitution for persons with speech impairments. Our methodology integrates generative models with an architecture based on 12 coupling layers, 1x1 invertible convolutions, and dilated convolutions with gated hyperbolic tangent activation functions. This design facilitates the effective transformation of waveform inputs into Mel spectrograms, enabling the synthesis of high-quality speech audio. The proposed Pareto-optimized native Lithuanian model achieved a MOS of 4.2 ± 0.1 and an SMOS of 3.9 ± 0.2, while the baseline model (an English model additionally trained on Lithuanian phonemes) scored lower, with a MOS of 3.8 ± 0.2 and an SMOS of 3.4 ± 0.3. In terms of the Mel Cepstral Distortion (MCD), Voiced/Unvoiced Decision Error (VDE), Glottal-to-Noise Excitation Ratio (GPE), and Frame Fidelity Error (FFE) scores, where lower values are better, the proposed method achieved an MCD of 3.2 dB ± 0.2, a VDE of 5.6% ± 1.0, a GPE of 7.9% ± 1.4, and an FFE of 9.7% ± 2.0, outperforming the baseline method, which had an MCD of 4.6 dB ± 0.3, a VDE of 10.8% ± 1.9, a GPE of 14.5% ± 2.4, and an FFE of 18.7% ± 2.8. Log Likelihood Ratio (LLR), Weighted Spectral Slope (WSS), and Perceptual Evaluation of Speech Quality (PESQ) scores also showed strong improvement. The proposed method achieved an LLR of 0.5 ± 0.05, a WSS of 0.3 ± 0.03, and a PESQ of 4.2 ± 0.1, while the baseline method performed worse, with an LLR of 0.8 ± 0.08, a WSS of 0.5 ± 0.05, and a PESQ of 3.6 ± 0.2.
The Speech Intelligibility Index (SII) and Short-Time Objective Intelligibility (STOI) metrics for restored alaryngeal speech were also high. The proposed method achieved an SII of 0.75 ± 0.05 and an STOI of 0.85 ± 0.03, while the baseline method scored lower, with an SII of 0.65 ± 0.05 and an STOI of 0.75 ± 0.04. The evaluation results demonstrate that our approach effectively synthesizes high-quality Lithuanian speech audio, restoring the input signal from distorted alaryngeal speech.
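
To make the architectural terms in the abstract more concrete, the following is a minimal sketch of a single affine coupling layer with a gated hyperbolic tangent activation, written in plain Python. It is not the P-GLOW implementation: the function names, the fixed weights, and `toy_conditioner` (a stand-in for the dilated-convolution network conditioned on the Mel spectrogram) are all assumptions made for illustration. The sketch only shows why such a layer is exactly invertible and why its log-determinant is cheap to compute.

```python
import math

def gated_tanh(filter_in, gate_in):
    """Gated activation: tanh(filter) * sigmoid(gate)."""
    return math.tanh(filter_in) * (1.0 / (1.0 + math.exp(-gate_in)))

def toy_conditioner(x_a, mel):
    """Hypothetical stand-in for the dilated-convolution network.

    Maps the untouched half x_a and the conditioning values mel to a
    log-scale and a shift. The weights below are arbitrary constants.
    """
    h = [gated_tanh(0.5 * a + 0.1 * m, 0.3 * a - 0.2 * m)
         for a, m in zip(x_a, mel)]
    log_s = [0.5 * v for v in h]
    t = [v - 0.1 for v in h]
    return log_s, t

def coupling_forward(x, mel):
    """Affine coupling: keep the first half, transform the second half."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = toy_conditioner(x_a, mel)
    y_b = [b * math.exp(ls) + ti for b, ls, ti in zip(x_b, log_s, t)]
    log_det = sum(log_s)  # log|det J| of the transform
    return x_a + y_b, log_det

def coupling_inverse(y, mel):
    """Exact inverse: x_a is unchanged, so log_s and t can be recomputed."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = toy_conditioner(y_a, mel)
    x_b = [(b - ti) * math.exp(-ls) for b, ls, ti in zip(y_b, log_s, t)]
    return y_a + x_b

# Round trip on a toy 4-sample frame with 2 conditioning values.
x = [0.2, -0.4, 0.7, 1.1]
mel = [0.3, -0.1]
y, log_det = coupling_forward(x, mel)
x_rec = coupling_inverse(y, mel)
```

In a full flow model, several such layers would be stacked (the abstract mentions 12), with invertible 1x1 convolutions mixing channels between them so that every part of the signal is eventually transformed.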

Full Text