Speech Synthesis Model Research Articles

Objective. Brain-computer interfaces (BCIs) have the potential to preserve or restore speech in patients with neurological disorders that weaken the muscles involved in speech production. However, successful training of low-latency speech synthesis and recognition models requires alignment of neural activity with intended phonetic or acoustic output with high temporal precision. This is particularly challenging in patients who cannot produce audible speech, as ground truth with which to pinpoint neural activity synchronized with speech is not available.

Approach. In this study, we present a new iterative algorithm for neural voice activity detection (nVAD) called iterative alignment discovery dynamic time warping (IAD-DTW) that integrates DTW into the loss function of a deep neural network (DNN). The algorithm is designed to discover the alignment between a patient's electrocorticographic (ECoG) neural responses and their attempts to speak during collection of data for training BCI decoders for speech synthesis and recognition.

Main results. To demonstrate the effectiveness of the algorithm, we tested its accuracy in predicting the onset and duration of acoustic signals produced by able-bodied patients with intact speech undergoing short-term diagnostic ECoG recordings for epilepsy surgery. We simulated a lack of ground truth by randomly perturbing the temporal correspondence between neural activity and an initial single estimate for all speech onsets and durations. We examined the model's ability to overcome these perturbations to estimate ground truth. IAD-DTW showed no notable degradation (<1% absolute decrease in accuracy) in performance in these simulations, even in the case of maximal misalignments between speech and silence.

Significance. IAD-DTW is computationally inexpensive and can be easily integrated into existing DNN-based nVAD approaches, as it pertains only to the final loss computation. This approach makes it possible to train speech BCI algorithms using ECoG data from patients who are unable to produce audible speech, including those with Locked-In Syndrome.
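To illustrate why a DTW-based loss can tolerate misaligned labels, the following is a minimal sketch of classic dynamic time warping between a predicted speech/silence sequence and a temporally shifted label sequence. It is not the authors' IAD-DTW, which integrates DTW into a DNN loss; the function name `dtw_cost` and the toy sequences are assumptions made for illustration only.

```python
def dtw_cost(a, b):
    """Minimal cumulative cost of aligning sequence a with sequence b (classic DTW)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local frame-to-frame distance
            # extend the cheapest of: step in a only, step in b only, step in both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical toy data: 1 = speech frame, 0 = silence frame.
predicted = [0, 1, 1, 1, 0, 0, 0]  # model output
labels = [0, 0, 1, 1, 1, 0, 0]     # same utterance, onset shifted by one frame

# Frame-by-frame comparison penalizes the shift; DTW absorbs it.
framewise_cost = sum(abs(p, ) if False else abs(p - q) for p, q in zip(predicted, labels))
warped_cost = dtw_cost(predicted, labels)
```

In this toy case the frame-wise cost is nonzero because of the one-frame shift, while the DTW cost is zero, since warping maps the speech segment of one sequence onto the speech segment of the other.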

This study presents P-GLOW, a Pareto-optimized flow-based generative network for Lithuanian speech synthesis, intended to substitute original voices affected by cancer-related pathologies. Comparing this pure Lithuanian model with an English model trained on Lithuanian phonemes, the study emphasizes the impact of language-specific characteristics on performance, illustrating the advantages of tailored, language-specific models for speech substitution for impaired persons. Our methodology integrates generative models with an architecture based on 12 coupling layers, 1x1 invertible convolutions, and dilated convolutions with gated hyperbolic tangent activation functions. This design facilitates the effective transformation of waveform inputs into Mel spectrograms, enabling the synthesis of high-quality speech audio. The proposed Pareto-optimized native Lithuanian model achieved a MOS of 4.2 ± 0.1 and an SMOS of 3.9 ± 0.2, while the baseline model (an English-language model further trained on Lithuanian phonemes) scored lower, with a MOS of 3.8 ± 0.2 and an SMOS of 3.4 ± 0.3. In terms of the Mel Cepstral Distortion (MCD), Voiced/Unvoiced Decision Error (VDE), Glottal-to-Noise Excitation Ratio (GPE), and Frame Fidelity Error (FFE) scores, the proposed method achieved an MCD of 3.2 dB ± 0.2, a VDE of 5.6% ± 1.0, a GPE of 7.9% ± 1.4, and an FFE of 9.7% ± 2.0, outperforming the baseline method, which scored higher with an MCD of 4.6 dB ± 0.3, a VDE of 10.8% ± 1.9, a GPE of 14.5% ± 2.4, and an FFE of 18.7% ± 2.8. Log Likelihood Ratio (LLR), Weighted Spectral Slope (WSS), and Perceptual Evaluation of Speech Quality (PESQ) scores also showed strong improvement. The proposed method achieved an LLR of 0.5 ± 0.05, a WSS of 0.3 ± 0.03, and a PESQ of 4.2 ± 0.1, while the baseline method scored worse, with an LLR of 0.8 ± 0.08, a WSS of 0.5 ± 0.05, and a PESQ of 3.6 ± 0.2.
The Speech Intelligibility Index (SII) and Short-Time Objective Intelligibility (STOI) metrics for restored alaryngeal speech were also high. The proposed method achieved an SII of 0.75 ± 0.05 and an STOI of 0.85 ± 0.03, while the baseline method scored lower, with an SII of 0.65 ± 0.05 and an STOI of 0.75 ± 0.04. The evaluation results demonstrate that our approach effectively synthesizes high-quality Lithuanian speech audio, restoring the input signal from distorted alaryngeal speech.
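The coupling layers mentioned in the abstract above are what make a flow-based network invertible. As a minimal sketch, assuming the standard affine coupling formulation used in Glow-style models (the abstract does not give P-GLOW's exact equations), one such layer can be written as follows. The function `toy_net` is a hypothetical stand-in for the dilated-convolution network, and all names are illustrative.

```python
import math


def toy_net(xa):
    # Hypothetical stand-in for the dilated-convolution network:
    # maps the untouched half to a log-scale and a shift per element.
    log_s = [math.tanh(v) for v in xa]
    t = [0.5 * v for v in xa]
    return log_s, t


def coupling_forward(x):
    """Affine coupling: keep the first half, scale and shift the second half.

    Assumes len(x) is even so both halves have the same length.
    """
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    log_s, t = toy_net(xa)
    yb = [math.exp(ls) * b + ti for ls, b, ti in zip(log_s, xb, t)]
    log_det = sum(log_s)  # log-determinant of the Jacobian, used in the likelihood
    return xa + yb, log_det


def coupling_inverse(y):
    """Exact inverse: the first half is unchanged, so scale and shift can be recomputed."""
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    log_s, t = toy_net(ya)
    xb = [(b - ti) * math.exp(-ls) for ls, b, ti in zip(log_s, yb, t)]
    return ya + xb


x = [0.3, -1.2, 0.7, 2.0]
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)
```

Because the first half passes through unchanged, the inverse never needs to invert `toy_net` itself, which is why the inner network can be arbitrarily complex. Stacking several such layers, with channel mixing between them, yields a deep invertible model.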

Articles published on Speech Synthesis Model

Iterative alignment discovery of speech-associated neural activity.

Synthesizing Lithuanian voice replacement for laryngeal cancer patients with Pareto-optimized flow-based generative synthesis network

Attention-based speech feature transfer between speakers.

Research on a Mongolian Text to Speech Model Based on Ghost and ILPCnet

VioLA: Conditional Language Models for Speech Recognition, Synthesis, and Translation

A Study of Artificial Intelligence-Assisted Listening Training in College English Teaching

Effective Zero-Shot Multi-Speaker Text-to-Speech Technique Using Information Perturbation and a Speaker Encoder

Research on Speech Synthesis Based on Mixture Alignment Mechanism.

The Evaluation of Performance Related to Noise Robustness of VITS for Speech Synthesis

Deepfake Speech Recognition and Detection

Intrinsic velocity differences between larynx raising and larynx lowering.

ZSE-VITS: A Zero-Shot Expressive Voice Cloning Method Based on VITS

Deep Learning Speech Synthesis Model for Word/Character-Level Recognition in the Tamil Language

Advancements in Arabic Text-to-Speech Systems: A 22-Year Literature Review

Emotional Vietnamese Speech Synthesis Using Style-Transfer Learning

Improving Few-Shot Multi-Speaker Text-to-Speech Adaptive-Based with Extracting Mel-Vector (EMV) for Vietnamese

FastSpeechStyle: Fast, Emotion Controllable, and High-Quality Speech Synthesis

Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control

Implementing a Statistical Parametric Speech Synthesis System for a Patient with Laryngeal Cancer.

MIST-Tacotron: End-to-End Emotional Speech Synthesis Using Mel-Spectrogram Image Style Transfer
