Unseen Speakers Research Articles

End-to-end (e2e) speech synthesis systems have become popular with the recent introduction of text-to-spectrogram conversion systems, such as Tacotron, that use encoder–decoder-based neural architectures. Even though those sequence-to-sequence systems can produce mel-spectrograms from the letters without a text processing frontend, they require substantial amounts of well-manipulated, labeled audio data that have high SNR and minimum amounts of artifacts. These data requirements make it difficult to build end-to-end systems from scratch, especially for low-resource languages. Moreover, most of the e2e systems are not designed for devices with tiny memory and CPU resources. Here, we investigate using a traditional deep neural network (DNN) for acoustic modeling together with a postfilter that improves the speech features produced by the network. The proposed architectures were trained with the relatively noisy, multi-speaker, Wall Street Journal (WSJ) database and tested with unseen speakers. The thin postfilter layer was adapted with minimal data to the target speaker for testing. We investigated several postfilter architectures and compared them with both objective and subjective tests. Fully-connected and transformer-based architectures performed the best in subjective tests. The novel adversarial transformer-based architecture with adaptive discriminator loss performed the best in the objective tests. Moreover, it was faster than the other architectures both in training and inference. Thus, our proposed lightweight transformer-based postfilter architecture significantly improved speech quality and efficiently adapted to new speakers with few shots of data and a hundred training iterations, making it computationally efficient and suitable for scalability.

Previous studies on the automatic classification of voice disorders have mostly investigated the binary classification task, which aims to distinguish pathological voice from healthy voice. Using multi-class classifiers, however, more fine-grained identification of voice disorders can be achieved, which is more helpful for clinical practitioners. Unfortunately, there is little publicly available training data for many voice disorders, which lowers the classification performance on data from unseen speakers. Earlier studies have shown that the usage of glottal source features can reduce data redundancy in detection of laryngeal voice disorders. Another approach to tackle the problems caused by scarcity of training data is to utilize deep learning models, such as wav2vec 2.0 and HuBERT, that have been pre-trained on larger databases. Since the aforementioned approaches have not been thoroughly studied in the multi-class classification of voice disorders, they will be jointly studied in the present work. In addition, we study a hierarchical classifier, which enables task-wise feature optimization and more efficient utilization of data. In this work, the aforementioned three approaches are compared with traditional mel frequency cepstral coefficient (MFCC) features and one-vs-rest and one-vs-one SVM classifiers. The results in a 3-class classification problem between healthy voice and two laryngeal disorders (hyperfunctional dysphonia and vocal fold paresis) indicate that all the studied methods outperform the baselines. The best performance was achieved by using features from wav2vec 2.0 LARGE together with hierarchical classification. The balanced classification accuracy of the system was 62.77% for male speakers, and 55.36% for female speakers, which outperformed the baseline systems by an absolute improvement of 15.76% and 6.95% for male and female speakers, respectively.

Unseen Speakers Research Articles

Related Topics

Articles published on Unseen Speakers

Exploring the Impact of Fine-Tuning the Wav2vec2 Model in Database-Independent Detection of Dysarthric Speech.

Using deep learning to improve the intelligibility of a target speaker in noisy multi-talker environments for people with normal hearing and hearing loss.

MRMI-TTS: Multi-Reference Audios and Mutual Information Driven Zero-Shot Voice Cloning

SIG: Speaker Identification in Literature via Prompt-Based Generation

Phoneme Hallucinator: One-Shot Voice Conversion via Set Expansion

Generalizing sentence-level lipreading to unseen speakers: a two-stream end-to-end approach

Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition.

Emotion selectable end-to-end text-based speech editing

Comparing accuracy in voice-based assessments of biological speaker traits across speech types

Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

Effective Zero-Shot Multi-Speaker Text-to-Speech Technique Using Information Perturbation and a Speaker Encoder

LipFormer: Learning to Lipread Unseen Speakers Based on Visual-Landmark Transformers

Data augmentation for speech separation

Deep learning-based speaker-adaptive postfiltering with limited adaptation data for embedded text-to-speech synthesis systems

Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis

Speaker-Independent Emotional Voice Conversion via Disentangled Representations

Hierarchical Multi-Class Classification of Voice Disorders Using Self-Supervised Models and Glottal Features

AudioLM: A Language Modeling Approach to Audio Generation

Strong Generalized Speech Emotion Recognition Based on Effective Data Augmentation

Modelling of a Speech-to-Text Recognition System for Air Traffic Control and NATO Air Command

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Unseen Speakers Research Articles

Related Topics

Articles published on Unseen Speakers

Exploring the Impact of Fine-Tuning the Wav2vec2 Model in Database-Independent Detection of Dysarthric Speech.

Using deep learning to improve the intelligibility of a target speaker in noisy multi-talker environments for people with normal hearing and hearing loss.

MRMI-TTS: Multi-Reference Audios and Mutual Information Driven Zero-Shot Voice Cloning

SIG: Speaker Identification in Literature via Prompt-Based Generation

Phoneme Hallucinator: One-Shot Voice Conversion via Set Expansion

Generalizing sentence-level lipreading to unseen speakers: a two-stream end-to-end approach

Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition.

Emotion selectable end-to-end text-based speech editing

Comparing accuracy in voice-based assessments of biological speaker traits across speech types

Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

Effective Zero-Shot Multi-Speaker Text-to-Speech Technique Using Information Perturbation and a Speaker Encoder

LipFormer: Learning to Lipread Unseen Speakers Based on Visual-Landmark Transformers

Data augmentation for speech separation

Deep learning-based speaker-adaptive postfiltering with limited adaptation data for embedded text-to-speech synthesis systems

Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis

Speaker-Independent Emotional Voice Conversion via Disentangled Representations

Hierarchical Multi-Class Classification of Voice Disorders Using Self-Supervised Models and Glottal Features

AudioLM: A Language Modeling Approach to Audio Generation

Strong Generalized Speech Emotion Recognition Based on Effective Data Augmentation

Modelling of a Speech-to-Text Recognition System for Air Traffic Control and NATO Air Command