Multi-speaker Scenario Research Articles

Multi-speaker spoken datasets enable the creation of text-to-speech synthesis (TTS) systems which can output several voice identities. The multi-speaker (MSPK) scenario also enables the use of fewer training samples per speaker. However, in the resulting acoustic model, not all speakers exhibit the same synthetic quality, and some of the voice identities cannot be used at all. In this paper we evaluate the influence of the recording conditions, speaker gender, and speaker particularities on the quality of the synthesised output of a deep neural TTS architecture, namely Tacotron2. The evaluation is possible due to the use of a large Romanian parallel spoken corpus containing over 81 hours of data. Within this setup, we also evaluate the influence of different types of text representations: orthographic, phonetic, and phonetic extended with syllable boundaries and lexical stress markings. We evaluate the results of the MSPK system using the objective measures of equal error rate (EER) and word error rate (WER), and also look into the distances between natural and synthesised t-SNE projections of the embeddings computed by an accurate speaker verification network. The results show that there is indeed a strong correlation between the recording conditions and the speaker's synthetic voice quality. The speaker gender does not influence the output, and extending the input text representation with syllable boundaries and lexical stress information does not equally enhance the generated audio across all speaker identities. The visualisation of the t-SNE projections of the natural and synthesised speaker embeddings shows that the acoustic model shifts the neural representation of some speakers, but not all of them. As a result, these speakers have lower-quality output speech.
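As a rough illustration of one of the objective measures named in the abstract above, WER is commonly computed as the word-level edit distance between a reference transcript and a recognised transcript, divided by the number of reference words. The sketch below assumes that standard definition; the function name `word_error_rate` and the whitespace tokenisation are illustrative choices, not details taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper: tokenisation is plain whitespace splitting,
    which is an assumption and not the paper's procedure.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

In an evaluation like the one described, the hypothesis would typically come from running a speech recogniser on the synthesised audio, so a lower WER suggests more intelligible output.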


Selectively attending to one speaker in a multi-speaker scenario is thought to synchronize low-frequency cortical activity to the attended speech signal. In recent studies, reconstruction of speech from single-trial electroencephalogram (EEG) data has been used to decode which talker a listener is attending to in a two-talker situation. It is currently unclear how this generalizes to more complex sound environments. Behaviorally, speech perception is robust to the acoustic distortions that listeners typically encounter in everyday life, but it is unknown whether this is mirrored by a noise-robust neural tracking of attended speech. Here we used advanced acoustic simulations to recreate real-world acoustic scenes in the laboratory. In virtual acoustic realities with varying amounts of reverberation and number of interfering talkers, listeners selectively attended to the speech stream of a particular talker. Across the different listening environments, we found that the attended talker could be accurately decoded from single-trial EEG data irrespective of the different distortions in the acoustic input. For highly reverberant environments, speech envelopes reconstructed from neural responses to the distorted stimuli resembled the original clean signal more than the distorted input. With reverberant speech, we observed a late cortical response to the attended speech stream that encoded temporal modulations in the speech signal without its reverberant distortion. Single-trial attention decoding accuracies based on 40–50 s long blocks of data from 64 scalp electrodes were equally high (80–90% correct) in all considered listening environments and remained statistically significant using down to 10 scalp electrodes and short (<30 s) unaveraged EEG segments. In contrast to the robust decoding of the attended talker, we found that decoding of the unattended talker deteriorated with the acoustic distortions. These results suggest that cortical activity tracks an attended speech signal in a way that is invariant to acoustic distortions encountered in real-life sound environments. Noise-robust attention decoding additionally suggests a potential utility of stimulus reconstruction techniques in attention-controlled brain-computer interfaces.
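The final decision step of stimulus-reconstruction attention decoding can be sketched as follows, assuming the common approach of correlating the envelope reconstructed from EEG with each candidate talker's speech envelope and picking the best match. The abstract does not spell out this rule, so it is an assumption here; the names `pearson` and `decode_attended` are hypothetical, the toy numbers are made up for illustration, and the step that reconstructs the envelope from EEG is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def decode_attended(reconstructed, talker_envelopes):
    """Return (index of best-matching talker, list of correlation scores).

    Assumed decision rule: the talker whose envelope correlates most
    strongly with the EEG-reconstructed envelope is labelled as attended.
    """
    scores = [pearson(reconstructed, env) for env in talker_envelopes]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    # Toy envelopes for illustration only, not data from the study.
    reconstructed = [0.1, 0.5, 0.9, 0.4, 0.2]
    talker_a = [0.0, 0.6, 1.0, 0.5, 0.1]
    talker_b = [0.9, 0.2, 0.1, 0.7, 0.8]
    index, scores = decode_attended(reconstructed, [talker_a, talker_b])
    print(index, scores)
```

Under this rule, decoding accuracy is the fraction of trials in which the selected index matches the talker the listener was instructed to attend to.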



Articles published on Multi-speaker Scenario

ATC-SD Net: Radiotelephone Communications Speaker Diarization Network

Summary of the DISPLACE challenge 2023-DIarization of SPeaker and LAnguage in Conversational Environments

Exploring the influence of bilingual experience on speech-in-competition measures

Speaker identification and localization using shuffled MFCC features and deep learning

Speaker Verification Based on Single Channel Speech Separation

Target Speaker Extraction by Fusing Voiceprint Features

SuperFormer: Enhanced Multi-Speaker Speech Separation Network Combining Channel and Spatial Adaptability

Extracting the Auditory Attention in a Dual-Speaker Scenario From EEG Using a Joint CNN-LSTM Model.

Auditory attention decoding from electroencephalography based on long short-term memory networks

Unsupervised Abstractive Dialogue Summarization for Tete-a-Tetes

EEG-based detection of the locus of auditory attention with convolutional neural networks.

An objective evaluation of the effects of recording conditions and speaker characteristics in multi-speaker deep neural speech synthesis

Perception of Rhythmic Speech Is Modulated by Focal Bilateral Transcranial Alternating Current Stimulation.

Successive Relative Transfer Function Identification Using Blind Oblique Projection

An Interpretable Performance Metric for Auditory Attention Decoding Algorithms in a Context of Neuro-Steered Gain Control.

Speech Activity Detection in Naturalistic Audio Environments: Fearless Steps Apollo Corpus

Noise-robust cortical tracking of attended speech in real-world acoustic scenes

3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition

Neuro-steered noise suppression for auditory prostheses

Effects of age on electrophysiological correlates of speech processing in a dynamic "cocktail-party" situation.
