Mix‐MaxETTS: A text‐to‐emotional speech synthesis model based on a deep encoder–decoder structure for the transfer of secondary emotions

Abstract

Given the importance of emotions in social interactions, emotional speech synthesis has attracted significant attention in the field of human–computer interaction. Remarkable advancements have been made in emotional text‐to‐speech synthesis, but most previous studies have concentrated on imitating styles associated with a specific primary emotion, neglecting secondary emotions that arise from mixtures of primary emotions. Therefore, there is a need to leverage both primary and secondary emotions in speech synthesis to facilitate more engaging, realistic, and natural interactions among artificial social agents. To address this gap, we propose a text‐to‐emotional speech synthesis model designed to generate nuanced mixtures of emotions that effectively convey secondary emotions during interactions. By adjusting the values of each basic emotion, we can control the mix of emotions in the synthetic speech. Our proposed method distinguishes between primary emotions and variations in mixed emotions while learning emotional styles. The effectiveness of the proposed framework was validated through both objective and subjective evaluations.
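The abstract describes controlling the emotion mix by adjusting a value for each basic emotion. As a rough illustration of this idea (not the paper's actual architecture), the sketch below blends learned primary-emotion embeddings with user-supplied weights to form a single conditioning vector for a TTS decoder; the `EmotionMixer` class, emotion set, and dimensions are all assumptions made for the example.

```python
# Minimal sketch: blending primary-emotion embeddings into a mixed-emotion
# conditioning vector for a TTS decoder. All names and dimensions are assumptions.
import torch
import torch.nn as nn

PRIMARY_EMOTIONS = ["neutral", "happy", "sad", "angry", "surprise"]

class EmotionMixer(nn.Module):
    def __init__(self, num_emotions: int = len(PRIMARY_EMOTIONS), dim: int = 256):
        super().__init__()
        # One learnable style embedding per primary emotion.
        self.embeddings = nn.Embedding(num_emotions, dim)

    def forward(self, weights: torch.Tensor) -> torch.Tensor:
        # weights: (batch, num_emotions), e.g. 0.6 happy + 0.4 sad for a blend.
        weights = weights / weights.sum(dim=-1, keepdim=True)  # normalize mixture
        return weights @ self.embeddings.weight                # (batch, dim)

mixer = EmotionMixer()
mix = torch.tensor([[0.0, 0.6, 0.4, 0.0, 0.0]])  # happy/sad blend
style_vector = mixer(mix)                         # condition the decoder on this
print(style_vector.shape)                         # torch.Size([1, 256])
```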

Similar Papers
  • Research Article
  • 10.1182/blood-2024-205938
Using Artificial Intelligence to Identify Patient Emotions across the Multiple Myeloma Diagnosis and Treatment Journey
  • Nov 5, 2024
  • Blood
  • Noreen Ali + 3 more

  • Research Article
  • Cited by 20
  • 10.1080/00207590444000221
The lay distinction between primary and secondary emotions: A spontaneous categorization?
  • Apr 1, 2005
  • International Journal of Psychology
  • Ramón Rodríguez‐Torres + 6 more

In line with the psychological essentialism perspective, Leyens et al. (2000) have hypothesized that people attribute different essences to groups and that they attribute more uniquely human characteristics to their own group than to out-groups. Leyens et al. have focused on two types of emotions, which in Roman languages have specific labels, such as sentimientos and emociones in Spanish. A cross-cultural study showed that sentimientos (or secondary emotions) are considered uniquely human emotions whereas emociones (or primary emotions) are perceived as nonuniquely human emotions. The present study focuses on whether this categorization into primary and secondary emotions is a spontaneous distinction that people use in their everyday lives, or whether, on the contrary, it is the result of experimental demands. The paradigm "Who says what to whom" was used to test this question. Geometrical shapes of different colours were systematically associated with different stimuli that varied in meaningfulness. In a first condition, shapes were associated with small or large items of furniture (meaningful categories) and with primary and secondary emotions. In a second condition, the items of furniture were replaced by words ending with a vowel or a consonant (meaningless categories). Subsequently, participants had to recognize which shape was associated with each stimulus. Intra-category errors were significantly more numerous than inter-category errors, except for the words ending with a vowel or a consonant. Stated otherwise, types of emotions were recognized like the meaningful difference between items of furniture. These results show that the distinction between primary and secondary emotions is an implicit one that people use spontaneously, and not as a result of task demands. The findings are discussed from the perspective of psychological essentialism and inter-group relations.

  • Research Article
  • 10.3390/s23062999
Exploring Prosodic Features Modelling for Secondary Emotions Needed for Empathetic Speech Synthesis
  • Mar 10, 2023
  • Sensors
  • Jesin James + 3 more

A low-resource emotional speech synthesis system for empathetic speech synthesis based on modelling prosody features is presented here. Secondary emotions, identified as being needed for empathetic speech, are modelled and synthesised in this investigation. As secondary emotions are subtle in nature, they are more difficult to model than primary emotions. This study is one of the few to model secondary emotions in speech, as they have not been extensively studied so far. Current speech synthesis research uses large databases and deep learning techniques to develop emotion models. There are many secondary emotions, so developing a large database for each of them is expensive. This research therefore presents a proof of concept using handcrafted feature extraction and modelling of these features with a low-resource-intensive machine learning approach, thus creating synthetic speech with secondary emotions. A quantitative-model-based transformation is used to shape the emotional speech's fundamental frequency contour, while speech rate and mean intensity are modelled via rule-based approaches. Using these models, an emotional text-to-speech synthesis system is developed to synthesise five secondary emotions: anxious, apologetic, confident, enthusiastic, and worried. A perception test to evaluate the synthesised emotional speech is also conducted. The participants could identify the correct emotion in a forced-response test with a hit rate greater than 65%.
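The paper shapes the fundamental frequency contour with a quantitative model and handles speech rate and mean intensity with rules. The sketch below gives a generic flavour of such rule-based prosody modification; the `apply_secondary_emotion` helper and all scale factors are invented for illustration and are not the authors' published rules.

```python
# Illustrative rule-based prosody modification for a secondary emotion.
# The scale factors below are invented for demonstration only.
import numpy as np

# Hypothetical per-emotion prosody rules: (f0_scale, rate_scale, intensity_db_offset)
RULES = {
    "anxious":    (1.10, 1.15, +1.0),
    "apologetic": (0.95, 0.90, -2.0),
    "confident":  (1.05, 0.95, +2.0),
}

def apply_secondary_emotion(f0: np.ndarray, duration_s: float, intensity_db: float,
                            emotion: str):
    """Return a modified (f0 contour, duration, intensity) for the target emotion."""
    f0_scale, rate_scale, db_offset = RULES[emotion]
    voiced = f0 > 0                          # leave unvoiced frames (f0 == 0) untouched
    new_f0 = np.where(voiced, f0 * f0_scale, 0.0)
    new_duration = duration_s / rate_scale   # faster speech rate -> shorter utterance
    return new_f0, new_duration, intensity_db + db_offset

f0_contour = np.array([0.0, 180.0, 185.0, 190.0, 0.0])  # Hz, toy contour
print(apply_secondary_emotion(f0_contour, 1.2, 65.0, "anxious"))
```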

  • Book Chapter
  • Cited by 46
  • 10.1007/978-3-540-85483-8_2
Affect Simulation with Primary and Secondary Emotions
  • Sep 1, 2008
  • Christian Becker-Asano + 1 more

In this paper the WASABI Simulation Architecture is introduced, in which a virtual human's cognitive reasoning capabilities are combined with simulated embodiment to achieve the simulation of primary and secondary emotions. In modeling primary emotions we follow the idea of Core Affect in combination with a continuous progression of bodily feeling in three-dimensional emotion space (PAD space), which is only subsequently categorized into discrete emotions. In humans, primary emotions are understood as ontogenetically earlier emotions, which directly influence facial expressions. Secondary emotions, in contrast, afford the ability to reason about current events in the light of experiences and expectations. By technically representing aspects of their connotative meaning in PAD space, we not only assure their mood-congruent elicitation but also combine them with facial expressions that are concurrently driven by the primary emotions. An empirical study showed that human players in the Skip-Bo scenario judge our virtual human MAX significantly older when secondary emotions are simulated in addition to primary ones.
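WASABI tracks a continuously evolving point in three-dimensional PAD (pleasure-arousal-dominance) space and only subsequently categorizes it into discrete emotions. A toy nearest-prototype categorization in that spirit is sketched below; the prototype coordinates are illustrative assumptions, not values from the WASABI architecture.

```python
# Toy nearest-prototype categorization of a PAD-space point into a discrete emotion.
# Prototype coordinates are invented for illustration, not WASABI's actual values.
import math

# (pleasure, arousal, dominance) prototypes in [-1, 1]^3
PROTOTYPES = {
    "joy":     ( 0.8,  0.5,  0.4),
    "anger":   (-0.6,  0.8,  0.3),
    "sadness": (-0.6, -0.4, -0.5),
    "relaxed": ( 0.6, -0.5,  0.2),
}

def categorize(pad):
    """Map a continuous PAD point to the closest discrete emotion prototype."""
    return min(PROTOTYPES, key=lambda name: math.dist(pad, PROTOTYPES[name]))

print(categorize((0.7, 0.4, 0.1)))   # -> "joy"
print(categorize((-0.5, 0.7, 0.0)))  # -> "anger"
```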

  • Research Article
  • Cited by 179
  • 10.1007/s10458-009-9094-9
Affective computing with primary and secondary emotions in a virtual human
  • May 10, 2009
  • Autonomous Agents and Multi-Agent Systems
  • Christian Becker-Asano + 1 more

We introduce the WASABI ([W]ASABI [A]ffect [S]imulation for [A]gents with [B]elievable [I]nteractivity) Simulation Architecture, in which a virtual human's cognitive reasoning capabilities are combined with simulated embodiment to achieve the simulation of primary and secondary emotions. In modeling primary emotions we follow the idea of Core Affect in combination with a continuous progression of bodily feeling in three-dimensional emotion space (PAD space), which is subsequently categorized into discrete emotions. In humans, primary emotions are understood as ontogenetically earlier emotions, which directly influence facial expressions. Secondary emotions, in contrast, afford the ability to reason about current events in the light of experiences and expectations. By technically representing aspects of each secondary emotion's connotative meaning in PAD space, we not only assure their mood-congruent elicitation but also combine them with facial expressions that are concurrently driven by primary emotions. Results of an empirical study suggest that human players in a card game scenario judge our virtual human MAX significantly older when secondary emotions are simulated in addition to primary ones.

  • Research Article
  • Cited by 4
  • 10.3390/app13095724
Semi-Supervised Learning for Robust Emotional Speech Synthesis with Limited Data
  • May 6, 2023
  • Applied Sciences
  • Jialin Zhang + 3 more

Emotional speech synthesis is an important branch of human–computer interaction technology that aims to generate emotionally expressive and comprehensible speech from input text. With the rapid development of deep-learning-based speech synthesis, research on affective speech synthesis has gradually attracted scholarly attention. However, due to the lack of quality emotional speech corpora, emotional speech synthesis under low-resource conditions is prone to overfitting, exposure bias, catastrophic forgetting, and other problems that lead to unsatisfactory synthesized speech. In this paper, we propose an emotional speech synthesis method that integrates transfer learning, semi-supervised training, and a robust attention mechanism to better adapt to the emotional style of the speech data during fine-tuning. By adopting an appropriate fine-tuning strategy, trade-off parameter configuration, and pseudo-labels incorporated into the loss function, we efficiently guide regularized learning of emotional speech synthesis. The proposed SMAL-ET2 method outperforms the baseline methods in both subjective and objective evaluations. It is demonstrated that our training strategy with stepwise monotonic attention and a semi-supervised loss can alleviate overfitting and improve the generalization ability of the text-to-speech model. Our method also enables the model to synthesize different categories of emotional speech with better naturalness and emotion similarity.
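SMAL-ET2 combines fine-tuning with pseudo-labels expressed as additional loss terms weighted by trade-off parameters. The sketch below shows what such a combined loss can look like in general; the term names, weighting, and tensor shapes are assumptions and do not reproduce the paper's formulation.

```python
# Generic sketch of a semi-supervised TTS fine-tuning loss:
# supervised reconstruction term + pseudo-label term weighted by a trade-off factor.
# The weighting and term names are assumptions, not the SMAL-ET2 formulation.
import torch
import torch.nn.functional as F

def combined_loss(pred_mel: torch.Tensor, target_mel: torch.Tensor,
                  pred_emotion_logits: torch.Tensor, pseudo_emotion: torch.Tensor,
                  trade_off: float = 0.3) -> torch.Tensor:
    recon = F.l1_loss(pred_mel, target_mel)                        # supervised term
    pseudo = F.cross_entropy(pred_emotion_logits, pseudo_emotion)  # pseudo-label term
    return recon + trade_off * pseudo

# Toy tensors standing in for model outputs and pseudo-labels.
loss = combined_loss(torch.rand(2, 80, 100), torch.rand(2, 80, 100),
                     torch.randn(2, 5), torch.tensor([1, 3]))
print(loss.item())
```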

  • Book Chapter
  • 10.1007/978-981-10-8111-8_11
Using Mandarin Training Corpus to Realize a Mandarin-Tibetan Cross-Lingual Emotional Speech Synthesis
  • Jan 1, 2018
  • Peiwen Wu + 2 more

This paper presents a hidden Markov model (HMM)-based Mandarin-Tibetan cross-lingual emotional speech synthesis method that uses an emotional Mandarin speech corpus with speaker adaptation. We first train a set of average acoustic models by speaker adaptive training with a one-speaker neutral Tibetan corpus and a multi-speaker neutral Mandarin corpus. We then train a set of speaker-dependent acoustic models of the target emotion, which are used to synthesize emotional Tibetan or Mandarin speech, by speaker adaptation with the target emotional Mandarin corpus. Subjective evaluations and objective tests show that the method can synthesize both emotional Mandarin speech and emotional Tibetan speech with high naturalness and emotional similarity. The method can therefore be adopted to realize emotional speech synthesis from an existing emotional training corpus for languages lacking emotional speech resources.

  • Conference Article
  • Cited by 59
  • 10.21437/icslp.1998-147
Emotional speech synthesis: from speech database to TTS
  • Nov 30, 1998
  • Juan Manuel Montero + 5 more

Modern speech synthesisers have achieved a high degree of intelligibility but cannot be regarded as natural-sounding devices. In order to decrease the monotony of synthetic speech, the implementation of emotional effects is now being progressively considered. This paper presents a thorough study of emotional speech in Spanish and its application to TTS, presenting a prototype system that simulates emotional speech using a commercial synthesiser. The design and recording of a Spanish database are described, along with the analysis of the emotional prosody (by fitting the data to a formal model). Using the collected data, a rule-based simulation of three primary emotions was implemented in the text-to-speech system. Finally, the assessment of the synthetic voice through perception experiments classifies the system as capable of producing quality voice with recognisable emotional effects.

  • Conference Article
  • Cited by 5
  • 10.23919/apsipa.2018.8659599
A DNN-based emotional speech synthesis by speaker adaptation
  • Nov 1, 2018
  • Hongwu Yang + 2 more

The paper proposes a deep neural network (DNN)-based emotional speech synthesis method that improves the quality of synthesized emotional speech through speaker adaptation with a multi-speaker, multi-emotion speech corpus. First, a text analyzer is employed to obtain contextual labels from sentences, while the WORLD vocoder is used to extract acoustic features from the corresponding speech. Then a set of speaker-independent DNN average voice models is trained with the contextual labels and acoustic features of the multi-emotion speech corpus. Finally, speaker adaptation is used to train a set of speaker-dependent DNN voice models of the target emotion with target emotional training speech. The target emotional speech is synthesized by the speaker-dependent DNN voice models. Subjective evaluations show that, compared with the traditional hidden Markov model (HMM)-based method, the proposed method achieves higher opinion scores. Objective tests demonstrate that the spectrum of the emotional speech synthesized by the proposed method is also closer to the original speech than that of the speech synthesized by the HMM-based method. Therefore, the proposed method can improve the emotional expressiveness and naturalness of synthesized emotional speech.
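The pipeline first trains speaker-independent average voice models on a multi-speaker, multi-emotion corpus and then adapts them to the target emotion. A minimal fine-tuning sketch in that spirit follows; the network, feature dimensions, and learning rates are placeholder assumptions rather than the authors' configuration.

```python
# Minimal sketch of DNN average-voice training followed by adaptation to a target
# emotion. The architecture, feature sizes, and learning rates are placeholders.
import torch
import torch.nn as nn

def make_acoustic_model(label_dim: int = 300, acoustic_dim: int = 187) -> nn.Module:
    # Maps contextual label vectors to vocoder acoustic features (e.g. WORLD params).
    return nn.Sequential(nn.Linear(label_dim, 512), nn.ReLU(),
                         nn.Linear(512, 512), nn.ReLU(),
                         nn.Linear(512, acoustic_dim))

def train(model, loader, lr, epochs):
    opt, loss_fn = torch.optim.Adam(model.parameters(), lr=lr), nn.MSELoss()
    for _ in range(epochs):
        for labels, feats in loader:
            opt.zero_grad()
            loss_fn(model(labels), feats).backward()
            opt.step()

model = make_acoustic_model()
# 1) Train the average voice model on the multi-speaker, multi-emotion corpus:
#    train(model, multi_emotion_loader, lr=1e-3, epochs=20)
# 2) Adapt to the target emotion with a smaller learning rate and fewer epochs:
#    train(model, target_emotion_loader, lr=1e-4, epochs=5)
```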

  • Research Article
  • Cited by 1
  • 10.1007/s10015-013-0126-9
Speech synthesis of emotions using vowel features of a speaker
  • Oct 30, 2013
  • Artificial Life and Robotics
  • Kanu Boku + 3 more

Recently, methods for adding emotion to synthetic speech have received considerable attention in the field of speech synthesis research. We previously proposed a case-based method for generating emotional synthetic speech by exploiting the characteristics of the maximum amplitude and utterance time of vowels, and the fundamental frequency of emotional speech. In the present study, we propose a method in which our previously reported method is further improved by controlling the fundamental frequency of emotional synthetic speech. As an initial investigation, we adopted the utterance of a Japanese name that is semantically neutral. Using the proposed method, emotional synthetic speech made from the emotional speech of one male subject was discriminable with a mean accuracy of 83.9% when 18 subjects listened to the emotional synthetic utterances of "angry," "happy," "neutral," "sad," or "surprised" for the Japanese names "Taro" and "Hiroko." Further adjustment of the fundamental frequency in the proposed method gave subjects a much clearer impression of the emotion in the synthetic speech.

  • Book Chapter
  • Cited by 6
  • 10.1007/978-3-540-85483-8_83
Do You Know How I Feel? Evaluating Emotional Display of Primary and Secondary Emotions
  • Sep 1, 2008
  • Julia Tolksdorf + 2 more

In this paper we report on an empirical study of how well different facial expressions of primary and secondary emotions [2] can be recognized from the face of our emotional virtual human Max [1]. Primary emotions like happiness are more primitive, ontogenetically earlier types of emotions, which are expressed by a direct mapping onto basic emotion displays; secondary emotions like relief or gloating are considered cognitively more elaborated emotions and require a more subtle rendition. In order to validate the design of our virtual agent, which entails devising facial expressions for both kinds of emotion, we tried to find answers to the following questions: How well can emotions be read from a virtual agent's face by human observers? Are there differences in recognizability between the more primitive primary emotions and the more cognitively elaborated secondary emotions?

  • Research Article
  • Cited by 123
  • 10.1037/a0024838
Why group apologies succeed and fail: Intergroup forgiveness and the role of primary and secondary emotions.
  • Jan 1, 2012
  • Journal of Personality and Social Psychology
  • Michael J A Wohl + 2 more

It is widely assumed that official apologies for historical transgressions can lay the groundwork for intergroup forgiveness, but evidence for a causal relationship between intergroup apologies and forgiveness is limited. Drawing on the infrahumanization literature, we argue that a possible reason for the muted effectiveness of apologies is that people diminish the extent to which they see outgroup members as able to experience complex, uniquely human emotions (e.g., remorse). In Study 1, Canadians forgave Afghanis for a friendly-fire incident to the extent that they perceived Afghanis as capable of experiencing uniquely human emotions (i.e., secondary emotions such as anguish) but not nonuniquely human emotions (i.e., primary emotions such as fear). Intergroup forgiveness was reduced when transgressor groups expressed secondary emotions rather than primary emotions in their apology (Studies 2a and 2b), an effect that was mediated by trust in the genuineness of the apology (Study 2b). Indeed, an apology expressing secondary emotions aroused no more forgiveness than a no-apology control (Study 3) and less forgiveness than an apology with no emotion (Study 4). Consistent with an infrahumanization perspective, effects of primary versus secondary emotional expression did not emerge when the apology was offered for an ingroup transgression (Study 3) or when an outgroup apology was delivered through an ingroup proxy (Study 4). Also consistent with predictions, these effects were demonstrated only by those who tended to deny uniquely human qualities to the outgroup (Study 5). Implications for intergroup apologies and movement toward reconciliation are discussed.

  • Research Article
  • Cited by 56
  • 10.1109/taslp.2022.3145293
MsEmoTTS: Multi-Scale Emotion Transfer, Prediction, and Control for Emotional Speech Synthesis
  • Jan 1, 2022
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • Yi Lei + 3 more

Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years. Previous methods performed expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework, to model emotion at different levels. Specifically, the proposed method is a typical attention-based sequence-to-sequence model with three proposed modules: a global-level emotion presenting module (GM), an utterance-level emotion presenting module (UM), and a local-level emotion presenting module (LM), which model the global emotion category, utterance-level emotion variation, and syllable-level emotion strength, respectively. In addition to modeling emotion at different levels, the proposed method also allows emotional speech to be synthesized in different ways, i.e., transferring the emotion from reference audio, predicting the emotion from the input text, and controlling the emotion strength manually. Extensive experiments conducted on a Chinese emotional speech corpus demonstrate that the proposed method outperforms the compared reference-audio-based and text-based emotional speech synthesis methods on emotion transfer synthesis and text-based emotion prediction synthesis, respectively. The experiments also show that the proposed method can control emotion expression flexibly. Detailed analysis shows the effectiveness of each module and the good design of the proposed method.
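MsEmoTTS conditions synthesis at three levels: a global emotion category, utterance-level variation, and syllable-level strength. The sketch below shows one generic way such multi-scale conditioning vectors could be combined; the module internals, dimensions, and combination rule are assumptions, not the paper's architecture.

```python
# Schematic multi-scale emotion conditioning: global category (GM), utterance-level
# variation (UM), and per-syllable strength (LM) combined into decoder conditioning.
# All dimensions and the combination rule are illustrative assumptions.
import torch
import torch.nn as nn

class MultiScaleEmotion(nn.Module):
    def __init__(self, num_emotions: int = 5, dim: int = 128):
        super().__init__()
        self.global_emb = nn.Embedding(num_emotions, dim)  # GM: emotion category
        self.utterance_proj = nn.Linear(dim, dim)          # UM: utterance variation
        self.strength_proj = nn.Linear(1, dim)             # LM: syllable strength

    def forward(self, emotion_id, utterance_vec, syllable_strength):
        # syllable_strength: (batch, num_syllables, 1) scalar strength per syllable
        g = self.global_emb(emotion_id).unsqueeze(1)         # (B, 1, dim)
        u = self.utterance_proj(utterance_vec).unsqueeze(1)  # (B, 1, dim)
        l = self.strength_proj(syllable_strength)            # (B, S, dim)
        return g + u + l                                     # broadcast to (B, S, dim)

cond = MultiScaleEmotion()(torch.tensor([2]), torch.rand(1, 128), torch.rand(1, 7, 1))
print(cond.shape)  # torch.Size([1, 7, 128])
```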

  • Book Chapter
  • Cited by 1
  • 10.1007/978-3-642-32172-6_11
Speech Synthesis of Emotions Using Vowel Features
  • Jan 1, 2013
  • Kanu Boku + 3 more

Recently, methods for adding emotion to synthetic speech have received considerable attention in the field of speech synthesis research. For generating emotional synthetic speech, it is necessary to control the prosodic features of the utterances. We propose a case-based method for generating emotional synthetic speech by exploiting the characteristics of the maximum amplitude and the utterance time of vowels, and the fundamental frequency of emotional speech. As an initial investigation, we adopted the utterance of Japanese names, which are semantically neutral. By using the proposed method, emotional synthetic speech made from the emotional speech of one male subject was discriminable with a mean accuracy of 70% when ten subjects listened to the emotional synthetic utterances of “angry,” “happy,” “neutral,” “sad,” or “surprised” when the utterance was the Japanese name “Taro.”

  • Research Article
  • Cited by 1
  • 10.4018/ijsi.2013010105
Speech Synthesis of Emotions Using Vowel Features
  • Jan 1, 2013
  • International Journal of Software Innovation
  • Kanu Boku + 3 more

Recently, methods for adding emotion to synthetic speech have received considerable attention in the field of speech synthesis research. For generating emotional synthetic speech, it is necessary to control the prosodic features of the utterances. The authors propose a case-based method for generating emotional synthetic speech by exploiting the characteristics of the maximum amplitude and the utterance time of vowels, and the fundamental frequency of emotional speech. As an initial investigation, they adopted the utterance of Japanese names, which are semantically neutral. By using the proposed method, emotional synthetic speech made from the emotional speech of one male subject was discriminable with a mean accuracy of 70% when ten subjects listened to the emotional synthetic utterances of “angry,” “happy,” “neutral,” “sad,” or “surprised” when the utterance was the Japanese name “Taro.”
