Abstract

The production of emotional speech is determined by the movements of the speaker’s tongue, lips, and jaw. To combine the articulatory and acoustic data of speakers, articulatory-to-acoustic conversion of emotional speech has been studied. In this paper, the parameters of a least squares support vector machine (LSSVM) model were optimized with particle swarm optimization (PSO), and the resulting PSO-LSSVM model was applied to articulatory-to-acoustic conversion. The root mean square error (RMSE) and mean Mel-cepstral distortion (MMCD) were used to evaluate the conversion results: the MMCD of the converted MFCCs is 1.508 dB, and the RMSE of the second formant (F2) is 25.10 Hz. These results can be further applied to feature fusion in emotional speech recognition to improve the accuracy of emotion recognition.
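
A minimal sketch of how these two evaluation metrics are typically computed is given below, assuming the common Mel-cepstral distortion convention (a 10·sqrt(2)/ln(10) scaling over time-aligned frames, excluding the energy coefficient); the paper’s exact definition of MMCD may differ in detail, and the function names are illustrative.

```python
import numpy as np

def mel_cepstral_distortion(mfcc_ref, mfcc_conv):
    """Mean Mel-cepstral distortion (dB) between reference and converted MFCCs.

    Both inputs have shape (n_frames, n_coeffs), are assumed to be
    time-aligned (e.g., by dynamic time warping), and exclude the 0th
    (energy) coefficient. Uses the common 10*sqrt(2)/ln(10) scaling.
    """
    diff = np.asarray(mfcc_ref, float) - np.asarray(mfcc_conv, float)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def rmse(ref, conv):
    """Root mean square error, e.g., between reference and converted F2 tracks (Hz)."""
    ref, conv = np.asarray(ref, float), np.asarray(conv, float)
    return float(np.sqrt(np.mean((ref - conv) ** 2)))
```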

Highlights

  • In recent decades, artificial intelligence (AI) has developed rapidly, and human-computer interaction technology requires harmonious communication between human beings and intelligent machines [1, 2]

  • Articulatory-to-acoustic conversion model based on the least squares support vector machine (LSSVM): the algorithm flow of the LSSVM-based feature conversion model is shown in Figure 1, and the specific process is as follows: (1) articulatory features x (x1, x2, ..., xM) and acoustic features y (y1, y2, ..., yN) are synchronously extracted from the bimodal emotional speech database

  • An articulatory-to-acoustic conversion method based on LSSVM has been proposed, and the particle swarm optimization (PSO) algorithm was used to optimize the model parameters, so as to realize the conversion of articulatory features of Mandarin emotional speech (a sketch of this PSO-LSSVM setup follows this list)
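
Taken together, these points describe an RBF-kernel LSSVM regressor whose two hyperparameters (the regularization constant gamma and the kernel width sigma) are tuned by PSO. The sketch below is illustrative only, not the paper’s implementation: the hyperparameter ranges, PSO constants, and validation-RMSE fitness are assumptions, and in practice one such regressor would be trained per acoustic feature dimension (e.g., per MFCC coefficient or formant).

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

class LSSVMRegressor:
    """Least squares SVM regression with an RBF kernel (single output dimension)."""

    def __init__(self, gamma=10.0, sigma=1.0):
        self.gamma, self.sigma = gamma, sigma

    def fit(self, X, y):
        n = X.shape[0]
        K = rbf_kernel(X, X, self.sigma)
        # LSSVM dual system: [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]
        A = np.zeros((n + 1, n + 1))
        A[0, 1:] = 1.0
        A[1:, 0] = 1.0
        A[1:, 1:] = K + np.eye(n) / self.gamma
        sol = np.linalg.solve(A, np.concatenate(([0.0], np.asarray(y, float))))
        self.b, self.alpha, self.X_train = sol[0], sol[1:], X
        return self

    def predict(self, X):
        return rbf_kernel(X, self.X_train, self.sigma) @ self.alpha + self.b

def pso_tune_lssvm(X_tr, y_tr, X_val, y_val, n_particles=10, n_iters=20, seed=0):
    """Tune (gamma, sigma) of the LSSVM with a basic particle swarm.

    Particles move in log10 space of the two hyperparameters and are scored by
    validation RMSE; the bounds and PSO constants below are illustrative.
    """
    rng = np.random.default_rng(seed)
    lo, hi = np.array([-1.0, -2.0]), np.array([3.0, 2.0])  # log10(gamma), log10(sigma) bounds
    pos = rng.uniform(lo, hi, size=(n_particles, 2))
    vel = np.zeros_like(pos)

    def fitness(p):
        model = LSSVMRegressor(gamma=10 ** p[0], sigma=10 ** p[1]).fit(X_tr, y_tr)
        return float(np.sqrt(np.mean((model.predict(X_val) - y_val) ** 2)))

    pbest, pbest_val = pos.copy(), np.array([fitness(p) for p in pos])
    gbest = pbest[np.argmin(pbest_val)].copy()
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration constants
    for _ in range(n_iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([fitness(p) for p in pos])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = pos[better], vals[better]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return 10 ** gbest[0], 10 ** gbest[1]  # best (gamma, sigma)
```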

Introduction

Artificial intelligence (AI) has developed rapidly, and human-computer interaction technology requires harmonious communication between human beings and intelligent machines [1, 2]. Since the notion of affective computing was proposed by the MIT Media Lab, many physiological signals have been successively applied as characteristic information in emotional speech recognition research, helping computers better analyze the emotional state of speakers from their speech signals [1]. Although the kinematic data of the articulators are an important part of emotional speech production, they have not been widely used in speech emotion recognition research [4]. Exploiting such data requires a model of the mapping between articulatory and acoustic features.

The typical statistical mapping approach is the codebook-based method, which builds a codebook storing paired acoustic and articulatory features and searches it for the optimal pair, thereby establishing the relationship between articulatory and acoustic features. This method was first proposed in 1996 by Hogden et al. [13], who constructed the mapping relationship between acoustic features and articulatory features using vector quantization to encode the features.
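
As a rough illustration of the codebook idea (not Hogden et al.’s exact procedure), the sketch below uses k-means as the vector quantizer on the source features and stores, for each code word, the mean of the paired target features; conversion is then a nearest-code-word lookup. The function names, clustering choice, and codebook size are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def build_codebook(source_feats, target_feats, n_codes=64, seed=0):
    """Vector-quantize the source frames (e.g., acoustic features) and pair each
    code word with the mean of the target frames (e.g., articulatory features)
    assigned to it. Assumes every code word receives at least one frame."""
    centroids, labels = kmeans2(source_feats, n_codes, minit='++', seed=seed)
    paired = np.stack([target_feats[labels == k].mean(axis=0) for k in range(n_codes)])
    return centroids, paired

def codebook_convert(source_frames, centroids, paired):
    """Convert new source frames by nearest-code-word lookup in the codebook."""
    labels, _ = vq(source_frames, centroids)
    return paired[labels]
```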
