Speech Signal Imaging and Emotion Recognition Based on Symmetric-Diagonal Matrix Model
Speech is the main medium of communication in our daily lives; it conveys thoughts and expresses emotional states between humans. The goal of speech emotion recognition is to recognize human emotional states from speech. Speech emotion recognition mainly includes two steps: feature extraction and classifier construction. The extracted speech features are usually spectral features. The speech signal is an approximately continuous input, and its spectral features carry a considerable amount of information, such as speech content, rhythm, tone, and intonation. However, feature extraction for emotional speech is still an immature research direction. Influenced by the success of computer vision, visualizing speech signals has become a new way to analyze the acoustic features of speech for emotion recognition. Based on the Gramian Angular Field method, this research uses a variety of neural network models to extract speech features and recognize speech emotions. The experiments show that it is feasible to visualize speech signals and to use the resulting images for emotion recognition. In the future, we will further optimize the network model and combine it with other acoustic features, such as speech content and rhythm, to perform speech emotion recognition in real-life settings.
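To make the signal-to-image step concrete, the following is a minimal sketch of a Gramian Angular (summation) Field transform in Python. The resampling length, the summation (rather than difference) variant, and the synthetic input are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np

def gramian_angular_field(signal, size=64):
    """Map a 1-D signal to a Gramian Angular (summation) Field image."""
    # Average the signal into a fixed number of segments.
    idx = np.linspace(0, len(signal), size + 1).astype(int)
    x = np.array([signal[a:b].mean() for a, b in zip(idx[:-1], idx[1:])])

    # Rescale to [-1, 1] so the polar encoding (arccos) is defined.
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-12) - 1
    x = np.clip(x, -1.0, 1.0)

    # Pairwise angular summation yields a size x size image.
    phi = np.arccos(x)
    return np.cos(phi[:, None] + phi[None, :])

# Example with a synthetic tone standing in for an utterance.
wave = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 16000))
print(gramian_angular_field(wave).shape)  # (64, 64)
```

The resulting image can then be fed to any image classification network.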
- Conference Article
18
- 10.1109/gcce53005.2021.9621810
- Oct 12, 2021
In this study, a speech emotion recognition method that uses both acoustic and linguistic features is studied. Various emotion recognition methods using both the abovementioned types of features have been proposed. However, most studies that use linguistic features are based on reference transcripts because emotional speech recognition is considered more difficult than non-emotional speech recognition. The acoustic features of emotional speech differ from those of non-emotional speech, and these features vary greatly depending on the emotion type and intensity. We have been studying a new emotional speech recognition method that uses a combination of both acoustic model and language model adaptation and thereby achieved high recognition performance on an emotional speech task. In this study, we attempt to extract linguistic features using speech recognition results. The word recognition accuracy of the system was 82.2%, and recognition errors were observed. Despite this, the linguistic features extracted from the recognition results are useful, and we demonstrate that the combination of linguistic and acoustic features is effective for emotion recognition.
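As a rough illustration of combining the two feature types, the sketch below concatenates placeholder acoustic statistics with bag-of-words counts from (assumed) recognition transcripts and trains a generic classifier. The specific features, transcripts, and classifier are invented for illustration and are not the models used in the study.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder ASR transcripts, emotion labels, and acoustic statistics.
transcripts = ["i am so happy today", "leave me alone", "that is fine"]
labels = ["happy", "angry", "neutral"]
acoustic = np.random.rand(3, 24)            # assumed per-utterance acoustic vector

# Linguistic features: bag-of-words counts over the recognized words.
linguistic = CountVectorizer().fit_transform(transcripts).toarray()

# Early (feature-level) fusion: concatenate and classify.
fused = np.hstack([acoustic, linguistic])
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print(clf.predict(fused[:1]))
```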
- Conference Article
1
- 10.21437/interspeech.2009-585
- Sep 6, 2009
As a hot research field, speech emotion recognition has attracted increasing attention from both academia and industry. In this paper, we propose a method to recognize speech emotions using ANNs and to fuse two kinds of recognitions based on different features at the decision level. Each emotional utterance is first recognized by several individual recognizers. The outputs of these recognizers are then fused using a voting strategy. Furthermore, the dimensionality of the supervectors constructed from spectral features is reduced through PCA. Experimental results demonstrate that the proposed decision fusion is effective and the dimensionality reduction is feasible.

Index Terms: speech emotion recognition, ANN, decision fusion

1. Introduction
Speech is a dominant tool for communication, and it is also an important and effective means of transmitting information and human emotions. With the increasing role of speech interfaces in human-machine interaction applications, speech emotion recognition has become more and more important. Speech emotion recognition is an interesting and challenging speech technology that can be applied to broad areas, such as call centers [1], treatment of mental and psychological diseases [2], development of education and entertainment software [3], and so on. Speech emotion recognition deals with how to make computers automatically recognize various emotions in the speech signal by extracting and analyzing acoustic features. A key problem of speech emotion recognition is which kinds of speech features can be used to represent human emotions. Some researchers have investigated the relations between features and emotions and, through their efforts, many speech features have been found useful for emotion recognition. Statistical features based on prosody and voice quality have been widely used in speech emotion recognition and have demonstrated considerable recognition success [4, 5]. Besides statistical features, spectral or cepstral features are another effective group for describing emotional states [6, 7]. Since both groups of features have played significant roles in speech emotion recognition, it is necessary to explore an effective way to fuse them complementarily to further enhance recognition performance. Another key issue of speech emotion recognition is how to choose an effective method to classify speech emotions. So far, many pattern classification methods have been used for speech emotion recognition [6-9], such as Support Vector Machines (SVM), k-Nearest Neighbors (k-NN), Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), and Artificial Neural Networks (ANN). These methods are all feasible, but their performance differs considerably. SVM-based methods have been shown to be robust and to perform well, but some recent studies have indicated that comparable performance may also be obtained with ANNs. However, it is difficult to determine which kind of ANN is suitable for emotion recognition, and it is necessary to compare its performance with SVMs. In this paper, an ANN-based decision fusion approach for speech emotion recognition is presented. First, four different ANNs are used to recognize the emotions. Then a voting scheme is adopted to fuse the recognitions obtained from the two kinds of features at the decision level.
Experimental results demonstrate that the proposed approach improves the performance of ANN-based recognition and that its accuracy is comparable with the SVM-based method. The remainder of this paper is organized as follows. The features used for speech emotion recognition are introduced in Section 2. The principles of PCA and ANN are briefly described in Sections 3 and 4, respectively. The proposed decision fusion is described in Section 5. In Section 6, experiments and discussion of the experimental results are presented. In Section 7, conclusions are drawn and future work is suggested.
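The two main ingredients, PCA on the spectral supervectors and decision-level voting over several ANN recognizers, can be sketched as follows. The feature dimensions, network sizes, and random data are placeholders, not the paper's configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
prosodic = rng.random((120, 30))        # placeholder prosodic features
spectral = rng.random((120, 600))       # placeholder spectral supervectors
y = rng.integers(0, 4, size=120)        # four emotion classes

# Reduce the supervector dimensionality with PCA.
spectral_low = PCA(n_components=20).fit_transform(spectral)

# Train one ANN recognizer per feature type.
recognizers = [
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(prosodic, y),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(spectral_low, y),
]
inputs = [prosodic, spectral_low]

# Fuse the individual decisions by majority voting.
votes = np.stack([m.predict(x) for m, x in zip(recognizers, inputs)])
fused = np.array([np.bincount(col).argmax() for col in votes.T])
print((fused == y).mean())
```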
- Research Article
16
- 10.1016/j.iswa.2024.200351
- Mar 11, 2024
- Intelligent Systems with Applications
In-depth investigation of speech emotion recognition studies from past to present –The importance of emotion recognition from speech signal for AI–
- Research Article
- 10.55041/ijsrem48705
- May 26, 2025
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
Speech signals are considered one of the most effective means of communication between human beings. Many researchers have developed different methods or systems to identify emotions from speech signals. Here, various features of speech are used to classify emotions; features such as pitch, tone, and intensity are essential for classification. A large number of datasets are available for speech emotion recognition. First, features are extracted from the emotional speech; the next important part is the classification of emotions based on the speech. Hence, different classifiers are used to classify emotions such as Happy, Sad, Anger, Surprise, and Neutral, although there are also other approaches based on machine learning algorithms for identifying emotions. Speech Emotion Recognition is a current research topic because of its wide range of applications, and it has also become a challenge in the field of speech processing. We have carried out a brief study on speech emotion analysis along with emotion recognition. Speech Emotion Recognition (SER) can be defined as the extraction of the emotional state of the speaker from his or her speech signal. There are a few universal emotions, including Neutral and Anger, and we have worked on different tools to be used in SER. SER is difficult because emotions are subjective and annotating audio is a challenging task. Emotion recognition is a part of speech recognition that is gaining more popularity, and the need for it is increasing enormously. We classify the different types of emotions to be detected from speech. Key Words: Speech Emotion Recognition, Affective Computing, Machine Learning, Deep Learning, Audio Signal Processing, Emotion Classification, Feature Extraction, Prosodic Features, Spectral Features, Mel-Frequency Cepstral Coefficients (MFCCs), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), Attention Mechanisms, Multimodal Emotion Recognition, Speaker-Independent SER, Real-Time Emotion Detection, Noise-Robust Emotion Recognition, Data Augmentation, Emotion-Aware Applications
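The prosodic and spectral features named above (pitch, intensity, MFCCs) are typically computed with a standard audio library. The sketch below uses librosa on a synthetic tone; the feature choices and frame settings are illustrative assumptions.

```python
import numpy as np
import librosa

sr = 16000
y = 0.5 * np.sin(2 * np.pi * 220 * np.linspace(0, 2, 2 * sr))  # stand-in utterance

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)     # spectral shape
f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr)           # pitch contour
energy = librosa.feature.rms(y=y)[0]                    # intensity

# Utterance-level statistics commonly fed to a classifier.
features = np.concatenate([
    mfcc.mean(axis=1), mfcc.std(axis=1),
    [f0.mean(), f0.std(), energy.mean(), energy.std()],
])
print(features.shape)  # (30,)
```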
- Conference Article
14
- 10.1109/tencon.2015.7372840
- Nov 1, 2015
Emotional information in the speech signal is an important information resource. When verbal expression is combined with human emotion, emotional speech processing is no longer a matter of simple mathematical models or pure calculation. Fluctuations of mood are governed by the brain's perception, so speech signal processing grounded in cognitive psychology can capture emotion better. In this paper, the relevance of human cognition to speech emotion is analyzed first, and recent progress in speech emotion recognition is summarized, including a review of speech emotion databases, feature extraction, and emotion recognition networks. Second, a fuzzy cognitive map network based on cognitive psychology is introduced into emotional speech recognition. In addition, the mechanism by which the human brain perceives emotional speech is explored. To improve recognition accuracy, this report also attempts to integrate event-related potentials into speech emotion recognition. This idea outlines a conception and prospect for combining speech emotion recognition with cognitive psychology in the future.
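For reference, a fuzzy cognitive map is a directed graph of concepts whose activations are updated through a squashed weighted sum. The toy concepts and weights below are invented placeholders, not the map proposed in the paper.

```python
import numpy as np

def fcm_step(state, W, lam=1.0):
    """One fuzzy-cognitive-map update: squashed weighted sum of activations."""
    return 1.0 / (1.0 + np.exp(-lam * (state @ W)))

# Toy concepts: [pitch rise, energy, perceived arousal, anger].
W = np.array([[0.0, 0.0, 0.6, 0.3],
              [0.0, 0.0, 0.5, 0.4],
              [0.0, 0.0, 0.0, 0.7],
              [0.0, 0.0, 0.0, 0.0]])
state = np.array([0.8, 0.6, 0.0, 0.0])
for _ in range(5):          # iterate until the activations settle
    state = fcm_step(state, W)
print(state.round(3))
```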
- Book Chapter
- 10.1007/978-3-319-24033-6_11
- Jan 1, 2015
Speech signals are non-stationary processes that change in time and frequency. The structure of a speech signal is also affected by the presence of several paralinguistic phenomena such as emotions, pathologies, and cognitive impairments, among others. Non-stationarity can be modeled using several parametric techniques. A novel approach based on time-dependent auto-regressive moving average (TARMA) models is proposed here to model the non-stationarity of speech signals. The model is tested on the recognition of fear-type emotions in speech. The proposed approach is applied to model syllables and unvoiced segments extracted from recordings of the Berlin and enterface05 databases. The results indicate that TARMA models can be used for the automatic recognition of emotions in speech.
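The following is not the authors' TARMA estimator, but a heavily simplified illustration of the underlying idea: fitting a separate short autoregressive model to each frame so that the coefficients themselves form a time series tracking the signal's changing dynamics. The frame length and model order are arbitrary assumptions.

```python
import numpy as np

def framewise_ar(signal, order=4, frame=400, hop=200):
    """Fit an AR(order) model per frame via least squares; coefficients vary over time."""
    coeffs = []
    for start in range(0, len(signal) - frame, hop):
        x = signal[start:start + frame]
        # Lagged regression  x[t] ~ sum_k a_k * x[t-k-1].
        X = np.column_stack([x[order - k - 1:-k - 1] for k in range(order)])
        a, *_ = np.linalg.lstsq(X, x[order:], rcond=None)
        coeffs.append(a)
    return np.array(coeffs)            # shape: (n_frames, order)

sig = np.sin(2 * np.pi * np.linspace(0, 40, 8000) ** 1.2)   # chirp-like test signal
print(framewise_ar(sig).shape)
```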
- Research Article
52
- 10.1109/access.2020.2990974
- Jan 1, 2020
- IEEE Access
Driven by the vision of the Internet of Things, some research efforts have already focused on designing efficient speech recognition networks for edge computing. Other approaches (such as tpool2) do not make full use of the spatial and temporal information in the acoustic features of speech. In this paper, we propose a compact speech recognition network with spatio-temporal features for edge computing, named EdgeRNN. EdgeRNN uses a 1-Dimensional Convolutional Neural Network (1-D CNN) to process the overall spatial information of each frequency domain of the acoustic features, and a Recurrent Neural Network (RNN) to process the temporal information of each frequency domain. In addition, we propose a simplified attention mechanism to enhance the portion of the network that contributes to the final identification. The overall performance of EdgeRNN has been verified on speech emotion and keyword recognition. The IEMOCAP dataset is used for speech emotion recognition, where the unweighted average recall (UAR) reaches 63.98%. Speech keyword recognition uses Google's Speech Commands Dataset V1, with a weighted average recall (WAR) of 96.82%. Compared with the experimental results of related efficient networks on a Raspberry Pi 3B+, EdgeRNN improves accuracy on both speech emotion and keyword recognition.
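A rough PyTorch sketch of the described combination (1-D convolution over the spectral axis, an RNN over time, and a simple softmax attention pooling) is given below. Layer sizes and the exact wiring are assumptions, not the published EdgeRNN architecture.

```python
import torch
import torch.nn as nn

class EdgeRNNSketch(nn.Module):
    def __init__(self, n_mels=40, n_classes=4, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(1, 8, kernel_size=5, padding=2)    # over frequency
        self.rnn = nn.GRU(8 * n_mels, hidden, batch_first=True)  # over time
        self.score = nn.Linear(hidden, 1)                        # attention score
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, spec):                  # spec: (batch, time, n_mels)
        b, t, f = spec.shape
        x = spec.reshape(b * t, 1, f)
        x = torch.relu(self.conv(x)).reshape(b, t, -1)
        h, _ = self.rnn(x)                    # (batch, time, hidden)
        w = torch.softmax(self.score(h), dim=1)
        pooled = (w * h).sum(dim=1)           # attention-weighted pooling
        return self.out(pooled)

logits = EdgeRNNSketch()(torch.randn(2, 100, 40))
print(logits.shape)  # torch.Size([2, 4])
```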
- Research Article
20
- 10.1177/2059204318762650
- Jan 1, 2018
- Music & Science
The acoustic cues that convey emotion in speech are similar to those that convey emotion in music, and recognition of emotion in both of these types of cue recruits overlapping networks in the brain. Given the similarities between music and speech prosody, developmental research is uniquely positioned to determine whether recognition of these cues develops in parallel. In the present study, we asked 60 children aged 6 to 11 years, and 51 university students, to judge the emotions of 10 musical excerpts, 10 inflected speech clips, and 10 affect burst clips. We presented stimuli intended to convey happiness, sadness, anger, fear, and pride. Each emotion was presented twice per type of stimulus. We found that recognition of emotions in music and speech developed in parallel, and that adult levels of recognition develop later for these stimuli than for affect bursts. We also found that sad stimuli were most easily recognised, followed by happiness, fear, and then anger. In addition, we found that recognition of emotion in speech and affect bursts can predict emotion recognition in music stimuli independently of age and musical training. Finally, although proud speech and affect bursts were not well recognised, children aged eight years and older showed adult-like responses in recognition of proud music.
- Research Article
40
- 10.1007/s00521-013-1377-z
- Mar 29, 2013
- Neural Computing and Applications
Emotion recognition in speech signals is currently a very active research topic and has attracted much attention within the engineering application area. This paper presents a new approach to robust emotion recognition in speech signals in noisy environments. By using a weighted sparse representation model based on maximum likelihood estimation, an enhanced sparse representation classifier is proposed for robust emotion recognition in noisy speech. The effectiveness and robustness of the proposed method are investigated on clean and noisy emotional speech. The proposed method is compared with six typical classifiers, including a linear discriminant classifier, K-nearest neighbors, a C4.5 decision tree, radial basis function neural networks, support vector machines, and the sparse representation classifier. Experimental results on two publicly available emotional speech databases, the Berlin database and the Polish database, demonstrate the promising performance of the proposed method on the task of robust emotion recognition in noisy speech, outperforming the other methods used.
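For orientation, the sketch below shows the standard (unweighted) sparse representation classifier that the paper builds on, not its weighted maximum-likelihood variant: a test vector is coded sparsely over a dictionary of training columns, and the class whose atoms reconstruct it with the smallest residual wins. All data are random placeholders.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def src_predict(D, labels, x, n_nonzero=10):
    """Classic SRC: sparse-code x over dictionary D, pick class with lowest residual."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero,
                                    fit_intercept=False).fit(D, x)
    alpha = omp.coef_
    residuals = {c: np.linalg.norm(x - D[:, labels == c] @ alpha[labels == c])
                 for c in np.unique(labels)}
    return min(residuals, key=residuals.get)

rng = np.random.default_rng(1)
D = rng.standard_normal((60, 200))             # 200 training feature vectors as columns
labels = rng.integers(0, 5, size=200)          # five emotion classes
x = D[:, 3] + 0.05 * rng.standard_normal(60)   # noisy copy of a training vector
print(src_predict(D, labels, x))
```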
- Research Article
68
- 10.1016/j.apacoust.2023.109492
- Jun 28, 2023
- Applied Acoustics
Emotional speech Recognition using CNN and Deep learning techniques
- Research Article
15
- 10.3390/s23031355
- Jan 25, 2023
- Sensors (Basel, Switzerland)
Speech reflects people’s mental state, and using a microphone sensor is a potential method for human–computer interaction. Speech recognition using this sensor is conducive to the diagnosis of mental illnesses. The gender difference between speakers affects the process of speech emotion recognition based on specific acoustic features, resulting in a decline in emotion recognition accuracy. Therefore, we believe that the accuracy of speech emotion recognition can be effectively improved by selecting different speech features for emotion recognition based on the speech representations of the different genders. In this paper, we propose a speech emotion recognition method based on gender classification. First, we use an MLP to classify the original speech by gender. Second, based on the different acoustic features of male and female speech, we analyze the influence weights of multiple speech emotion features in male and female speech and establish the optimal feature sets for male and female emotion recognition, respectively. Finally, we train and test a CNN and a BiLSTM using the male and female speech emotion feature sets, respectively. The results show that the proposed emotion recognition models have an advantage in terms of average recognition accuracy compared with gender-mixed recognition models.
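The gender-gated pipeline can be sketched as follows, with simple scikit-learn MLPs standing in for the paper's CNN and BiLSTM emotion models; the feature subsets, shapes, and random data are placeholder assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 40))                   # placeholder utterance-level features
gender = rng.integers(0, 2, size=200)       # 0 = male, 1 = female
emotion = rng.integers(0, 4, size=200)

male_cols = np.arange(0, 25)                # assumed gender-specific feature sets
female_cols = np.arange(15, 40)

gender_clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500).fit(X, gender)
male_clf = MLPClassifier(max_iter=500).fit(X[gender == 0][:, male_cols],
                                           emotion[gender == 0])
female_clf = MLPClassifier(max_iter=500).fit(X[gender == 1][:, female_cols],
                                             emotion[gender == 1])

def predict(x):
    """Route an utterance to the gender-specific emotion model."""
    if gender_clf.predict(x[None])[0] == 0:
        return male_clf.predict(x[None, male_cols])[0]
    return female_clf.predict(x[None, female_cols])[0]

print(predict(X[0]))
```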
- Conference Article
12
- 10.1109/icpics47731.2019.8942545
- Jul 1, 2019
In this paper, the key issues of multi-modal fusion-based speech emotion recognition are studied in depth: speech signal preprocessing, feature extraction, fusion strategy, fusion method, and emotion classification. A theoretical model of emotion recognition is constructed, and feature fusion and classification algorithms are investigated. The speech-based emotion recognition component processes the speech signal with an SVM and calculates the maximum and minimum values of the features. The specific methods and process of emotional feature extraction are introduced in detail, and the extracted features are analyzed through emotion classification and recognition. The recognition results obtained verify the validity of the extracted features.
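Since the abstract only outlines the SVM stage, the sketch below illustrates one plausible reading: each utterance's frame-level features are summarized by their minimum and maximum values and classified with an SVM. The data shapes and choice of statistics are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Placeholder frame-level features (e.g., MFCCs) for 100 utterances.
frames = [rng.random((rng.integers(80, 120), 13)) for _ in range(100)]
y = rng.integers(0, 4, size=100)            # emotion labels

# Utterance-level descriptor: per-dimension minimum and maximum.
X = np.array([np.concatenate([f.min(axis=0), f.max(axis=0)]) for f in frames])
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X[:3]))
```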
- Research Article
1
- 10.1007/s11055-011-9421-x
- Apr 20, 2011
- Neuroscience and Behavioral Physiology
The experimental-theoretical aims of the present study were to investigate the ability of humans to evaluate emotions in speech in relation to individual EEG characteristics and to compare clinical and electrophysiological data. Profound impairments to the recognition of emotions in speech were seen in subjects with lesions to the right temporal area, while the most significant defects in recognition were associated with frontal-temporal focal lesions. EEG studies of two groups of subjects, with high and low levels of recognition of emotions in speech, showed high levels of activation of the posterior temporal area of the right hemisphere and anterior leads of the left hemisphere in subjects with poor discrimination of the emotional tone of speech. Clinical and electrophysiological data lead to the conclusion that the recognition of emotions in speech may involve not only the temporal area of the right hemisphere, but also the speech centers in the left hemisphere.
- Research Article
8
- 10.17694/bajece.419557
- Apr 30, 2018
- Balkan Journal of Electrical and Computer Engineering
In several applications, emotion recognition from the speech signal has been a research topic for many years. Many systems have been developed to determine emotions from the speech signal. To address the speaker emotion recognition problem, a hybrid model is proposed to classify five speech emotions: anger, sadness, fear, happiness, and neutral. The aim of this study was to realize an automatic voice and speech emotion recognition system using a hybrid model that takes Turkish sound forms and properties into consideration. Approximately 3000 Turkish voice samples of words and clauses with differing lengths were collected from 25 males and 25 females; an authentic and unique Turkish database has thus been used in this study. Features of these voice samples were obtained using Mel Frequency Cepstral Coefficients (MFCC) and Mel Frequency Discrete Wavelet Coefficients (MFDWC). Moreover, the spectral features of these voice samples were processed using a Support Vector Machine (SVM). The feature vectors of the voice samples were trained with methods such as the Gaussian Mixture Model (GMM), Artificial Neural Network (ANN), Dynamic Time Warping (DTW), Hidden Markov Model (HMM), and a hybrid model (GMM combined with SVM). The hybrid model was built by combining the SVM and the GMM. In the first stage of this model, the SVM was applied to subsets of the spectral feature vectors; in the second phase, training and test sets were formed from these spectral features. In the test phase, the owner of a given voice sample was identified using the trained voice samples. The results and performance of the classification algorithms employed in the study are also presented in a comparative manner.
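The described hybrid can be assembled in more than one way; one common construction, sketched below, fits a GMM per emotion class, maps each utterance to its vector of class log-likelihoods, and lets an SVM make the final decision on those scores. All data and settings here are placeholders, not the study's Turkish corpus or configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.random((250, 13))            # placeholder MFCC-like feature vectors
y = rng.integers(0, 5, size=250)     # anger, sadness, fear, happiness, neutral

# One GMM per emotion class.
gmms = {c: GaussianMixture(n_components=2, random_state=0).fit(X[y == c])
        for c in np.unique(y)}

# Score vector: log-likelihood of each utterance under each class GMM.
scores = np.column_stack([gmms[c].score_samples(X) for c in sorted(gmms)])

# The SVM makes the final decision on the GMM scores.
svm = SVC(kernel="rbf").fit(scores, y)
print((svm.predict(scores) == y).mean())
```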
- Research Article
4
- 10.1186/1687-6180-2012-15
- Jan 19, 2012
- EURASIP Journal on Advances in Signal Processing
As research in speech processing has matured, attention has gradually shifted from linguistic-related applications such as speech recognition towards paralinguistic speech processing problems, in particular the recognition of speaker identity, language, emotion, gender, and age. Determination of a speaker’s emotion or mental state is a particularly challenging problem, in view of the significant variability in its expression posed by linguistic, contextual, and speaker-specific characteristics within speech. In response, a range of signal processing and pattern recognition methods have been developed in recent years. Recognition of emotion and mental state from speech is a fundamentally multidisciplinary field, comprising contributions from psychology, speech science, linguistics, (cooccurring) nonverbal communication, machine learning, artificial intelligence and signal processing, among others. Some of the key research problems addressed to date include isolating sources of emotion-specific information in the speech signal, extracting suitable features, forming reduced-dimension feature sets, developing machine learning methods applicable to the task, reducing feature variability due to speaker and linguistic content, comparing and evaluating diverse methods, robustness, and constructing suitable databases. Studies examining the relationships between the psychological basis of emotion, the effect of emotion on speech production, and the measurable differences in the speech signal due to emotion have helped to shed light on these problems; however, substantial research is still required. Taking a broader view of emotion as a mental state, signal processing researchers have also explored the possibilities of automatically detecting other types of mental state which share some characteristics with emotion, for example stress, depression, cognitive load, and ‘cognitive epistemic’ states such as interest, scepticism, etc. The recent interest in emotion recognition research has seen applications in call centre analytics, human-machine and human-robot interfaces, multimedia retrieval, surveillance tasks, behavioural health informatics, and improved speech recognition. This special issue comprises nine articles covering a range of topics in signal processing methods for vocal source and acoustic feature extraction, robustness issues, novel applications of pattern recognition techniques, methods for detecting mental states and recognition of non-prototypical spontaneous and naturalistic emotion in speech. These articles were accepted following peer review, and each submission was handled by an editor who was independent from all authors listed in that manuscript. Herein, we briefly introduce the articles comprising this special issue. Trevino, Quatieri and Malyska bring a new level of sophistication to an old problem, detecting signs of depressive disorders in speech. Their measures of depression come from standard psychiatric instruments, Quick Inventory of Depressive Symptomatology and Hamilton Depression rating scales. These are linked to measures of speech timing that are much richer than the traditional global measures of speech rate. Results indicate that different speech sounds and sound types behave differently in depression, and may relate to different aspects of depression.
Caponetti, Buscicchio and Castellano propose the use of a more detailed auditory model than that embodied in the widely employed mel frequency cepstral coefficients, for extracting detailed spectral features during emotion recognition. Working from the Lyon cochlear model, the authors demonstrate improvements on a five-class problem from the speech under simulated and actual stress database. Their study also further validates the applicability of long short-term memory recurrent neural networks for classification in emotion and mental state recognition problems. Callejas, Griol and Lopez-Cozar propose a mental state prediction approach that considers both speaker …