Fusion Based AER System Using Deep Learning Approach for Amplitude and Frequency Analysis
Automatic emotion recognition from speech (AERS) systems based on acoustic analysis reveal that some emotional classes remain ambiguous. This study employed an alternative method aimed at providing a deeper understanding of the amplitude-frequency characteristics of various emotions, in order to aid the development of more effective AER classification approaches in the near term. The study was undertaken by converting narrow 20 ms frames of speech into RGB or grey-scale spectrogram images. These features were used to fine-tune a feature selection system that had previously been trained to recognise emotions. Spectrograms are rendered on two different spectral scales, linear and Mel, and an inductive approach is used to gain insight into the amplitude and frequency features of the various emotional classes. We propose a two-channel deep fusion network model for the efficient categorization of these images. Linear and Mel spectrograms are acquired from the speech signal, which is processed in the frequency domain before being input to a deep neural network. The proposed AlexNet-based model, with five convolutional layers and two fully connected layers, acquires the most salient features from spectrogram images plotted on the amplitude-frequency scale. The approach is compared with the state of the art on a benchmark dataset (EMO-DB). RGB and saliency images fed to a pre-trained AlexNet and tested on both the EMO-DB and Telugu datasets achieve an accuracy of 72.18%, while fused image features require fewer computations and reach an accuracy of 75.12%. The results show that transfer learning predicts more efficiently than a fine-tuned network. When tested on the EMO-DB dataset, the proposed system adequately learns discriminant features from speech spectrograms and outperforms many state-of-the-art techniques.
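A minimal sketch of the kind of pipeline this abstract describes, not the authors' implementation: linear and Mel spectrogram images are generated from a speech file with librosa and scored with a pre-trained AlexNet whose classifier head is replaced for emotion classes. The file path, frame settings, image size, and class count are illustrative assumptions.

```python
# Sketch only: linear/Mel spectrogram images -> pre-trained AlexNet (transfer learning).
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
import torch
from torchvision import models, transforms
from PIL import Image

def save_spectrograms(wav_path, sr=16000, frame_ms=20):
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(sr * frame_ms / 1000)            # ~20 ms analysis frames
    hop = n_fft // 2
    # Linear-frequency spectrogram (dB)
    lin = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)), ref=np.max)
    # Mel-frequency spectrogram (dB)
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop), ref=np.max)
    paths = []
    for name, spec in [("linear.png", lin), ("mel.png", mel)]:
        plt.figure(figsize=(2.27, 2.27), dpi=100)     # roughly AlexNet-sized image
        librosa.display.specshow(spec, sr=sr, hop_length=hop)
        plt.axis("off")
        plt.savefig(name, bbox_inches="tight", pad_inches=0)
        plt.close()
        paths.append(name)
    return paths

# Transfer learning: keep ImageNet features, replace only the final layer.
alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
alexnet.classifier[6] = torch.nn.Linear(4096, 7)      # e.g. 7 EMO-DB emotion classes (assumed)
prep = transforms.Compose([transforms.Resize((227, 227)), transforms.ToTensor()])
for path in save_spectrograms("sample.wav"):          # "sample.wav" is a placeholder
    x = prep(Image.open(path).convert("RGB")).unsqueeze(0)
    logits = alexnet(x)                               # per-channel emotion scores
```

In a two-channel fusion setup along these lines, the linear- and Mel-scale scores (or intermediate features) would then be combined before the final decision.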
- Research Article
1
- 10.52228/jrub.2023-36-2-10
- Dec 31, 2023
- Journal of Ravishankar University (PART-B)
Automatic Speech Emotion Recognition (ASER) is a state-of-the-art application in artificial intelligence. Speech recognition intelligence is employed in various applications such as digital assistance, security, and other human-machine interactive products. In the present work, three open-source acoustic datasets, namely SAVEE, RAVDESS, and EmoDB, have been utilized (Haq et al., 2008, Livingstone et al., 2005, Burkhardt et al., 2005). From these datasets, six emotions, namely anger, disgust, fear, happy, neutral, and sad, are selected for automatic speech emotion recognition. Various types of algorithms have already been reported for extracting emotional content from acoustic signals. This work proposes a time-frequency (t-f) image-based multiclass speech emotion classification model for the six emotions mentioned above. The proposed model extracts 472 grayscale image features from the t-f images of speech signals. The t-f image is a two-dimensional visual representation of the signal's time and frequency components, with differing colors showing its amplitude. An artificial neural network-based multiclass machine learning approach is used to classify the selected emotions. The experimental results show that average classification accuracies (CA) of 88.6%, 85.5%, and 93.56% are achieved for the above-mentioned emotions using the SAVEE, RAVDESS, and EmoDB datasets, respectively. Also, an average CA of 83.44% has been achieved for the combination of all three datasets. The maximum previously reported average CA using spectrograms for the SAVEE, RAVDESS, and EmoDB datasets is 87.8%, 79.5%, and 83.4%, respectively (Wani et al., 2020, Mustaqeem and Kwon, 2019, Badshah et al., 2017). The proposed t-f image-based classification model shows an improvement in average CA of 0.91%, 7.54%, and 12.18% for the SAVEE, RAVDESS, and EmoDB datasets, respectively. This study can be helpful in human-computer interface applications to detect emotions precisely from acoustic signals.
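A compact sketch of this style of approach, under stated assumptions rather than the paper's exact 472-feature set: a grayscale t-f image is computed from the speech signal, simple per-band image statistics are extracted, and a small artificial neural network is trained on the resulting vectors. The frame sizes, band count, and network shape are illustrative.

```python
# Sketch only: t-f image statistics -> ANN (MLP) emotion classifier.
import numpy as np
from scipy.signal import spectrogram
from sklearn.neural_network import MLPClassifier

def tf_image_features(y, sr):
    f, t, S = spectrogram(y, fs=sr, nperseg=512, noverlap=256)
    img = 10 * np.log10(S + 1e-10)                    # grayscale t-f image (dB)
    feats = []
    for band in np.array_split(img, 8, axis=0):       # coarse frequency bands
        feats += [band.mean(), band.std(), band.max(), np.median(band)]
    return np.array(feats)

# signals: list of 1-D speech arrays, labels: emotion indices (0..5)
def train_ann(signals, labels, sr=16000):
    X = np.vstack([tf_image_features(x, sr) for x in signals])
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
    return clf.fit(X, labels)
```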
- Research Article
41
- 10.1016/j.csl.2012.11.003
- Dec 20, 2012
- Computer Speech & Language
Modeling phonetic pattern variability in favor of the creation of robust emotion classifiers for real-life applications
- Research Article
9
- 10.3390/electronics13142689
- Jul 10, 2024
- Electronics
Speech emotion recognition (SER) plays an important role in human-computer interaction (HCI) technology and has a wide range of application scenarios in medicine, psychotherapy, and other fields. In recent years, with the development of deep learning, many researchers have combined feature extraction technology with deep learning technology to extract more discriminative emotional information. However, a single speech emotion classification task makes it difficult to effectively utilize feature information, resulting in feature redundancy. Therefore, this paper uses speech feature enhancement (SFE) as an auxiliary task to provide additional information for the SER task. This paper combines Long Short-Term Memory Networks (LSTM) with soft decision trees and proposes a multi-task learning framework based on a decision tree structure. Specifically, it trains the LSTM network by computing the distances of features at different leaf nodes in the soft decision tree, thereby achieving enhanced speech feature representation. The results show that the algorithm achieves 85.6% accuracy on the EMO-DB dataset and 81.3% accuracy on the CASIA dataset. This represents an improvement of 11.8% over the baseline on the EMO-DB dataset and 14.9% on the CASIA dataset, proving the effectiveness of the method. Additionally, we conducted cross-database experiments, real-time performance analysis, and noise environment analysis to validate the robustness and practicality of our method. The additional analyses further demonstrate that our approach performs reliably across different databases, maintains real-time processing capabilities, and is robust to noisy environments.
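A simplified multi-task sketch of the idea described above, not the paper's model: a shared LSTM encoder feeds an emotion-classification head and an auxiliary feature-enhancement head, and the two losses are combined. The soft-decision-tree distance loss from the abstract is approximated here by a plain reconstruction loss; all dimensions and weights are assumptions.

```python
# Sketch only: shared LSTM encoder with SER (main) and SFE (auxiliary) heads.
import torch
import torch.nn as nn

class MultiTaskSER(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, n_emotions=7):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.emotion_head = nn.Linear(hidden, n_emotions)    # main SER task
        self.enhance_head = nn.Linear(hidden, feat_dim)       # auxiliary SFE task

    def forward(self, x):                                     # x: (batch, time, feat_dim)
        h, _ = self.encoder(x)
        emotion_logits = self.emotion_head(h[:, -1])          # last time step summary
        enhanced = self.enhance_head(h)                       # frame-wise enhanced features
        return emotion_logits, enhanced

model = MultiTaskSER()
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

noisy = torch.randn(8, 100, 40)       # toy noisy feature sequences
clean = torch.randn(8, 100, 40)       # toy clean targets for the enhancement task
labels = torch.randint(0, 7, (8,))
logits, enhanced = model(noisy)
loss = ce(logits, labels) + 0.5 * mse(enhanced, clean)        # joint multi-task loss
opt.zero_grad(); loss.backward(); opt.step()
```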
- Research Article
11
- 10.1155/2024/7184018
- Jan 1, 2024
- Applied Computational Intelligence and Soft Computing
Speech emotion recognition (SER) is a challenging task due to the complex and subtle nature of emotions. This study proposes a novel approach for emotion modeling using speech signals by combining the discrete wavelet transform (DWT) with linear prediction coding (LPC). The performance of various classifiers, including support vector machine (SVM), K-Nearest Neighbors (KNN), Efficient Logistic Regression, Naive Bayes, Ensemble, and Neural Network, was evaluated for emotion classification using the EMO-DB dataset. Evaluation metrics such as area under the curve (AUC), average prediction accuracy, and cross-validation techniques were employed. The results indicate that the KNN and SVM classifiers exhibited high accuracy in distinguishing sadness from other emotions. Ensemble methods and Neural Networks also demonstrated strong performance in sadness classification. While the Efficient Logistic Regression and Naive Bayes classifiers showed competitive performance, they were slightly less accurate than the other classifiers. Furthermore, the proposed feature extraction method yielded the highest average accuracy, and its combination with formants or wavelet entropy further improved classification accuracy. On the other hand, Efficient Logistic Regression exhibited the lowest accuracies among the classifiers. The uniqueness of this study is that it investigated a combined feature extraction method and compared it against various other feature combinations, with the aims of improving classifier performance, increasing the effectiveness of the system, and assessing its potential for emotion classification tasks. These findings can guide the selection of appropriate classifiers and feature extraction methods in future research and real-world applications. Further investigations can focus on refining classifiers and exploring additional feature extraction techniques to enhance emotion classification accuracy.
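An illustrative sketch of a DWT + LPC feature pipeline of the kind described above, with classical classifiers on top; the wavelet, decomposition level, LPC order, and classifier settings are assumptions rather than the authors' choices.

```python
# Sketch only: DWT sub-band statistics + LPC coefficients -> SVM / KNN.
import numpy as np
import pywt
import librosa
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def dwt_lpc_features(y, sr=16000, wavelet="db4", level=4, lpc_order=12):
    coeffs = pywt.wavedec(y, wavelet, level=level)               # DWT decomposition
    dwt_feats = [f(c) for c in coeffs
                 for f in (np.mean, np.std, lambda v: np.sum(v ** 2))]
    lpc_feats = librosa.lpc(y, order=lpc_order)[1:]              # drop the leading 1.0
    return np.concatenate([dwt_feats, lpc_feats])

def train_classifiers(signals, labels, sr=16000):
    X = np.vstack([dwt_lpc_features(s, sr) for s in signals])
    svm = SVC(kernel="rbf").fit(X, labels)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X, labels)
    return svm, knn
```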
- Conference Article
4
- 10.1109/icme.2014.6890208
- Jul 1, 2014
There are two main emotion annotation techniques: dimensional and category-based. In order to conduct experiments on emotional data annotated with different techniques, two-class emotion mapping strategies (e.g. high- vs. low-arousal) are commonly used. The affective computing community has not specified the location of an emotionally neutral area in multi-dimensional emotional space (e.g. valence-arousal-dominance (VAD)). Nonetheless, in the current research a neutral state is added to the standard two-class emotion classification task. Within the experiments, a possible location of a neutral arousal region in valence-arousal space was determined. We employed general and phonetic-pattern-dependent emotion classification techniques for cross-corpora experiments. Emotional models were trained on the VAM dataset (dimensional annotation) and evaluated on the EMO-DB dataset (category-based annotation).
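A toy illustration of the mapping problem this abstract raises: continuous arousal annotations are thresholded into high-arousal, neutral, and low-arousal classes so that dimensionally annotated corpora (e.g. VAM) can be compared with category-annotated ones (e.g. EMO-DB). The neutral band and the category-to-arousal table below are illustrative assumptions, not the mapping used in the paper.

```python
# Sketch only: three-class arousal mapping across annotation schemes.
def arousal_class(arousal, neutral_band=(-0.2, 0.2)):
    """Arousal is assumed to be scaled to [-1, 1]; the neutral band is an assumption."""
    low, high = neutral_band
    if arousal < low:
        return "low_arousal"
    if arousal > high:
        return "high_arousal"
    return "neutral"

# Illustrative mapping of categorical EMO-DB labels onto the same three classes.
EMODB_TO_AROUSAL = {
    "anger": "high_arousal", "fear": "high_arousal", "happiness": "high_arousal",
    "sadness": "low_arousal", "boredom": "low_arousal",
    "neutral": "neutral",
}
```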
- Research Article
22
- 10.1016/j.knosys.2023.110814
- Jul 25, 2023
- Knowledge-Based Systems
Cross Corpus Speech Emotion Recognition using transfer learning and attention-based fusion of Wav2Vec2 and prosody features
- Research Article
28
- 10.1016/j.neucom.2024.128177
- Jul 9, 2024
- Neurocomputing
Speech emotion recognition based on multi-feature speed rate and LSTM
- Conference Article
4
- 10.1109/ubmk55850.2022.9919468
- Sep 14, 2022
EEG signals are one of the most fundamental sources used to identify and analyze brain activity. EEG signals can be represented visually as spectrograms, which show a signal's strength over time and frequency. In this study, the signals in an EEG dataset containing ‘positive’, ‘negative’ and ‘neutral’ emotion classes were first classified with a deep learning model; the signals were then transformed into spectrogram images and classified with a proposed convolutional network model as well as with transfer learning (EfficientNet and XceptionNet). Multiclass classification was performed with pre-trained models. The accuracy obtained by classifying the raw EEG signals and the accuracy obtained by classifying the spectrogram images were measured and compared. While higher accuracy values were achieved in the classification of signals with the deep network model, on metrics such as precision and F1-score the classification of images with the proposed convolutional network model achieved much higher performance.
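A minimal sketch of the second stage described here, under assumed parameters (sampling rate, image size, single channel): an EEG channel is converted to a spectrogram image and classified with a frozen EfficientNet backbone plus a new three-class head.

```python
# Sketch only: EEG channel -> spectrogram image -> EfficientNet transfer learning.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import spectrogram
import tensorflow as tf

def eeg_to_spectrogram_image(eeg_channel, fs=128, out_path="eeg_spec.png"):
    f, t, S = spectrogram(eeg_channel, fs=fs, nperseg=fs, noverlap=fs // 2)
    plt.figure(figsize=(2.24, 2.24), dpi=100)
    plt.pcolormesh(t, f, 10 * np.log10(S + 1e-10), shading="auto")
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close()
    return out_path

# Transfer learning: frozen EfficientNetB0 features + new 3-class emotion head.
base = tf.keras.applications.EfficientNetB0(include_top=False, weights="imagenet",
                                            input_shape=(224, 224, 3), pooling="avg")
base.trainable = False
model = tf.keras.Sequential([base, tf.keras.layers.Dense(3, activation="softmax")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```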
- Conference Article
47
- 10.1109/rios.2017.7956455
- Apr 1, 2017
Emotion plays an important role in daily human life and is a significant feature of interaction among people. Because of its adaptive role, it motivates humans to respond quickly to stimuli in their environment, improving their communication, learning and decision-making. With the increasing role of the brain-computer interface (BCI) in interaction between users and computers, automatic emotion recognition has become an interesting research area in the past decade. Emotion recognition can be carried out from facial expression, gesture, speech and text, and the underlying signals can be recorded in several ways, such as electroencephalography (EEG), positron emission tomography (PET), magnetic resonance imaging (MRI), etc. In this work, feature extraction and classification of emotions have been evaluated with different methods to recognize and classify six emotional states (fear, sad, frustrated, happy, pleasant and satisfied) from EEG signals. The results showed that with an appropriate feature extraction method for emotional states, such as the Discrete Wavelet Transform (DWT), and a suitable learner, such as an Artificial Neural Network (ANN), the recognizer system can be accurate.
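A compact sketch of one common DWT + ANN combination for EEG of the kind this abstract points to; the wavelet, decomposition level, and network size are assumptions.

```python
# Sketch only: relative DWT sub-band energies per EEG trial -> small ANN.
import numpy as np
import pywt
from sklearn.neural_network import MLPClassifier

def dwt_band_energies(eeg_channel, wavelet="db4", level=5):
    coeffs = pywt.wavedec(eeg_channel, wavelet, level=level)
    energies = np.array([np.sum(c ** 2) for c in coeffs])
    return energies / energies.sum()              # relative energy per sub-band

def train_eeg_ann(trials, labels):
    X = np.vstack([dwt_band_energies(t) for t in trials])
    return MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0).fit(X, labels)
```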
- Research Article
54
- 10.1016/j.compedu.2024.105111
- Jul 11, 2024
- Computers & Education
Bridging computer and education sciences: A systematic review of automated emotion recognition in online learning environments
- Research Article
44
- 10.3390/electronics12102232
- May 14, 2023
- Electronics
Automatic emotion recognition from electroencephalogram (EEG) signals can be considered the main component of brain–computer interface (BCI) systems. In previous years, many researchers in this direction have presented various algorithms for the automatic classification of emotions from EEG signals and achieved promising results; however, lack of stability, high error, and low accuracy are still considered the central gaps in this research. For this purpose, obtaining a model with the preconditions of stability, high accuracy, and low error is considered essential for the automatic classification of emotions. In this research, a model based on Deep Convolutional Neural Networks (DCNNs) is presented, which can classify three emotions (positive, negative, and neutral) with high reliability from EEG signals recorded under musical stimuli. For this purpose, a comprehensive database of EEG signals was collected while volunteers listened to positive and negative music in order to stimulate the emotional state. The architecture of the proposed model consists of a combination of six convolutional layers and two fully connected layers. In this research, different feature learning and hand-crafted feature selection/extraction algorithms were investigated and compared with each other in order to classify emotions. The proposed model achieved 98% and 96% accuracy for the classification of two classes (positive and negative) and three classes (positive, neutral, and negative) of emotions, respectively, which is very promising compared with the results of previous research. For a fuller evaluation, the proposed model was also investigated in noisy environments; across a wide range of SNRs, the classification accuracy remained greater than 90%. Due to the high performance of the proposed model, it can be used in brain–computer user environments.
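An architecture sketch consistent with the stated layer count (six convolutional layers and two fully connected layers); the channel widths, kernel sizes, and input shape are assumptions, since the abstract does not specify them.

```python
# Sketch only: a 6-conv + 2-FC DCNN for 3-class EEG emotion classification.
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv1d(cin, cout, kernel_size=5, padding=2),
                         nn.BatchNorm1d(cout), nn.ReLU(), nn.MaxPool1d(2))

class EEGEmotionDCNN(nn.Module):
    def __init__(self, n_channels=32, n_classes=3):
        super().__init__()
        widths = [n_channels, 32, 64, 64, 128, 128, 256]
        self.features = nn.Sequential(*[conv_block(widths[i], widths[i + 1]) for i in range(6)])
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                        nn.Linear(256, 64), nn.ReLU(),    # FC layer 1
                                        nn.Linear(64, n_classes))          # FC layer 2

    def forward(self, x):            # x: (batch, eeg_channels, samples)
        return self.classifier(self.features(x))
```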
- Research Article
36
- 10.1016/j.apmr.2003.09.007
- Jul 1, 2004
- Archives of Physical Medicine and Rehabilitation
Pressure mapping in seating: a frequency analysis approach
- Research Article
1
- 10.29284/ijasis.4.2.2018.31-37
- Dec 28, 2018
- International Journal of Advances in Signal and Image Sciences
Facial expression analysis (FEA), or Human Emotion Analysis (HEA), is an essential tool for human-computer interaction. Humans' nonverbal messages are expressed through facial expressions. In this study, an HEA system that classifies seven classes of human emotion (happy, sad, angry, disgust, fear, surprise and neutral) is presented. It uses a Gabor filter for feature extraction and Multiple Instance Learning (MIL) for classification. The Gabor filter analyzes facial images in a localized region to extract specific frequency content in specific directions. The MIL classifier then assigns each image to one of the seven emotions. The evaluation of the HEA system is carried out on the Japanese Female Facial Expression (JAFFE) database. The overall recognition rate of the HEA system using the Gabor and MIL technique is 95%.
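An illustrative sketch of a Gabor front end of the kind described here, as a stand-in for the paper's feature extractor; the kernel size, orientations, and summary statistics are assumptions.

```python
# Sketch only: small Gabor filter bank over a grayscale face image.
import cv2
import numpy as np

def gabor_features(gray_face, ksize=31, sigma=4.0, lambd=10.0, gamma=0.5):
    feats = []
    for theta in np.arange(0, np.pi, np.pi / 4):       # 4 orientations
        kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma)
        response = cv2.filter2D(gray_face, cv2.CV_32F, kernel)
        feats += [response.mean(), response.std()]     # summary per orientation
    return np.array(feats)

# Usage: face = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE); v = gabor_features(face)
```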
- Conference Article
4
- 10.1109/cscloud/edgecom.2019.00-16
- Jun 1, 2019
Emotion detection or recognition from speech is currently a very important area of research with a plethora of applications in day-to-day life. Human communication depends heavily on mood, emotions and feelings. The availability of advanced signal processing techniques and artificial intelligence techniques, such as machine learning architectures (shallow classifiers) and neural network architectures (deep classifiers), has made this domain a booming area of research with increased efficiency and accuracy. This paper empirically analyzes various statistical machine learning algorithms (Naive Bayes, Support Vector Machine, Random Forest) and deep learning algorithms (Convolutional Neural Network, Long Short-Term Memory) on the publicly available EMO-DB dataset for emotion classification into angry, sad, happy, neutral, and other classes. A comparison of the shallow classifiers on the basis of accuracy will help future researchers by providing insight into the field of emotion detection, and the same goes for the comparison between the deep learning techniques.
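A simple sketch of the kind of shallow-classifier comparison described above; the MFCC-based features, cross-validation split, and default classifier settings are assumptions rather than the paper's setup.

```python
# Sketch only: compare shallow classifiers on the same MFCC feature matrix.
import numpy as np
import librosa
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def mfcc_vector(wav_path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def compare_classifiers(wav_paths, labels):
    X = np.vstack([mfcc_vector(p) for p in wav_paths])
    models = {"NaiveBayes": GaussianNB(), "SVM": SVC(), "RandomForest": RandomForestClassifier()}
    return {name: cross_val_score(m, X, labels, cv=5).mean() for name, m in models.items()}
```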
- Research Article
1
- 10.1016/j.compbiomed.2025.110510
- Aug 1, 2025
- Computers in biology and medicine
Feature and classifier-level domain adaptation in DistilHuBERT for cross-corpus speech emotion recognition.