MS-Swinformer and DMTL: Multi-scale spatial fusion and dynamic multi-task learning for speech emotion recognition
- Conference Article
12
- 10.23919/apsipaasc55919.2022.9979844
- Nov 7, 2022
Speech emotion recognition (SER) helps achieve better human-computer interaction and has therefore attracted extensive attention from industry and academia. Speech emotion intensity plays an important role in describing emotion, but to the best of our knowledge its effect on emotion recognition has rarely been studied in SER. Previous studies have shown that there is a relationship between speech emotion intensity and emotion category, so the recognition tasks in a multi-task learning setup are expected to benefit each other. We propose a multi-task learning framework with a self-supervised speech representation extractor based on Wav2Vec 2.0 to detect speech emotion and intensity simultaneously in downstream networks. Experimental results show that the multi-task learning framework outperforms SOTA SER models, achieving 5% and 7% SER performance improvements on IEMOCAP and RAVDESS, respectively, thanks to the auxiliary task of emotion intensity recognition.
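The abstract does not give the joint training objective; a common multi-task formulation sums the main-task loss with a weighted auxiliary-task loss. A minimal numpy sketch, in which the class counts, toy logits, and the 0.3 auxiliary weight are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def multitask_loss(emotion_logits, emotion_labels,
                   intensity_logits, intensity_labels, aux_weight=0.3):
    """Joint objective: emotion recognition (main task) plus a weighted
    emotion-intensity recognition loss (auxiliary task)."""
    return (cross_entropy(emotion_logits, emotion_labels)
            + aux_weight * cross_entropy(intensity_logits, intensity_labels))

# Toy batch: 2 utterances, 4 emotion classes, 2 intensity levels.
emo_logits = np.array([[2.0, 0.1, 0.1, 0.1], [0.1, 2.0, 0.1, 0.1]])
inten_logits = np.array([[1.5, 0.0], [0.0, 1.5]])
loss = multitask_loss(emo_logits, np.array([0, 1]),
                      inten_logits, np.array([0, 1]))
```

Setting `aux_weight` to zero recovers the single-task emotion loss, which makes ablating the auxiliary task straightforward.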
- Research Article
17
- 10.1109/access.2022.3189481
- Jan 1, 2022
- IEEE Access
This paper evaluates speech emotion and naturalness recognition using deep learning models with multitask and single-task learning approaches. The emotion model accommodates the valence, arousal, and dominance attributes known as dimensional emotion. The naturalness ratings are labeled on a five-point scale, as are the dimensional emotion attributes. Multitask learning predicts both dimensional emotion (the main task) and naturalness scores (an auxiliary task) simultaneously; single-task learning predicts either dimensional emotion (valence, arousal, and dominance) or the naturalness score independently. The results with multitask learning show improvement over previous single-task studies for both dimensional emotion recognition and naturalness prediction. Within this study, single-task learning still shows superiority over multitask learning for naturalness recognition. Scatter plots of emotion and naturalness prediction scores against the true labels in multitask learning expose a limitation of the model: it fails to predict the lowest and highest scores. The low naturalness prediction score in this study is possibly due to the small number of unnatural speech samples, since the MSP-IMPROV dataset promotes natural speech. The finding that jointly predicting naturalness with emotion helps improve emotion recognition performance may be incorporated into emotion recognition models in future work.
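The abstract does not name its evaluation metric, but for dimensional (valence/arousal/dominance) regression the concordance correlation coefficient (CCC) is the customary choice; the sketch below is a general-purpose illustration with made-up scores, not the paper's exact protocol:

```python
import numpy as np

def ccc(pred, true):
    """Concordance correlation coefficient: agreement between predicted
    and gold continuous scores (1 = perfect, 0 = none, < 0 = discordant)."""
    pm, tm = pred.mean(), true.mean()
    cov = np.mean((pred - pm) * (true - tm))
    return 2 * cov / (pred.var() + true.var() + (pm - tm) ** 2)

# Toy valence predictions vs. gold labels (illustrative values).
valence_pred = np.array([0.2, 0.4, 0.6, 0.8])
valence_true = np.array([0.25, 0.5, 0.55, 0.7])
score = ccc(valence_pred, valence_true)
```

Unlike plain Pearson correlation, CCC also penalizes mean and scale offsets, which matters when a model systematically avoids extreme scores, as observed in this study.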
- Research Article
11
- 10.1186/s13293-024-00589-0
- Feb 13, 2024
- Biology of Sex Differences
Background: Major depressive disorder (MDD) is a recurring affective disorder that is two times more prevalent in females than males. Evidence supports immune system dysfunction as a major contributing factor to MDD, notably in a sexually dimorphic manner. Nuclear factor erythroid 2-related factor 2 (Nrf2), a regulator of antioxidant signalling during inflammation, is dysregulated in many chronic inflammatory disorders; however, its role in depression and the associated sex differences have yet to be explored. Here, we investigated the sex-specific antidepressant and cognitive effects of the potent Nrf2 activator dimethyl fumarate (DMF), as well as the associated gene expression profiles. Methods: Male and female rats were treated with vehicle or DMF (25 mg/kg) whilst subjected to 8 weeks of chronic unpredictable stress. The effect of DMF treatment on stress-induced depression- and anxiety-like behaviours, as well as deficits in recognition and spatial learning and memory, was then assessed. Sex differences in hippocampal (HIP) gene expression responses were also evaluated. Results: DMF treatment during stress exposure had antidepressant effects in male but not female rats, with no anxiolytic effects in either sex. Recognition learning and memory and spatial learning and memory were impaired in chronically stressed males and females, respectively, and DMF treatment rescued these deficits. Further, chronic stress elicited sex-specific alterations in HIP gene expression, many of which were normalized in animals treated with DMF. Of note, most of the differentially expressed genes in males normalized by DMF were related to antioxidant, inflammatory or immune responses. Conclusions: Collectively, these findings may support a greater role of immune processes in males than females in a rodent model of depression. This suggests that pharmacotherapies that target Nrf2 have the potential to be an effective sex-specific treatment for depression.
- Research Article
- 10.3390/electronics14050844
- Feb 21, 2025
- Electronics
In recent years, substantial research has focused on emotion recognition using multi-stream speech representations. In existing multi-stream speech emotion recognition (SER) approaches, effectively extracting and fusing speech features is crucial. To overcome the bottleneck in SER caused by the fusion of inter-feature information, including challenges like modeling complex feature relations and the inefficiency of fusion methods, this paper proposes an SER framework based on multi-task learning, named AFEA-Net. The framework consists of a speech emotion alignment learning (SEAL) module, an acoustic feature excitation-and-aggregation (AFEA) mechanism, and a continuity learning strategy. First, SEAL aligns sentiment information between WavLM and Fbank features. Then, we design an acoustic feature excitation-and-aggregation mechanism to adaptively calibrate and merge the two features. Furthermore, we introduce a continuity learning strategy to explore the distinctiveness and complementarity of dual-stream features from intra- and inter-speech perspectives. Experimental results on the publicly available IEMOCAP and RAVDESS emotion datasets show that the proposed approach outperforms state-of-the-art SER approaches. Specifically, it achieves 75.1% WA, 75.3% UAR, 76% precision, and 75.4% F1-score on IEMOCAP, and 80.3%, 80.6%, 80.8%, and 80.4% on RAVDESS, respectively.
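The abstract does not detail the AFEA mechanism; a squeeze-and-excitation-style channel gate is one plausible reading of "adaptively calibrate and merge". A hypothetical numpy sketch, assuming both streams have already been projected to a common (frames × channels) shape and using random matrices in place of learned gating parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def excite_and_aggregate(wavlm_feat, fbank_feat, w1, w2):
    """Hypothetical excitation-and-aggregation: squeeze each stream over
    time, excite a per-channel gate, and aggregate the recalibrated
    streams by summation (an assumption, not the paper's exact module)."""
    gate1 = sigmoid(wavlm_feat.mean(axis=0) @ w1)  # (C,) gate for stream 1
    gate2 = sigmoid(fbank_feat.mean(axis=0) @ w2)  # (C,) gate for stream 2
    return wavlm_feat * gate1 + fbank_feat * gate2  # fused (T, C) features

rng = np.random.default_rng(0)
T, C = 10, 8  # frames x channels, illustrative sizes
wavlm = rng.normal(size=(T, C))
fbank = rng.normal(size=(T, C))
fused = excite_and_aggregate(wavlm, fbank,
                             rng.normal(size=(C, C)), rng.normal(size=(C, C)))
```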
- Research Article
1
- 10.3390/jimaging11080273
- Aug 14, 2025
- Journal of Imaging
Emotion recognition in speech is essential for enhancing human–computer interaction (HCI) systems. Despite progress in Bangla speech emotion recognition, challenges remain, including low accuracy, speaker dependency, and poor generalization across emotional expressions. Previous approaches often rely on traditional machine learning or basic deep learning models, struggling with robustness and accuracy in noisy or varied data. In this study, we propose a novel multi-stream deep learning feature fusion approach for Bangla speech emotion recognition, addressing the limitations of existing methods. Our approach begins with various data augmentation techniques applied to the training dataset, enhancing the model's robustness and generalization. We then extract a comprehensive set of handcrafted features, including Zero-Crossing Rate (ZCR), chromagram, spectral centroid, spectral roll-off, spectral contrast, spectral flatness, Mel-Frequency Cepstral Coefficients (MFCCs), Root Mean Square (RMS) energy, and Mel-spectrogram. Although these features are used as 1D numerical vectors, some of them are computed from time–frequency representations (e.g., chromagram, Mel-spectrogram) that can themselves be depicted as images, which is conceptually close to imaging-based analysis. These features capture key characteristics of the speech signal, providing valuable insights into the emotional content. Subsequently, we employ a multi-stream deep learning architecture to automatically learn complex, hierarchical representations of the speech signal. This architecture consists of three distinct streams: the first stream uses 1D convolutional neural networks (1D CNNs), the second integrates a 1D CNN with Long Short-Term Memory (LSTM), and the third combines 1D CNNs with bidirectional LSTM (Bi-LSTM). These models capture intricate emotional nuances that handcrafted features alone may not fully represent.
For each of these models, we generate predicted scores and then employ ensemble learning with a soft voting technique to produce the final prediction. This fusion of handcrafted features, deep learning-derived features, and ensemble voting enhances the accuracy and robustness of emotion identification across multiple datasets. Our method demonstrates the effectiveness of combining various learning models to improve emotion recognition in Bangla speech, providing a more comprehensive solution compared with existing methods. We utilize three primary datasets—SUBESCO, BanglaSER, and a merged version of both—as well as two external datasets, RAVDESS and EMODB, to assess the performance of our models. Our method achieves impressive results with accuracies of 92.90%, 85.20%, 90.63%, 67.71%, and 69.25% for the SUBESCO, BanglaSER, merged SUBESCO and BanglaSER, RAVDESS, and EMODB datasets, respectively. These results demonstrate the effectiveness of combining handcrafted features with deep learning-based features through ensemble learning for robust emotion recognition in Bangla speech.
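The soft-voting step described above can be sketched directly: the per-class score vectors from the three streams are averaged and the argmax taken (the toy scores below are illustrative, not from the paper):

```python
import numpy as np

def soft_vote(stream_scores):
    """Soft voting: average the per-class probability vectors from the
    individual streams, then take the argmax as the ensemble label."""
    avg = np.mean(stream_scores, axis=0)  # (streams, samples, classes) -> (samples, classes)
    return avg.argmax(axis=-1)

# Illustrative per-class scores from the three streams for one utterance.
cnn_scores        = np.array([[0.6, 0.3, 0.1]])
cnn_lstm_scores   = np.array([[0.2, 0.7, 0.1]])
cnn_bilstm_scores = np.array([[0.5, 0.4, 0.1]])
pred = soft_vote([cnn_scores, cnn_lstm_scores, cnn_bilstm_scores])
```

Note that soft voting weighs confidence, not just votes: here two streams individually prefer class 0, but the second stream's confident score for class 1 tips the averaged probabilities (0.433 vs. 0.467) to class 1, which a hard majority vote would miss.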
- Research Article
20
- 10.1080/00207454.2020.1830086
- Jan 18, 2021
- International Journal of Neuroscience
Purpose/Aim: Infection and inflammation are important pathological mechanisms underlying neurodegenerative disorders and altered behavioral outcomes, including learning and memory. The present study was designed to examine the curative and preventive effects of agmatine in lipopolysaccharide (LPS)-induced learning and memory impairment in mice. Materials and methods: Learning and memory functions in animals were evaluated using the novel object recognition (NOR) and Morris water maze (MWM) tests. Following 7 days of LPS administration, animals were subjected to the NOR test on Day 8 and the MWM test on Days 9 to 13 for the assessment of recognition and spatial learning and memory, respectively. Results: LPS administration produced significant deficits in recognition and spatial memory in mice after seven days of administration. In LPS pre-treated mice, agmatine treatment on Day 8 resulted in increased exploration of the novel object. Agmatine treatment (Days 8-12) in mice reduced escape latency and time spent in the target quadrant (probe trial) in the MWM test. Likewise, co-administration of agmatine with LPS in mice for 7 days produced a higher discrimination index in the NOR test on Day 8, and decreased escape latency and time spent in the target quadrant in the MWM test on Days 9-13, compared to the LPS control group. Conclusion: The results imply protective and curative effects of agmatine against LPS-induced loss of memory functions in experimental animals. Highlights: Subchronic, but not acute, lipopolysaccharide induces memory deficits. Lipopolysaccharide impairs recognition and spatial memory in mice. Agmatine prevents lipopolysaccharide-induced loss of memory. Agmatine reverses lipopolysaccharide-induced deficits in learning and memory.
- Research Article
20
- 10.1177/2059204318762650
- Jan 1, 2018
- Music & Science
The acoustic cues that convey emotion in speech are similar to those that convey emotion in music, and recognition of emotion in both types of cue recruits overlapping networks in the brain. Given the similarities between music and speech prosody, developmental research is uniquely positioned to determine whether recognition of these cues develops in parallel. In the present study, we asked 60 children aged 6 to 11 years, and 51 university students, to judge the emotions of 10 musical excerpts, 10 inflected speech clips, and 10 affect burst clips. We presented stimuli intended to convey happiness, sadness, anger, fear, and pride. Each emotion was presented twice per type of stimulus. We found that recognition of emotions in music and speech developed in parallel, and that adult levels of recognition develop later for these stimuli than for affect bursts. We also found that sad stimuli were most easily recognised, followed by happiness, fear, and then anger. In addition, we found that recognition of emotion in speech and affect bursts can predict emotion recognition in music stimuli independently of age and musical training. Finally, although proud speech and affect bursts were not well recognised, children aged eight years and older showed adult-like responses in recognition of proud music.
- Conference Article
3
- 10.1109/icdarw.2019.00021
- Sep 1, 2019
Face recognition of caricatures falls far short of the performance achieved on visual (photographic) images. The challenge lies in the extreme non-rigid distortions of caricatures, introduced by exaggerating facial features to strengthen character. In this paper, we propose dynamic multi-task learning based on deep CNNs for cross-modal caricature-visual face recognition. Instead of conventional multi-task learning with fixed task weights, the proposed dynamic multi-task learning updates the task weights according to the importance of the tasks, which enables training to focus on the hard task instead of getting stuck overtraining the easy task. The experimental results demonstrate the effectiveness of dynamic multi-task learning for caricature-visual face recognition. The performance evaluated on the CaVI and WebCaricature datasets shows superiority over state-of-the-art methods. The implementation code is available here.
- Research Article
30
- 10.1371/journal.pone.0220386
- Aug 15, 2019
- PLOS ONE
Emotion recognition plays an important role in human-computer interaction. Many studies have focused on speech emotion recognition using several classifiers and feature extraction methods. The majority of such studies, however, address the problem of speech emotion recognition considering emotions solely from the perspective of a single language. In contrast, the current study extends monolingual speech emotion recognition to also cover the case of emotions expressed in several languages that are simultaneously recognized by a complete system. To address this issue, a method providing an effective and powerful solution to bilingual speech emotion recognition is proposed and evaluated. The proposed method is based on a two-pass classification scheme consisting of spoken language identification and speech emotion recognition. In the first pass, the language spoken is identified; in the second pass, emotion recognition is conducted using the emotion models of the identified language. Based on deep learning and the i-vector paradigm, bilingual emotion recognition experiments were conducted using the state-of-the-art English IEMOCAP (four emotions) and German FAU Aibo (five emotions) corpora. Two classifiers along with i-vector features were used and compared, namely, fully connected deep neural networks (DNN) and convolutional neural networks (CNN). In the case of the DNN, 64.0% and 61.14% unweighted average recalls (UARs) were obtained using the IEMOCAP and FAU Aibo corpora, respectively. When using the CNN, 62.0% and 59.8% UARs were achieved on the IEMOCAP and FAU Aibo corpora, respectively. These results are very promising, and superior to those obtained in similar studies on multilingual or even monolingual speech emotion recognition. Furthermore, an additional baseline approach for bilingual speech emotion recognition was implemented and evaluated.
In the baseline approach, six common emotions were considered, and bilingual emotion models were created, trained on data from the two languages. In this case, 51.2% and 51.5% UARs for six emotions were obtained using DNN and CNN, respectively. The results using the baseline method were reasonable and promising, showing the effectiveness of using i-vectors and deep learning in bilingual speech emotion recognition. On the other hand, the proposed two-pass method based on language identification showed significantly superior performance. Furthermore, the current study was extended to also deal with multilingual speech emotion recognition using corpora collected under similar conditions. Specifically, the English IEMOCAP, the German Emo-DB, and a Japanese corpus were used to recognize four emotions based on the proposed two-pass method. The results obtained were very promising, and the differences in UAR were not statistically significant compared to the monolingual classifiers.
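The two-pass scheme above can be sketched with stub models; the real first pass is a spoken-language identifier and the second pass an i-vector + DNN/CNN emotion classifier, so the lambdas and the `"en:"` prefix convention below are placeholders, not the paper's implementation:

```python
# Two-pass bilingual SER: pass 1 identifies the language, pass 2 applies
# that language's emotion model.
def two_pass_recognize(utterance, identify_language, emotion_models):
    lang = identify_language(utterance)       # pass 1: spoken language ID
    return emotion_models[lang](utterance)    # pass 2: language-specific SER

# Stub classifiers standing in for the trained models.
identify_language = lambda u: "en" if u.startswith("en:") else "de"
emotion_models = {
    "en": lambda u: "anger",    # stand-in for the IEMOCAP-trained model
    "de": lambda u: "neutral",  # stand-in for the FAU Aibo-trained model
}
```

The design advantage over the single-pass baseline is that each second-pass model only ever sees utterances of the language it was trained on, at the cost of propagating any first-pass language-identification errors.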
- Research Article
39
- 10.1007/s11357-016-9947-5
- Sep 9, 2016
- AGE
Age-related cognitive decline has been associated with changes in endogenous hormones and epigenetic modification of chromatin, including histone acetylation. Developmental exposure to endocrine disrupting chemicals, such as bisphenol-A (BPA) that produces endocrine disruption and epigenetic changes, may be a risk factor for accelerating cognitive deficits during aging. Thus, we exposed CD-1 mice to BPA (0, 1, and 100 mg/l BPA in the drinking water) orally during puberty (from postnatal days 28 to 56) and investigated whether pubertal BPA exposure exacerbates the age-related impairment of spatial cognition in old age (18 months old) and whether serum sex and thyroid hormones or hippocampal histone acetylation (H3K9ac and H4K8ac) are associated with cognitive effects. A young control group (6 months old) was added to analyze the age effect. Results showed untreated aged mice had marked decline of spatial learning and memory in the novel location recognition and radial six-arm water maze tasks, with decreased levels of these hormones and hippocampal H3K9ac and H4K8ac compared to young controls. The BPA treatment exacerbated age-related spatial cognitive impairment and accelerated the reduction of free thyroxine (FT4), H3K9ac, and H4K8ac, and the 100 mg/l BPA group showed more significant impact. Additionally, correlation analyses revealed that lower levels of FT4, H3K9ac, and H4K8ac were accompanied by decreased spatial memory abilities. We concluded that accelerated reduction of serum FT4 and hippocampal H3K9ac and H4K8ac might be linked to exacerbation of age-related spatial cognitive impairment due to pubertal BPA exposure.
- Research Article
2
- 10.1007/s10032-021-00364-6
- Mar 16, 2021
- International Journal on Document Analysis and Recognition (IJDAR)
Face recognition of realistic visual images (e.g., photos) has been well studied and has made significant progress in the past decade. However, face recognition between realistic visual images/photos and caricatures is still a challenging problem. Unlike photos, the different artistic styles of caricatures introduce extreme non-rigid distortions. The great representational gap between the photo and caricature modalities is a big challenge for photo-caricature face recognition. In this paper, we propose to conduct cross-modal photo-caricature face recognition via multi-task learning, which can learn the features of different modalities with different tasks. Instead of manually setting the task weights as in conventional multi-task learning, this work proposes a dynamic weights learning module which can automatically generate/learn task weights according to the training importance of tasks. The learned task weights enable the network to focus on training the hard tasks instead of being stuck in the overtraining of easy tasks. The experimental results demonstrate the effectiveness of the proposed dynamic multi-task learning for cross-modal photo-caricature face recognition. The performance on the CaVI and WebCaricature datasets shows superiority over state-of-the-art methods. The implementation code is provided here. ( https://github.com/hengxyz/cari-visual-recognition-via-multitask-learning.git ).
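The paper learns its task weights with a dedicated module; as a simplified illustration of the underlying idea, that the harder task (larger current loss) should receive the larger weight, one can normalize the task losses with a softmax. This closed-form rule and its temperature parameter are assumptions for illustration, not the paper's formulation:

```python
import numpy as np

def dynamic_task_weights(task_losses, temperature=1.0):
    """Illustrative dynamic weighting: softmax over the current task
    losses, so the harder task (larger loss) gets the larger weight."""
    z = np.asarray(task_losses, dtype=float) / temperature
    z = z - z.max()           # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()        # weights sum to 1

# Suppose the caricature task currently has loss 2.0, the photo task 1.0.
weights = dynamic_task_weights([2.0, 1.0])
```

Raising the temperature flattens the weights toward uniform, recovering fixed equal weighting as a limiting case.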
- Research Article
- 10.1504/ijwmc.2021.10035664
- Jan 1, 2021
- International Journal of Wireless and Mobile Computing
To address the limitations of feature representation and network performance in cross-language emotional speech recognition, this paper proposes a multi-input, multi-model fusion framework. First, four kinds of emotional speech shared by four languages are selected as the experimental samples. Second, the affective features of three different modes of multi-lingual emotional speech signals are combined with an SVM and two deep neural networks (MobileNet26 and ResNet38) to form the basic multi-input, multi-model fusion framework, in which each feature map in the deep neural network models undergoes both global maximum pooling and global average pooling so as to capture different features and double the diversity of the model. Finally, comparative experiments show that the multi-model fusion framework distinguishes the emotional differences of multiple languages more effectively than a single network model. At the same time, by learning from high-resource languages, transfer learning can be achieved for emotional speech recognition in low-resource languages, effectively increasing the learning ability of the model.
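The dual global pooling described above, assuming a feature map laid out as (height, width, channels), can be sketched as:

```python
import numpy as np

def dual_global_pool(feature_map):
    """Concatenate global maximum pooling and global average pooling of
    a conv feature map, doubling the pooled descriptor as described."""
    gmp = feature_map.max(axis=(0, 1))   # (C,) strongest activation per channel
    gap = feature_map.mean(axis=(0, 1))  # (C,) average activation per channel
    return np.concatenate([gmp, gap])    # (2C,) combined descriptor

fm = np.arange(24, dtype=float).reshape(2, 3, 4)  # toy (H, W, C) feature map
descriptor = dual_global_pool(fm)
```

Max pooling keeps peak responses (salient bursts), average pooling keeps overall activation levels; concatenating the two gives the classifier both views of each channel.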
- Book Chapter
- 10.1007/978-3-319-24033-6_11
- Jan 1, 2015
Speech signals are non-stationary processes that change in time and frequency. The structure of a speech signal is also affected by the presence of several paralinguistic phenomena such as emotions, pathologies, and cognitive impairments, among others. Non-stationarity can be modeled using several parametric techniques. A novel approach based on time-dependent auto-regressive moving average (TARMA) models is proposed here to model the non-stationarity of speech signals. The model is tested on the recognition of fear-type emotions in speech. The proposed approach is applied to model syllables and unvoiced segments extracted from recordings of the Berlin and eNTERFACE'05 databases. The results indicate that TARMA models can be used for the automatic recognition of emotions in speech.
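A full TARMA estimator is more involved; as a rough illustration of time-dependent AR modeling, the sketch below fits AR(2) coefficients by least squares in a sliding window so they can drift over time. The window length, model order, synthetic signal, and the omission of the moving-average part are all simplifying assumptions:

```python
import numpy as np

def sliding_ar2(x, win=50):
    """Rough stand-in for time-dependent AR modeling: fit AR(2)
    coefficients inside a sliding window, letting them drift over time.
    (A full TARMA model also has a time-varying MA part, omitted here.)"""
    coeffs = []
    for t in range(win, len(x)):
        seg = x[t - win:t]
        X = np.column_stack([seg[1:-1], seg[:-2]])  # lag-1 and lag-2 regressors
        y = seg[2:]
        a, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares AR fit
        coeffs.append(a)
    return np.array(coeffs)  # (len(x) - win, 2) coefficient trajectory

# Synthetic AR(2) signal with known coefficients 0.5 and -0.2.
rng = np.random.default_rng(0)
x = np.zeros(200)
for t in range(2, 200):
    x[t] = 0.5 * x[t - 1] - 0.2 * x[t - 2] + rng.normal(scale=0.1)
trajectory = sliding_ar2(x)
```

For emotion recognition, the coefficient trajectories (rather than a single stationary parameter set) would serve as features summarizing how the signal's dynamics evolve.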
- Research Article
14
- 10.1016/j.iswa.2024.200351
- Mar 11, 2024
- Intelligent Systems with Applications
In-depth investigation of speech emotion recognition studies from past to present – The importance of emotion recognition from speech signal for AI –