Advancing guitar emotion recognition through audio data augmentation to enhance smart musical instruments


Similar Papers
  • Research Article
  • 10.55041/ijsrem37976
Cross-Modal Harmony: Low-Rank Fusion for Enhanced Artificial Emotion Recognition
  • Oct 16, 2024
  • INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
  • Himanshu Kumar + 1 more

Emotion recognition has become a subject of considerable interest in recent times, owing to its diverse and far-reaching applications in various fields. These applications span from enhancing human-computer interactions to assessing mental health and improving entertainment systems. The proposed study presents a novel approach for emotion recognition by fusing audio and video modalities using low-rank fusion techniques. The proposed methodology leverages the complementary nature of audio and video data in capturing emotional cues. Audio data often encapsulates tone, speech patterns, and vocal nuances, while video data captures facial expressions, body language, and gestures. However, the challenge lies in effectively integrating these two modalities to enhance recognition accuracy. To address this challenge, the method employs low-rank fusion, a dimensionality reduction technique that extracts the most informative features from both modalities while minimizing redundancy. Furthermore, the study presents the implementation of the chosen low-rank fusion algorithm in a real-world emotion recognition system. The results can contribute to advancing the field of emotion recognition by providing a practical and efficient solution for combining audio and video data to achieve more robust and accurate emotion classification. Keywords: Deep Learning; Emotion Recognition; Human-Computer Interaction; Low-Rank Fusion; Multimodal Fusion.
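The abstract does not spell out the fusion algorithm itself, but the core idea of rank-decomposed bilinear fusion can be sketched in a few lines of numpy. All dimensions and weight names below are hypothetical, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: audio/video feature dims, fused output dim, rank.
d_audio, d_video, d_out, rank = 16, 24, 8, 4

# One rank-decomposed factor per modality instead of a full bilinear
# tensor: memory drops from d_audio*d_video*d_out parameters
# to rank*(d_audio+d_video)*d_out.
W_a = rng.standard_normal((rank, d_audio, d_out)) * 0.1
W_v = rng.standard_normal((rank, d_video, d_out)) * 0.1

def low_rank_fusion(h_audio, h_video):
    """Fuse two modality vectors via a rank-constrained bilinear product."""
    # Project each modality through every rank-1 factor: shape (rank, d_out).
    proj_a = np.einsum('a,rao->ro', h_audio, W_a)
    proj_v = np.einsum('v,rvo->ro', h_video, W_v)
    # Element-wise product couples the modalities; summing over the rank
    # index reconstructs the low-rank approximation of the full tensor.
    return (proj_a * proj_v).sum(axis=0)

fused = low_rank_fusion(rng.standard_normal(d_audio),
                        rng.standard_normal(d_video))
print(fused.shape)  # (8,)
```

The rank parameter trades expressiveness for parameter count, which is the efficiency argument low-rank fusion methods usually make.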

  • Research Article
  • Cited by 4
  • 10.3389/fcomp.2023.1039261
Task-specific speech enhancement and data augmentation for improved multimodal emotion recognition under noisy conditions
  • Mar 28, 2023
  • Frontiers in Computer Science
  • Shruti Kshirsagar + 2 more

Automatic emotion recognition (AER) systems are burgeoning, and systems based on either audio, video, text, or physiological signals have emerged. Multimodal systems, in turn, have been shown to improve overall AER accuracy and to provide some robustness against artifacts and missing data. Collecting multiple signal modalities, however, can be very intrusive, time consuming, and expensive. Recent advances in deep learning based speech-to-text and natural language processing systems, however, have enabled the development of reliable multimodal systems based on speech and text while only requiring the collection of audio data. Audio data, however, is extremely sensitive to environmental disturbances, such as additive noise, and thus faces challenges when deployed “in the wild.” To overcome this issue, speech enhancement algorithms have been deployed at the input signal level to improve testing accuracy in noisy conditions. Speech enhancement algorithms can come in different flavors and can be optimized for different tasks (e.g., for human perception vs. machine performance). Data augmentation, in turn, has also been deployed at the model level during training time to improve accuracy in noisy testing conditions. In this paper, we explore the combination of task-specific speech enhancement and data augmentation as a strategy to improve overall multimodal emotion recognition in noisy conditions. We show that AER accuracy under noisy conditions can be improved to levels close to those seen in clean conditions. When compared against a system without speech enhancement or data augmentation, an increase in AER accuracy of 40% was seen in a cross-corpus test, thus showing promising results for “in the wild” AER.

  • Research Article
  • Cited by 59
  • 10.1016/j.engappai.2020.103775
Emotion recognition using speech and neural structured learning to facilitate edge intelligence
  • Jun 24, 2020
  • Engineering Applications of Artificial Intelligence
  • Md Zia Uddin + 1 more

Emotions are quite important in our daily communications, and recent years have witnessed many research efforts to develop reliable emotion recognition systems based on various types of data sources, such as audio and video. Since audio carries no visual information about the human face, emotion analysis based only on audio data is a very challenging task. In this work, a novel emotion recognition approach is proposed based on robust features and machine learning from audio speech. For a person-independent emotion recognition system, audio data is used as input, from which Mel Frequency Cepstral Coefficients (MFCC) are calculated as features. The MFCC features are then processed with discriminant analysis to minimize the inner-class scatter while maximizing the inter-class scatter. The robust discriminant features are then fed to an efficient and fast deep learning approach, Neural Structured Learning (NSL), for emotion training and recognition. The proposed approach of combining MFCC, discriminant analysis, and NSL generated superior recognition rates compared to traditional approaches such as MFCC-DBN, MFCC-CNN, and MFCC-RNN in experiments on an emotion dataset of audio speeches. The system can be adopted in smart environments such as homes or clinics to provide affective healthcare. Since NSL is fast and easy to implement, it can be tried on edge devices with limited datasets collected from edge sensors. Hence, we can push the decision-making step toward where the data resides, rather than conventionally processing data and making decisions far away from the data sources. The proposed approach can be applied in practical applications such as understanding people's emotions in their daily life and detecting stress from the voices of pilots or air traffic controllers in air traffic management systems.
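As a rough illustration of the discriminant-analysis step described above (minimizing within-class scatter while maximizing between-class scatter), here is a minimal numpy sketch of Fisher/linear discriminant analysis on synthetic MFCC-like features; the paper's actual pipeline and parameters are not reproduced here:

```python
import numpy as np

def fisher_projection(X, y, n_components=1):
    """Fisher discriminant analysis: find directions that maximize
    between-class (inter-class) scatter relative to within-class
    (inner-class) scatter."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        S_b += len(Xc) * (diff @ diff.T)
    # Optimal directions are eigenvectors of S_w^{-1} S_b with the
    # largest eigenvalues (pinv guards against a singular S_w).
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_components]]

# Toy 13-dim "MFCC" features for two synthetic emotion classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 13)), rng.normal(2, 1, (50, 13))])
y = np.array([0] * 50 + [1] * 50)
W = fisher_projection(X, y)
print(W.shape)  # (13, 1)
```

Projecting features through `W` before classification is the sense in which the abstract's "discriminant features" are more robust than raw MFCCs.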

  • Research Article
  • Cited by 23
  • 10.3390/app10207239
Learning Better Representations for Audio-Visual Emotion Recognition with Common Information
  • Oct 16, 2020
  • Applied Sciences
  • Fei Ma + 4 more

Audio-visual emotion recognition aims to distinguish human emotional states by integrating the audio and visual data acquired in the expression of emotions. It is crucial for facilitating affect-related human-machine interaction systems by enabling machines to respond intelligently to human emotions. One challenge of this problem is how to efficiently extract feature representations from the audio and visual modalities. Although progress has been made by previous works, most of them ignore the common information between audio and visual data during the feature learning process, which may limit performance since these two modalities are highly correlated in terms of their emotional information. To address this issue, we propose a deep learning approach that efficiently utilizes common information for audio-visual emotion recognition through correlation analysis. Specifically, we design an audio network and a visual network to extract feature representations from audio and visual data respectively, and then employ a fusion network to combine the extracted features for emotion prediction. These neural networks are trained with a joint loss combining: (i) a correlation loss based on Hirschfeld-Gebelein-Rényi (HGR) maximal correlation, which extracts common information between the audio data, visual data, and corresponding emotion labels, and (ii) a classification loss, which extracts discriminative information from each modality for emotion prediction. We further generalize our architecture to the semi-supervised learning scenario. Experimental results on the eNTERFACE’05, BAUM-1s, and RAVDESS datasets show that common information can significantly enhance the stability of features learned from different modalities and improve emotion recognition performance.

  • Research Article
  • Cited by 125
  • 10.1088/1741-2552/abb580
Data augmentation for enhancing EEG-based emotion recognition with deep generative models
  • Oct 1, 2020
  • Journal of Neural Engineering
  • Yun Luo + 3 more


  • Research Article
  • Cited by 6
  • 10.1088/1742-6596/1976/1/012015
Emotion recognition of musical instruments based on convolution long short time memory depth neural network
  • Jul 1, 2021
  • Journal of Physics: Conference Series
  • Jing Wang + 2 more

In this paper, a method of emotion recognition for musical instruments based on a convolutional long short-term memory deep neural network is proposed, and an emotion recognition music database composed of four instrument families is established: keyboard, wind, string, and percussion instruments. The emotional types of these four instrument families are divided into happiness, anger, sadness, and fear. Through a CLDNN-based musical instrument emotion recognition architecture, MFCCs, CNN, and CFS are used for feature extraction and training. The experimental results show that the best classification effect comes from using long short-term memory (LSTM) and deep neural network (DNN) layers to extract and combine the emotional feature sets of musical instruments, which yields the highest accuracy. Considering the dynamic changes of musical features in musical instruments, the model is also used to predict the emotional changes of musical instruments.

  • Research Article
  • 10.1121/1.2723990
Interface device to couple a musical instrument to a computing device to allow a user to play a musical instrument in conjunction with a multimedia presentation
  • Jan 1, 2007
  • The Journal of the Acoustical Society of America
  • John Brinkman

Disclosed is an interface device to couple a musical instrument to a computing device to allow a user to play a musical instrument in conjunction with a multimedia presentation. The interface device comprises a processor, a D/A converter, and a digital audio interface. The computing device creates a processed digital audio signal of the musical instrument based upon an original digitized audio signal of the musical instrument from the interface device. The D/A converter then converts a mixed digital signal of both the processed digital audio signal of the musical instrument and a digital audio file received from the computing device into a mixed analog audio signal. The processor controls the digital audio interface such that the mixed digital signal is transmitted through the D/A converter, and ultimately through an analog sound device, to the user.

  • Research Article
  • Cited by 45
  • 10.1155/2022/7028517
EEG Feature Extraction and Data Augmentation in Emotion Recognition
  • Mar 28, 2022
  • Computational Intelligence and Neuroscience
  • Mahsa Pourhosein Kalashami + 2 more

Emotion recognition is a challenging problem in Brain-Computer Interaction (BCI). The electroencephalogram (EEG) gives unique information about brain activity created by emotional stimuli. This is one of the most substantial advantages of brain signals in comparison to facial expression, tone of voice, or speech in emotion recognition tasks. However, the lack of EEG data and high-dimensional EEG recordings lead to difficulties in building effective classifiers with high accuracy. In this study, data augmentation and feature extraction techniques are proposed to solve the lack-of-data problem and the high dimensionality of the data, respectively. The proposed method is based on deep generative models and a data augmentation strategy called Conditional Wasserstein GAN (CWGAN), which is applied to the extracted features to generate additional EEG features. The DEAP dataset is used to evaluate the effectiveness of the proposed method. Finally, a standard support vector machine and a deep neural network with different tunings were implemented to build effective models. Experimental results show that using the additional augmented data enhances the performance of EEG-based emotion recognition models. Furthermore, the mean classification accuracy after data augmentation increased by 6.5% for valence and 3.0% for arousal.

  • Research Article
  • Cited by 49
  • 10.1007/s11042-019-08397-0
PRATIT: a CNN-based emotion recognition system using histogram equalization and data augmentation
  • Nov 19, 2019
  • Multimedia Tools and Applications
  • Dhara Mungra + 4 more

Emotions are spontaneous feelings that are accompanied by fluctuations in facial muscles, which leads to facial expressions. Categorizing these facial expressions as one of the seven basic emotions - happy, sad, anger, disgust, fear, surprise, and neutral - is the goal of emotion recognition. This is a difficult problem because of the complexity of human expressions, but it is gaining immense popularity due to its vast number of applications, such as predicting behavior. Using deeper architectures has enabled researchers to achieve state-of-the-art performance in emotion recognition. Motivated by the aforementioned discussion, in this paper we propose a model named PRATIT for facial expression recognition that uses specific image preprocessing steps and a Convolutional Neural Network (CNN) model. In PRATIT, preprocessing techniques such as grayscaling, cropping, resizing, and histogram equalization are used to handle variations in the images. CNNs achieve better accuracy with larger datasets, but there are no freely accessible datasets with adequate data for emotion recognition with deep architectures. To handle this issue, we apply data augmentation in PRATIT, which helps in further fine-tuning the model for performance improvement. The paper presents the effects of histogram equalization and data augmentation on the performance of the model. PRATIT, with histogram equalization during image preprocessing and data augmentation, surpasses the state-of-the-art results and achieves a testing accuracy of 78.52%.
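Histogram equalization, one of the preprocessing steps named above, can be sketched with numpy alone (the image here is a synthetic low-contrast example, not the paper's data):

```python
import numpy as np

def equalize_histogram(img):
    """Spread an 8-bit grayscale image's intensities over the full
    0..255 range by mapping each pixel through the normalized
    cumulative histogram (the classic equalization formula)."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()          # first nonzero bin
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[img]

# Low-contrast toy image: intensities squeezed into 100..130.
rng = np.random.default_rng(0)
img = rng.integers(100, 131, size=(32, 32), dtype=np.uint8)
out = equalize_histogram(img)
print(out.min(), out.max())  # → 0 255
```

The lowest intensity present maps to 0 and the highest to 255, which is the contrast stretch that helps the CNN cope with lighting variation.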

  • Conference Article
  • Cited by 8
  • 10.1109/ficloud.2019.00054
The Audio-Visual Arabic Dataset for Natural Emotions
  • Aug 1, 2019
  • Ftoon Abu Shaqra + 2 more

Emotions are a crucial aspect of human life, and researchers have tried to build automatic emotion recognition systems that enable important real-world applications. Psychologists have shown that emotions differ across cultures. Considering this fact, we provide and describe the first audio-visual Arabic emotional dataset, called AVANEmo. In this work we aim to fill the gap between studies of emotion recognition for Arabic content and other languages by providing an Arabic dataset, which is a fundamental part of building an emotion recognition application. Our dataset contains 3000 clips of video and audio data, and it covers six basic emotional labels (Happy, Sad, Angry, Surprise, Disgust, Neutral). We also provide baseline experiments to measure the initial performance of automated audio and visual emotion recognition using the AVANEmo dataset. The best accuracies we achieved were 54.5% and 57.9% using the audio and visual data, respectively. The data will be available for distribution to researchers.

  • Research Article
  • 10.59461/ijitra.v4i1.164
Emotional Speech Recognition using CNN model
  • Mar 29, 2025
  • International Journal of Information Technology, Research and Applications
  • Samyuktha S + 1 more

Speech Emotion Recognition (SER) is a new area of artificial intelligence that deals with recognizing human emotions from speech signals. Emotions are an important aspect of communication, affecting social interactions and decision-making processes. This paper introduces a complete SER system that uses state-of-the-art deep learning methods to recognize emotions like Happy, Sad, Angry, Neutral, Surprise, Calm, Fear, and Disgust. The suggested model uses Mel-Spectrograms, MFCCs, and Chroma features for efficient feature extraction. Convolutional layers are utilized to capture complex patterns in audio data, whereas dropout layers are included to avoid overfitting and promote model generalization. Data augmentation strategies, such as pitch shifting, noise injection, and time-stretching, are adopted to increase model robustness. Despite improvements in SER, issues like the differentiation of closely correlated emotions, dealing with noisy environments, and real-time performance are domains for future work. This paper advances the research area of affective computing by enhancing emotion recognition performance and widening the scope of SER applications in healthcare, virtual assistants, and customer service systems. Keywords: Speech Emotion Recognition, Mel-Spectrogram, MFCCs, Convolutional Neural Networks, Deep Learning, Data Augmentation, Affective Computing.
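The augmentation strategies listed in the abstract above (noise injection, time-stretching) can be illustrated with a numpy-only sketch. These are deliberately naive stand-ins: the linear-resampling stretch below also shifts pitch, whereas a phase-vocoder stretch (e.g. librosa's) preserves it.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_noise(signal, noise_level=0.005):
    """Add white Gaussian noise scaled by the signal's peak amplitude."""
    noise = rng.standard_normal(len(signal))
    return signal + noise_level * np.abs(signal).max() * noise

def time_stretch(signal, rate=1.1):
    """Naive stretch by linear resampling: rate > 1 shortens the clip.
    (This also shifts pitch; a phase vocoder would preserve it.)"""
    n_out = int(len(signal) / rate)
    old_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(old_idx, np.arange(len(signal)), signal)

t = np.linspace(0, 1, 16000, endpoint=False)
clip = np.sin(2 * np.pi * 440 * t)        # 1 s of A4 at 16 kHz
noisy = inject_noise(clip)
fast = time_stretch(clip, rate=1.25)
print(len(fast))  # 12800 samples: 25% shorter
```

Each transform yields a new labeled training example from an existing one, which is how these strategies "increase model robustness" without collecting new recordings.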

  • Book Chapter
  • Cited by 189
  • 10.1007/978-3-319-73600-6_8
Data Augmentation for EEG-Based Emotion Recognition with Deep Convolutional Neural Networks
  • Jan 1, 2018
  • Fang Wang + 4 more

Emotion recognition is the task of recognizing a person’s emotional state. EEG, as a physiological signal, can provide more detailed and complex information for the emotion recognition task. Meanwhile, because EEG cannot be intentionally changed or hidden, EEG-based emotion recognition can achieve more effective and reliable results. Unfortunately, due to the cost of data collection, most EEG datasets contain only a small number of EEG samples. This lack of data makes it difficult to predict emotional states with deep models, which require a large amount of training data. In this paper, we propose to use a simple data augmentation method to address the issue of data shortage in EEG-based emotion recognition. In experiments, we explore the performance of emotion recognition with shallow and deep computational models before and after data augmentation on two standard EEG-based emotion datasets. Our experimental results show that the simple data augmentation method can effectively improve the performance of emotion recognition based on deep models.

  • Research Article
  • Cited by 49
  • 10.1109/access.2021.3068316
Effects of Data Augmentation Method Borderline-SMOTE on Emotion Recognition of EEG Signals Based on Convolutional Neural Network
  • Jan 1, 2021
  • IEEE Access
  • Yu Chen + 2 more

In recent years, with the continuous development of artificial intelligence and brain-computer interface technology, emotion recognition based on physiological signals, especially electroencephalogram signals, has become a popular research topic and attracted wide attention. However, the imbalance of the data sets themselves, the extraction of affective features from electroencephalogram signals, and the design of classifiers with excellent performance pose great challenges. Motivated by the outstanding performance of deep learning approaches in pattern recognition tasks, we propose a method based on a convolutional neural network with the data augmentation method Borderline-SMOTE (synthetic minority oversampling technique). First, we obtain 32-channel electroencephalogram signals from the DEAP data set, a standard data set for emotion recognition. Then, after data pre-processing, we extract frequency-domain features and apply the data augmentation algorithm above to obtain more balanced data. Finally, we train a one-dimensional convolutional neural network for three-class classification on the two emotional dimensions, valence and arousal. The proposed method is compared with traditional machine learning methods and existing methods by other researchers and is shown to be effective in emotion recognition, with average accuracy rates over 32 subjects on valence and arousal of 97.47% and 97.76%, respectively. Compared with other existing methods, the proposed method with the data augmentation algorithm Borderline-SMOTE shows an advantage in affective emotion recognition over the same method without it.
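A compact numpy sketch of the Borderline-SMOTE idea referenced above: oversample only those minority samples that lie near the class boundary ("danger" points), interpolating each toward a random minority neighbor. The thresholds and the empty-danger-set fallback are simplifications for illustration, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def borderline_smote(X, y, minority=1, k=5, n_new=20):
    """Generate n_new synthetic minority samples from borderline points."""
    X_min = X[y == minority]
    danger = []
    for x in X_min:
        # k nearest neighbors over the whole set (index 0 is x itself).
        d = np.linalg.norm(X - x, axis=1)
        nn = y[np.argsort(d)[1:k + 1]]
        n_maj = np.sum(nn != minority)
        if k / 2 <= n_maj < k:   # borderline, but not surrounded (noise)
            danger.append(x)
    danger = np.array(danger)
    if len(danger) == 0:         # sketch-only fallback: use all minority
        danger = X_min
    synth = []
    for _ in range(n_new):
        x = danger[rng.integers(len(danger))]
        # Interpolate toward a random minority neighbor of x.
        d = np.linalg.norm(X_min - x, axis=1)
        nb = X_min[rng.choice(np.argsort(d)[1:k + 1])]
        synth.append(x + rng.random() * (nb - x))
    return np.array(synth)

# Imbalanced 2-D toy data: 100 majority vs 20 minority points.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(1.5, 1, (20, 2))])
y = np.array([0] * 100 + [1] * 20)
print(borderline_smote(X, y).shape)  # (20, 2)
```

In practice one would reach for imbalanced-learn's `BorderlineSMOTE` rather than hand-rolling this; the sketch only shows why the technique rebalances classes where misclassification is most likely.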

  • Research Article
  • Cited by 4
  • 10.1007/s11042-021-10673-x
Cost-effective real-time recognition for human emotion-age-gender using deep learning with normalized facial cropping preprocess
  • Mar 2, 2021
  • Multimedia Tools and Applications
  • Ta-Te Lu + 3 more

Because of technological advancement, human face recognition has been commonly applied in various fields. Some HCI-related applications, such as camera-ready chatbots and companion robots, require gathering more information from the user’s face. In this paper, we developed a system called EAGR for emotion, age, and gender recognition, which can perceive the user’s emotion, age, and gender based on face detection. The EAGR system first applies normalized facial cropping (NFC) as a preprocessing method for training data before data augmentation, then trains three convolutional neural network (CNN) models for recognizing seven emotions (six basic plus one neutral emotion), four age groups, and two genders. For better emotion recognition, the NFC extracts facial features with hair removed; for better age and gender recognition, it extracts facial features with hair retained. Experiments were conducted on these three training models for emotion, age, and gender recognition. Recognition performance on the testing dataset, normalized for tilted heads by the proposed binocular line angle correction (BLAC), showed optimal mean accuracy rates of real-time recognition of 82.4% for seven emotions, 74.95% for four age groups, and 96.65% for two genders. Furthermore, training time can be substantially reduced via NFC preprocessing. We therefore believe the EAGR system is cost-effective in recognizing human emotions, ages, and genders, and it can be further applied in social applications to help HCI services provide more accurate feedback from pluralistic facial classifications.

  • Research Article
  • Cited by 3
  • 10.1080/00051144.2024.2371249
Data augmentation using a 1D-CNN model with MFCC/MFMC features for speech emotion recognition
  • Jul 3, 2024
  • Automatika
  • Thomas Mary Little Flower + 2 more

Speech emotion recognition (SER) is attractive in several domains, such as automated translation, call centres, intelligent healthcare, and human–computer interaction. Deep learning models for emotion identification need considerable labelled data, which is not always available in the SER field. A database needs enough speech samples, good features, and a strong classifier to identify emotions efficiently. This study uses data augmentation to increase the number of input voice samples and address the data shortage issue: the database grows by adding white noise to the speech signals. In this work, Mel-frequency Cepstral Coefficient (MFCC) and Mel-frequency Magnitude Coefficient (MFMC) features, along with a one-dimensional convolutional neural network (1D-CNN), are used to classify speech emotions. The datasets used to evaluate the model's performance were AESDD, CAFE, EmoDB, IEMOCAP, and MESD. Data augmentation with the 1D-CNN (MFMC) model performed best, with an average accuracy of 99.2% for AESDD, 99.5% for CAFE, 97.5% for EmoDB, 92.4% for IEMOCAP, and 96.9% for the MESD database. The proposed 1D-CNN (MFMC) with data augmentation outperforms the 1D-CNN (MFCC) without data augmentation in emotion recognition.
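White-noise augmentation of the kind described above is often done at a controlled signal-to-noise ratio rather than a fixed amplitude; a minimal numpy sketch (the SNR target and signal here are hypothetical, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_white_noise(signal, snr_db=20.0):
    """Corrupt a clip with white noise scaled to a chosen
    signal-to-noise ratio, a common way to grow a small corpus."""
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))   # SNR_dB = 10 log10(Ps/Pn)
    noise = rng.standard_normal(len(signal)) * np.sqrt(p_noise)
    return signal + noise

t = np.linspace(0, 1, 8000, endpoint=False)
clip = np.sin(2 * np.pi * 220 * t)               # 1 s tone at 8 kHz
noisy = add_white_noise(clip, snr_db=10.0)

# Measured SNR of the augmented clip should sit close to the 10 dB target.
residual = noisy - clip
snr = 10 * np.log10(np.mean(clip ** 2) / np.mean(residual ** 2))
print(round(snr, 1))
```

Sweeping `snr_db` over several values turns each original recording into a family of noisy variants for training.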
