Task-specific speech enhancement and data augmentation for improved multimodal emotion recognition under noisy conditions
Automatic emotion recognition (AER) systems are burgeoning, and systems based on audio, video, text, or physiological signals have emerged. Multimodal systems, in turn, have been shown to improve overall AER accuracy and to provide some robustness against artifacts and missing data. Collecting multiple signal modalities, however, can be very intrusive, time consuming, and expensive. Recent advances in deep-learning-based speech-to-text and natural language processing systems have enabled the development of reliable multimodal systems based on speech and text while requiring only the collection of audio data. Audio data, however, is extremely sensitive to environmental disturbances, such as additive noise, and thus faces challenges when deployed “in the wild.” To overcome this issue, speech enhancement algorithms have been deployed at the input signal level to improve testing accuracy in noisy conditions. Speech enhancement algorithms come in different flavors and can be optimized for different tasks (e.g., human perception vs. machine performance). Data augmentation, in turn, has been deployed at the model level during training to improve accuracy in noisy testing conditions. In this paper, we explore the combination of task-specific speech enhancement and data augmentation as a strategy to improve overall multimodal emotion recognition in noisy conditions. We show that AER accuracy under noisy conditions can be improved to levels close to those seen in clean conditions. When compared against a system without speech enhancement or data augmentation, an increase in AER accuracy of 40% was observed in a cross-corpus test, showing promising results for “in the wild” AER.
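The noise-based data augmentation described above hinges on mixing clean training audio with noise at controlled SNRs. A minimal sketch follows; the `mix_at_snr` helper and the sine-wave stand-in for clean speech are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested SNR in dB."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    return clean + noise * np.sqrt(target_noise_power / noise_power)

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # stand-in clean signal
noisy = mix_at_snr(clean, rng.standard_normal(16000), snr_db=5.0)
# Verify the achieved SNR of the mixture
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

Training on copies of a corpus mixed at several SNRs in this way is one common realization of the augmentation strategy the abstract describes.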
- Research Article
5
- 10.1007/s10772-016-9336-6
- Mar 4, 2016
- International Journal of Speech Technology
Vowel onset point (VOP) is the instant of time at which the vowel region starts in a speech signal. VOPs are used as anchor points in the design of various speech-based systems. Different algorithms exist in the literature to identify the occurrences of vowels in continuous spoken utterances. The algorithm based on combined evidence derived from source excitation, spectral peaks, and the modulation spectrum has been used as the baseline system for the present study. The baseline system provides a satisfactory level of performance under clean data conditions. Under noisy data conditions, however, its performance may be improved further by additional pre-processing of the raw speech data and post-processing of the detected VOPs. In this paper we propose to use speech enhancement techniques as a pre-processing module to remove the noise from the speech data under different noisy conditions. The pre-processed speech data is then passed through the baseline system to detect the VOPs. It has been observed that several spurious VOPs exist at the output of the baseline system. We propose a post-processing module based on the average signal-to-noise ratio and information derived from the glottal closure instant to remove the spurious VOPs. The experiments were carried out on clean data, data with artificially injected noise, and data collected from practical noisy environments. The results suggest that the proposed system using pre-processing and post-processing modules is robust and, by removing the spurious VOPs, shows an improvement of 28–35% over the existing baseline system under different noisy conditions.
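The SNR-based pruning of spurious VOP candidates can be sketched as below; the frame length, threshold, and fixed noise-floor power are hypothetical simplifications of the paper's post-processing module, which additionally uses glottal closure instants:

```python
import numpy as np

def prune_spurious_vops(signal, vops, fs, win_ms=20.0, snr_thresh_db=3.0,
                        noise_floor=1e-4):
    """Keep only VOP candidates whose local frame power exceeds an SNR-like
    threshold over an assumed noise-floor power (hypothetical values)."""
    half = int(fs * win_ms / 1000) // 2
    kept = []
    for v in vops:
        frame = signal[max(0, v - half):v + half]
        local_power = np.mean(frame ** 2) if frame.size else 0.0
        if 10 * np.log10(local_power / noise_floor + 1e-12) >= snr_thresh_db:
            kept.append(v)
    return kept

fs = 8000
sig = np.zeros(fs)
sig[3900:4100] = 0.5                      # a single voiced burst
kept = prune_spurious_vops(sig, [1000, 4000], fs)  # candidate in silence is dropped
```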
- Conference Article
11
- 10.1109/cesys.2017.8321280
- Oct 1, 2017
The aim of Speech Enhancement (SE) algorithms is to recover clean speech from noise-corrupted speech. SE algorithms find application in speech recognition, communication, signal compression, and hearing aid devices. Speech enhancement can be performed in several ways depending on the type of noise, the parameter being enhanced, the domain of operation, and the computational load. This paper discusses various speech enhancement algorithms, including spectral subtraction, multi-band spectral subtraction, statistical estimation, phase spectrum compensation, optimal filtering, and the discrete wavelet transform. The speech processed by these SE algorithms is compared under the same set of conditions using five metrics: the Perceptual Evaluation of Speech Quality, the Short-Time Objective Intelligibility measure, Minimum Mean Square Error, the Itakura–Saito distance, and processing time. The noisy speech samples are taken from the NOIZEUS database under various noise conditions and at different Signal-to-Noise Ratios.
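A minimal sketch of the classical single-channel magnitude spectral subtraction that several of the surveyed methods build on; the frame sizes and the assumption that the first few frames are speech-free are illustrative choices, not the paper's setup:

```python
import numpy as np

def spectral_subtract(noisy, n_fft=512, hop=256, noise_frames=6):
    """Magnitude spectral subtraction with overlap-add resynthesis.
    The noise spectrum is estimated from the first `noise_frames` frames,
    which are assumed to contain no speech."""
    window = np.hanning(n_fft)
    frames = [noisy[i:i + n_fft] * window
              for i in range(0, len(noisy) - n_fft + 1, hop)]
    spectra = [np.fft.rfft(f) for f in frames]
    noise_mag = np.mean([np.abs(s) for s in spectra[:noise_frames]], axis=0)
    out = np.zeros(len(noisy))
    for i, s in enumerate(spectra):
        mag = np.maximum(np.abs(s) - noise_mag, 0.0)   # half-wave rectification
        out[i * hop:i * hop + n_fft] += np.fft.irfft(mag * np.exp(1j * np.angle(s)))
    return out

rng = np.random.default_rng(1)
noisy = rng.standard_normal(8192) * 0.1   # stationary noise only
out = spectral_subtract(noisy)            # residual power drops markedly
```

The noisy phase is reused unchanged, which is what gives rise to the residual "musical noise" artifacts that several of the surveyed variants try to suppress.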
- Research Article
25
- 10.1016/j.specom.2008.05.004
- May 20, 2008
- Speech Communication
Combined speech enhancement and auditory modelling for robust distributed speech recognition
- Research Article
24
- 10.1016/j.apacoust.2016.05.012
- Jun 3, 2016
- Applied Acoustics
A new regularized forward blind source separation algorithm for automatic speech quality enhancement
- Research Article
10
- 10.1016/j.compeleceng.2016.04.006
- Apr 26, 2016
- Computers & Electrical Engineering
An efficient frequency-domain adaptive forward BSS algorithm for acoustic noise reduction and speech quality enhancement
- Research Article
6
- 10.1016/j.apacoust.2021.108382
- Sep 14, 2021
- Applied Acoustics
Robust children’s speech recognition in zero resource condition
- Research Article
15
- 10.1007/s10772-015-9310-8
- Oct 12, 2015
- International Journal of Speech Technology
While i-vectors with probabilistic linear discriminant analysis (PLDA) can achieve state-of-the-art performance in speaker verification, the mismatch caused by acoustic noise remains a key factor affecting system performance. In this paper, a fusion system that combines a multi-condition signal-to-noise ratio (SNR)-independent PLDA model and a mixture of SNR-dependent PLDA models is proposed to make speaker verification systems more noise robust. First, the whole range of SNR in which a verification system is expected to operate is divided into several narrow ranges. Then, a set of SNR-dependent PLDA models, one for each narrow SNR range, are trained. During verification, the SNR of the test utterance is used to determine which of the SNR-dependent PLDA models is used for scoring. To further enhance performance, the SNR-dependent and SNR-independent models are fused using linear and logistic regression fusion. The performance of the fusion system and the SNR-dependent system is evaluated on the NIST 2012 speaker recognition evaluation for both noisy and clean conditions. Results show that a mixture of SNR-dependent PLDA models performs better in both clean and noisy conditions. It was also found that the fusion system is more robust than the conventional i-vector/PLDA systems under noisy conditions.
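The SNR routing and score fusion described above can be sketched as follows; the SNR ranges and fusion weights are hypothetical placeholders, not values from the paper:

```python
import math

SNR_RANGES = [(-5.0, 5.0), (5.0, 15.0), (15.0, 100.0)]  # dB, illustrative split

def select_model(snr_db):
    """Route a test utterance to the PLDA model trained for its SNR range."""
    for idx, (lo, hi) in enumerate(SNR_RANGES):
        if lo <= snr_db < hi:
            return idx
    return len(SNR_RANGES) - 1  # clamp out-of-range SNRs to the last model

def fuse(score_dep, score_indep, w=(0.6, 0.4), bias=0.0):
    """Linear score-level fusion followed by logistic squashing."""
    z = w[0] * score_dep + w[1] * score_indep + bias
    return 1.0 / (1.0 + math.exp(-z))
```

In the paper the fusion weights are learned (linear and logistic regression); here they are fixed purely for illustration.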
- Research Article
- 10.32628/cseit24103216
- Jun 27, 2024
- International Journal of Scientific Research in Computer Science, Engineering and Information Technology
Emotion detection is an important step toward enhancing human-computer interaction, making systems capable of identifying and responding to the emotional state of users. Multimodal emotion detection systems that fuse auditory and visual information are emerging, as the two modalities provide complementary and robust cues to expressive emotional states. Multimodal systems enhance the quality of interaction and can serve many applications, such as diagnosing emotional disorders, monitoring automotive safety, and improving human-robot interaction. Traditional models based on single-modality data suffer from low accuracy and high computational cost owing to the high dimensionality and dynamic nature of the data. Multimodal systems, on the other hand, exploit the synergy between audio and visual data, yielding better performance and higher accuracy in inferring subtle emotional expressions. Recent advances in transfer learning and deep learning have driven the latest improvements in these systems. This research proposal therefore devises a multimodal emotion recognition system that integrates speech and facial information through transfer learning for improved accuracy and robustness. To this end, the objectives of the research include comparing different transfer-learning strategies, assessing the impact of pre-trained models on speech-based emotion recognition, and examining the role of voice activity detection in the process. Advanced neural network architectures such as Spatial Transformer Networks and bidirectional LSTMs for facial emotion recognition will also be tested.
Early and late fusion strategies will also be evaluated to find the best way to combine speech and facial data. This research will address several challenges, including data complexity, the trade-off between model performance and robustness, computational limitations, and the standardization of evaluations, in order to develop a working and robust emotion recognition system that enhances digital interaction and applies to practical areas. The goal is to create a system that overcomes the limitations of single-modality models through state-of-the-art advances in deep learning and transfer learning for improved emotion detection performance.
- Research Article
26
- 10.1016/j.apacoust.2018.02.002
- Feb 16, 2018
- Applied Acoustics
A dual fast NLMS adaptive filtering algorithm for blind speech quality enhancement
- Conference Article
6
- 10.1109/icassp.2016.7471745
- Mar 1, 2016
We propose a novel Power Spectral Density (PSD) estimator for multi-microphone systems operating in reverberant and noisy conditions. The estimator is derived using the maximum likelihood approach and is based on a blocked and pre-whitened additive signal model. The intended application of the estimator is in speech enhancement algorithms, such as the Multi-channel Wiener Filter (MWF) and the Minimum Variance Distortionless Response (MVDR) beamformer. We evaluate these two algorithms in a speech dereverberation task and compare the performance obtained using the proposed and a competing PSD estimator. Instrumental performance measures indicate an advantage of the proposed estimator over the competing one. In a speech intelligibility test all algorithms significantly improved the word intelligibility score. While the results suggest a minor advantage of using the proposed PSD estimator, the difference between algorithms was found to be statistically significant only in some of the experimental conditions.
- Research Article
14
- 10.1016/j.apacoust.2015.02.010
- Mar 30, 2015
- Applied Acoustics
A two-sensor Gauss–Seidel fast affine projection algorithm for speech enhancement and acoustic noise reduction
- Conference Article
4
- 10.1109/acii.2015.7344653
- Sep 1, 2015
My PhD work aims at developing computational methodologies for automatic emotion recognition from audiovisual behavioral data. A main challenge in automatic emotion recognition is that human behavioral data are highly complex, due to multiple sources that vary and modulate behaviors. My goal is to provide computational frameworks for understanding and controlling for multiple sources of variation in human behavioral data that co-occur with the production of emotion, with the aim of improving automatic emotion recognition systems [1]-[6]. In particular, my research aims at providing representation, modeling, and analysis methods for complex and time-changing behaviors in human audio-visual data by introducing temporal segmentation and time-series analysis techniques. This research contributes to the affective computing community by improving the performance of automatic emotion recognition systems and increasing the understanding of affective cues embedded within complex audio-visual data.
- Research Article
6
- 10.1109/taslp.2020.3040850
- Nov 30, 2020
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
Recently we presented a modulation-domain multichannel Kalman filtering (MKF) algorithm for speech enhancement, which jointly exploits the inter-frame modulation-domain temporal evolution of speech and the inter-channel spatial correlation to estimate the clean speech signal. The goal of speech enhancement is to suppress noise while keeping the speech undistorted, and a key problem is to achieve the best trade-off between speech distortion and noise reduction. In this paper, we extend the MKF by presenting a modulation-domain parametric MKF (PMKF) which includes a parameter that enables flexible control of the speech enhancement behaviour in each time-frequency (TF) bin. Based on the decomposition of the MKF cost function, a new cost function for PMKF is proposed, which uses the controlling parameter to weight the noise reduction and speech distortion terms. An optimal PMKF gain is derived using a minimum mean squared error (MMSE) criterion. We analyse the performance of the proposed MKF, and show its relationship to the speech distortion weighted multichannel Wiener filter (SDW-MWF). To evaluate the impact of the controlling parameter on speech enhancement performance, we further propose PMKF speech enhancement systems in which the controlling parameter is adaptively chosen in each TF bin. Experiments on a publicly available head-related impulse response (HRIR) database in different noisy and reverberant conditions demonstrate the effectiveness of the proposed method.
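The distortion-vs.-noise-reduction trade-off that the controlling parameter mediates can be illustrated with the single-channel speech-distortion-weighted Wiener gain that the paper relates the PMKF to; this is an illustrative analogue, not the paper's multichannel Kalman filter:

```python
def sdw_gain(speech_psd, noise_psd, beta=1.0):
    """Speech-distortion-weighted Wiener-type gain for one time-frequency bin:
    beta > 1 favours noise reduction, beta < 1 favours low speech distortion;
    beta = 1 recovers the standard Wiener gain."""
    return speech_psd / (speech_psd + beta * noise_psd)

g_wiener = sdw_gain(1.0, 1.0)            # equal speech and noise power
g_aggressive = sdw_gain(1.0, 1.0, beta=4.0)  # heavier noise suppression
```

Choosing `beta` per time-frequency bin, as the paper's adaptive systems do for their controlling parameter, lets the filter suppress noise-dominated bins harder while protecting speech-dominated ones.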
- Research Article
129
- 10.1121/1.1852873
- Mar 1, 2005
- The Journal of the Acoustical Society of America
A single-channel speech enhancement algorithm utilizing speech pause detection and nonlinear spectral subtraction is proposed for cochlear implant patients in the present study. The spectral subtraction algorithm estimates the short-time spectral magnitude of speech by subtracting the estimated noise spectral magnitude from the noisy speech spectral magnitude. The artifacts produced by spectral subtraction (such as “musical noise”) were significantly reduced by combining a variance-reduced gain function and spectral flooring. Sentence recognition by seven cochlear implant subjects was tested under different noisy listening conditions (speech-shaped noise and 6-talker speech babble at +9, +6, +3, and 0 dB SNR) with and without the speech enhancement algorithm. For speech-shaped noise, performance for all subjects at all SNRs was significantly improved by the speech enhancement algorithm; for speech babble, performance was only modestly improved. The results suggest that the proposed speech enhancement algorithm may be beneficial for implant users in noisy listening conditions.
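The musical-noise countermeasure used here, over-subtraction combined with a spectral floor, reduces to a simple per-bin rule; the parameter values below are illustrative, not those of the paper:

```python
import numpy as np

def floored_subtraction(noisy_mag, noise_mag, alpha=2.0, floor=0.1):
    """Subtract alpha times the noise magnitude estimate, but never let a bin
    fall below floor * noise_mag; the floor masks the isolated residual peaks
    perceived as musical noise."""
    return np.maximum(noisy_mag - alpha * noise_mag, floor * noise_mag)

# Noise-dominated bin is clamped to the floor; speech-dominated bin passes through
mags = floored_subtraction(np.array([1.0, 10.0]), np.array([1.0, 1.0]))
```

Without the floor, noise-dominated bins would be zeroed and their neighbours left as isolated spectral peaks, which is exactly the tonal "musical noise" artifact the study seeks to reduce.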
- Research Article
2
- 10.1016/j.specom.2016.04.004
- Apr 29, 2016
- Speech Communication
A pre-processing method for improvement of vowel onset point detection under noisy conditions