Raw Speech Signal Research Articles

Recent studies have reported that the performance of Automatic Speech Recognition (ASR) systems designed for normal speech deteriorates notably when they are evaluated on whispered speech. Detecting whispered speech is therefore useful for reducing the mismatch between training and testing conditions. This paper proposes two new Glottal Flow (GF)-based features for whispered speech detection: the GF-based Mel-Frequency Cepstral Coefficient (GF-MFCC), a magnitude-based feature, and the GF-based relative phase (GF-RP), a phase-based feature. The main contribution of the proposed features is that they extract magnitude and phase information from the GF signal. In GF-MFCC, Mel-frequency cepstral coefficient (MFCC) feature extraction is modified so that the GF signal estimated by iterative adaptive inverse filtering replaces the raw speech signal as the input. Similarly, GF-RP modifies relative phase (RP) feature extraction by using the GF signal instead of the raw speech signal. Whispered speech production yields a lower-amplitude glottal source than normal speech production; thus, the Discrete Fourier Transform (DFT) of whispered speech provides lower magnitude and phase information, which distinguishes it from normal speech. It is therefore hypothesized that both proposed features are useful for whispered speech detection. In addition to the individual GF-MFCC and GF-RP features, feature-level and score-level combinations are proposed to further improve detection performance. The proposed features and combinations are evaluated on the CHAIN corpus. The proposed GF-MFCC outperforms MFCC, and GF-RP outperforms RP. Further improvements are obtained with the feature-level combinations of MFCC and GF-MFCC (MFCC&GF-MFCC) and of RP and GF-RP (RP&GF-RP), compared with using either feature alone. In addition, the combined score of MFCC&GF-MFCC and RP&GF-RP gives the best frame-level accuracy of 95.01% and an utterance-level accuracy of 100%.
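The difference between feature-level combination, score-level combination, and frame-level versus utterance-level decisions can be illustrated with a minimal sketch. The abstract does not specify how the paper fuses features or scores, so everything here is an assumption made for illustration: plain concatenation for feature-level fusion, a weighted sum for score-level fusion, a majority vote over frames for the utterance decision, and the hypothetical function names, `weight`, and `threshold` parameters.

```python
from typing import List, Sequence


def fuse_features(a: Sequence[Sequence[float]],
                  b: Sequence[Sequence[float]]) -> List[List[float]]:
    """Feature-level combination (assumed): concatenate the two
    per-frame feature vectors, e.g. MFCC and GF-MFCC."""
    if len(a) != len(b):
        raise ValueError("both streams must have the same number of frames")
    return [list(fa) + list(fb) for fa, fb in zip(a, b)]


def fuse_scores(s1: Sequence[float], s2: Sequence[float],
                weight: float = 0.5) -> List[float]:
    """Score-level combination (assumed): weighted sum of the per-frame
    scores produced by two separately trained classifiers."""
    if len(s1) != len(s2):
        raise ValueError("both score lists must have the same length")
    return [weight * x + (1.0 - weight) * y for x, y in zip(s1, s2)]


def utterance_decision(frame_scores: Sequence[float],
                       threshold: float = 0.0) -> bool:
    """Utterance-level decision (assumed): majority vote over frames.
    Returns True (whispered) if more than half the frames score above
    the threshold."""
    votes = sum(1 for s in frame_scores if s > threshold)
    return votes * 2 > len(frame_scores)


# Toy usage with made-up numbers (not data from the paper).
fused = fuse_scores([1.0, -1.0, 0.5], [3.0, -3.0, 0.5])
is_whispered = utterance_decision(fused)
```

A majority vote is one reason an utterance-level accuracy can exceed the frame-level accuracy: individual frame errors are outvoted as long as most frames in the utterance are classified correctly.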

The success of deep learning in speech emotion recognition has led to its application on resource-constrained devices. It has been applied in human-to-machine interaction applications such as social living assistance, authentication, health monitoring and alertness systems. To ensure a good user experience, robust, accurate and computationally efficient deep learning models are necessary. Recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks, gated recurrent units (GRUs) and their variants, which operate sequentially, are often used to learn time-series sequences of the signal and to analyze long-term dependencies and the context of utterances in the speech signal. However, because of their sequential operation, they suffer from convergence problems, slow training that consumes a lot of memory, and the vanishing gradient problem. In addition, they do not consider spatial cues that may exist in the speech signal. Therefore, we propose an attention-based multi-learning model (ABMD) that uses residual dilated causal convolution (RDCC) blocks and dilated convolution (DC) layers with multi-head attention. The proposed ABMD model achieves comparable performance while capturing global, contextualized long-term dependencies between features in parallel, using a large receptive field with a smaller increase in the number of parameters relative to the number of layers, and it considers spatial cues among the speech features. Spectral and voice quality features extracted from the raw speech signals are used as inputs. The proposed ABMD model obtained a recognition accuracy and F1 score of 93.75% and 92.50% on the SAVEE dataset, 85.89% and 85.34% on the RAVDESS dataset and 95.93% and 95.83% on the EMODB dataset. The model’s robustness, measured by the confusion ratio of the individual discrete emotions, also improved when validated on the same datasets, especially for happiness, which is often confused with emotions that lie in the same dimensional plane.
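As a rough illustration of why dilated causal convolutions give a large receptive field with few parameters, the sketch below implements a single-channel dilated causal convolution, a simple residual wrapper, and the receptive-field formula for a stack of such layers. It is not the ABMD architecture: the abstract gives no layer sizes, so the ReLU activation, the kernel values, and all function names are assumptions chosen only to show the mechanism.

```python
from typing import List, Sequence


def dilated_causal_conv(x: Sequence[float], kernel: Sequence[float],
                        dilation: int) -> List[float]:
    """y[t] = sum_i kernel[i] * x[t - i * dilation].
    Samples before t = 0 are treated as zero, so y[t] never depends
    on future inputs (causality)."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(kernel):
            j = t - i * dilation
            if j >= 0:
                acc += w * x[j]
        out.append(acc)
    return out


def residual_block(x: Sequence[float], kernel: Sequence[float],
                   dilation: int) -> List[float]:
    """Residual connection (assumed form): input plus the
    ReLU-activated convolution output."""
    y = dilated_causal_conv(x, kernel, dilation)
    return [xi + max(0.0, yi) for xi, yi in zip(x, y)]


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field of stacked layers, one convolution per layer:
    1 + (kernel_size - 1) * sum(dilations)."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Four layers with kernel size 2 and dilations 1, 2, 4, 8 hold only
# 4 * 2 = 8 weights per channel pair but cover 16 time steps.
rf = receptive_field(2, [1, 2, 4, 8])
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth while the parameter count grows only linearly, and every output position can be computed in parallel, unlike the step-by-step updates of an RNN.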

Articles published on Raw Speech Signal

Automatic speaker and age identification of children from raw speech using sincNet over ERB scale

WavDepressionNet: Automatic Depression Level Prediction Via Raw Speech Signals

MSER: Multimodal speech emotion recognition using cross-attention with deep fusion

Non-uniform Speaker Disentanglement For Depression Detection From Raw Speech Signals.

Improving Parkinson’s disease recognition through voice analysis using deep learning

A deep learning-based model for detecting depression in senior population.

Whispered Speech Detection Using Glottal Flow-Based Features

CyTex: Transforming speech to textured images for speech emotion recognition

Attention-Based Multi-Learning Approach for Speech Emotion Recognition With Dilated Convolution

End-to-End Pathological Speech Detection Using Wavelet Scattering Network

Robust speech watermarking by a jointly trained embedder and detector using a DNN

One-dimensional convolutional neural network and hybrid deep-learning paradigm for classification of specific language impaired children using their speech

A hybrid CNN-LiGRU acoustic modeling using raw waveform sincnet for Hindi ASR

LIS-Net: An end-to-end light interior search network for speech command recognition

Direct Speech-to-Image Translation

Interpretable Representation Learning for Speech and Audio Signals Based on Relevance Weighting

Glottal Source Information for Pathological Voice Detection

End-to-End Speech Emotion Recognition With Gender Information

End-to-end acoustic modeling using convolutional neural networks for HMM-based automatic speech recognition

Learning emotion-discriminative and domain-invariant features for domain adaptation in speech emotion recognition
