Speech Waveform Research Articles

Head motion generation for speech-driven virtual agent animation is commonly explored with handcrafted audio features, such as MFCCs, supplemented in the literature by additional features such as energy and F0. In this paper, we study the direct use of the speech waveform to generate head motion. We claim that creating a task-specific feature from the waveform leads to better overall performance than using standard acoustic features. At the same time, we completely abandon the handcrafted feature extraction process, which makes the approach more effective. However, the difficulty of creating a task-specific feature from the waveform is the staggering quantity of irrelevant information it contains, which can make neural network training cumbersome. Thus, we apply a canonical-correlation-constrained autoencoder (CCCAE), which compresses the high-dimensional waveform into a low-dimensional embedded feature with minimal reconstruction error, while retaining the relevant information by maximising the canonical correlation with head motion. We extend our previous research by including more speakers in our dataset and by adapting the feature to a recurrent neural network, to show the feasibility of our proposed feature. In comparisons between different acoustic features, our proposed feature, WavCCCAE, shows at least a 20% improvement in correlation relative to the waveform, and outperforms the popular acoustic feature, MFCC, by at least 5% for all speakers. In the feedforward neural network regression (FNN-regression) system, the WavCCCAE-based system shows comparable performance in objective evaluation. In long short-term memory (LSTM) experiments, LSTM models improve the overall performance in normalised mean square error (NMSE) and CCA metrics, and adapt to the WavCCCAE feature better, which makes the proposed LSTM-regression system outperform the MFCC-based system. We also redesign the subjective evaluation, and the subjective results show that participants in the MUSHRA test judged the animations generated by the WavCCCAE-based models to be better than those of the other models.
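The two goals the abstract assigns to the CCCAE, low reconstruction error and high canonical correlation with head motion, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the `weight` parameter, the summing of canonical correlations, and the `eps` regulariser are all assumptions made for illustration, and a real model would compute this objective on network outputs inside a training loop.

```python
import numpy as np

def canonical_correlations(x, y, eps=1e-8):
    """Canonical correlations between x (n, p) and y (n, q), largest first."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    # Covariance blocks; eps keeps the matrices invertible (an assumption).
    sxx = x.T @ x / (n - 1) + eps * np.eye(x.shape[1])
    syy = y.T @ y / (n - 1) + eps * np.eye(y.shape[1])
    sxy = x.T @ y / (n - 1)

    def inv_sqrt(m):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, v = np.linalg.eigh(m)
        return v @ np.diag(w ** -0.5) @ v.T

    # Singular values of the whitened cross-covariance are the correlations.
    t = inv_sqrt(sxx) @ sxy @ inv_sqrt(syy)
    return np.clip(np.linalg.svd(t, compute_uv=False), 0.0, 1.0)

def cccae_loss(frames, reconstruction, embedding, head_motion, weight=1.0):
    """Hypothetical objective: reconstruction MSE minus weighted CCA sum."""
    mse = np.mean((frames - reconstruction) ** 2)
    cca = canonical_correlations(embedding, head_motion).sum()
    return mse - weight * cca
```

Minimising such a loss lowers the reconstruction error and raises the correlation term at the same time; how the two terms are actually balanced in the paper is not stated in the abstract.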


Developing an automatic speaker verification (ASV) system for children's speech is a formidable task for a number of reasons, the dearth of domain-specific data being one of them. The challenge intensifies further with short utterances of speech, a relatively unexplored domain in children's ASV. Voice-based biometric systems suffer badly when the speech data used for enrollment or verification is inadequate in both volume and duration. To circumvent the data scarcity issue, this paper extensively explores in-domain as well as various out-of-domain data augmentation techniques, and proposes a data augmentation approach that encompasses both. The in-domain data augmentation incorporates speed perturbation of children's speech. The out-of-domain data come from adult speakers, whose acoustic attributes are known to be in stark contrast to those of child speakers. The acoustic characteristics of the adult speech data in this study are altered on two fronts, namely speech waveform modification and feature-level modification, to render the adult speech acoustically similar to children's speech prior to augmentation. The speech waveform modification involves signal processing techniques such as prosody modification, formant modification and voice conversion. The feature-level modification, on the other hand, involves vocal-tract length normalization (VTLN), which explicitly models and compensates for the ill effects of variations in vocal tract length by linearly warping the frequency axis of speech signals. The proposed data augmentation approach helps not only to increase the amount of training data but also to capture the missing target attributes effectively, which boosts verification performance, as shown by a relative improvement of 48.01% in equal error rate (EER) with respect to the baseline system. Furthermore, the conventionally used Mel-frequency cepstral coefficients (MFCC) are known to average out the higher-frequency components. Prior work has shown that a significant amount of relevant acoustic information is available in the higher-frequency region of children's speech. Therefore, effective preservation of higher-frequency content in children's speech is of paramount importance and must be addressed to develop a reliable and robust children's ASV system. In this regard, frame-level concatenation of the MFCC features with the inverse-Mel-frequency cepstral coefficient (IMFCC) features is undertaken, with the sole intention of effectively preserving the higher-frequency content in the children's speech data. The low canonical correlation between the MFCC and IMFCC feature vectors provides the impetus for their fusion. The feature concatenation approach, when combined with the proposed data augmentation, further improves verification performance. The experimental results support our claims: we achieve an overall relative reduction of 50.15% in equal error rate.
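For readers unfamiliar with the quantities in this abstract, the sketch below shows one simple way to estimate an equal error rate from verification scores, how a relative reduction in EER is computed, and what frame-level concatenation of two feature streams looks like. It is an illustrative sketch only: the function names and the threshold sweep are assumptions, not the authors' evaluation code, and the feature matrices are assumed to be already extracted with one row per frame.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the point where false accepts equal false rejects."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # False acceptance rate: impostor trials scoring at or above the threshold.
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # False rejection rate: genuine trials scoring below the threshold.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

def relative_reduction(baseline, improved):
    """Relative reduction in percent, e.g. of EER against a baseline."""
    return 100.0 * (baseline - improved) / baseline

def concat_frame_features(mfcc, imfcc):
    """Frame-level concatenation: stack two (frames, dims) matrices per frame."""
    if mfcc.shape[0] != imfcc.shape[0]:
        raise ValueError("both feature streams need the same number of frames")
    return np.concatenate([mfcc, imfcc], axis=1)
```

As a usage example, `relative_reduction(10.0, 5.0)` gives `50.0`, which is the sense in which the abstract's percentages compare a system's EER with that of the baseline.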


Articles published on Speech Waveform

Multi-scale Information Aggregation for Spoofing Detection

Emotion Classification from Speech Waveform Using Machine Learning and Deep Learning Techniques

Anatomy of a Spectrogram

Hybrid CNN-BiLSTM architecture with multiple attention mechanisms to enhance speech emotion recognition

The extraction method used for English–Chinese machine translation corpus based on bilingual sentence pair coverage

Noise-Robust Automatic Speech Recognition: A Case Study for Communication Interference

Speech Emotion Recognition under Noisy Environments with SNR Down to −6 dB Using Multi-Decoder Wave-U-Net

Speech decoding from stereo-electroencephalography (sEEG) signals using advanced deep learning methods

Vocoder Detection of Spoofing Speech Based on GAN Fingerprints and Domain Generalization

Speech-driven head motion generation from waveforms

Recalling-Enhanced Recurrent Neural Network optimized with Chimp Optimization Algorithm based speech enhancement for hearing aids

Fast Neural Speech Waveform Generative Models With Fully-Connected Layer-Based Upsampling

Feasibility Study of Parkinson's Speech Disorder Evaluation With Pre-Trained Deep Learning Model for Speech-to-Text Analysis

Experimental studies for improving the performance of children's speaker verification system using short utterances

Effective Zero-Shot Multi-Speaker Text-to-Speech Technique Using Information Perturbation and a Speaker Encoder

W2VC: WavLM representation based one-shot voice conversion with gradient reversal distillation and CTC supervision

MGlif: A software tool for manual glottal inverse filtering

Retracted: Intelligent English Translation Based on Intelligent Speech Waveform Analysis

Emotion Recognition From Speech and Text using Long Short-Term Memory

An Escalated Eavesdropping Attack on Mobile Devices via Low-Resolution Vibration Signals
