Differentiable Measures for Speech Spectral Modeling

Miguel Arjona Ramirez, Renata Lopes Rosa, Wesley Beccaro, Demostenes Zegarra Rodriguez

doi:10.1109/access.2022.3150728

Open Access

https://doi.org/10.1109/access.2022.3150728

Abstract

Autoregressive models for the envelope of speech power spectral densities (PSDs) are refined by the self-supervised spectral learning machine (S3LM), which is provided with differentiable spectral objective functions, including the Itakura-Saito divergence (ISD), the Kullback-Leibler divergence (KLD), the reverse KLD (RKLD) and the log spectral distortion (LSD), which display more significant results. However, in order to assess the models more perceptually, a method is proposed based upon perturbations around perfect reconstruction analysis-synthesis configurations. In the cross-excitation analysis-synthesis assessment (CEASA) method, the residual signals generated by the analysis filters of the spectral models are injected as excitation into the synthesis filters derived from the same and other models. The outputs are evaluated by the perceptual evaluation of speech quality (PESQ) and the Itakura divergence (ID), which are averaged over a set of models obtained using the objective functions mentioned above. The results show superior performance when the RKLD is used as the loss function for the estimation of the spectral models, with the ISD ranking close behind. The focus of these divergences on the spectral peaks is argued to be the most important factor for this behavior. Specifically, using the PESQ scores obtained with CEASA, the RKLD loss is found to improve the performance by 1.0%, 4.0% and 19.3% with respect to the open-loop analysis, the KLD and the LSD models, respectively, while the corresponding improvements for the ISD loss are 0.1%, 3.0% and 18.2%, and the RKLD models outperform the ISD models by 1.0% on average. Even though the spectral measures alone are not able to unequivocally distinguish the better of the two, CEASA is shown to have enough sensitivity to distinguish their performances.
In summary, the learning machine S3LM fits models for the short-term spectral envelope of speech and, for the evaluation of its performance under several differentiable loss functions, the CEASA assessment tool has been developed. In addition, CEASA may be used for other assessments connected with speech analysis and synthesis.
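To make the four loss functions named in the abstract concrete, here is a minimal sketch that evaluates them on sampled PSDs using their common textbook forms. The function names, the averaging over frequency bins, and the choice of the generalized (unnormalized) KLD are assumptions made for illustration; the paper's exact definitions and normalizations may differ.

```python
import math


def isd(s, s_hat):
    """Itakura-Saito divergence between reference s and model s_hat (mean over bins)."""
    return sum(x / y - math.log(x / y) - 1.0 for x, y in zip(s, s_hat)) / len(s)


def kld(s, s_hat):
    """Generalized Kullback-Leibler divergence for unnormalized spectra (assumed form)."""
    return sum(x * math.log(x / y) - x + y for x, y in zip(s, s_hat)) / len(s)


def rkld(s, s_hat):
    """Reverse KLD: the same divergence with the arguments swapped."""
    return kld(s_hat, s)


def lsd(s, s_hat):
    """Log spectral distortion: RMS difference of the two spectra in dB."""
    sq = sum((10.0 * math.log10(x / y)) ** 2 for x, y in zip(s, s_hat))
    return math.sqrt(sq / len(s))


if __name__ == "__main__":
    # Toy PSD with one peak; the model underestimates the peak by a factor of 2.
    reference = [1.0, 4.0, 1.0]
    model = [1.0, 2.0, 1.0]
    for name, fn in [("ISD", isd), ("KLD", kld), ("RKLD", rkld), ("LSD", lsd)]:
        print(name, fn(reference, model))
```

All four measures are zero when the model matches the reference exactly. The ISD in this form depends only on the ratio of the two spectra, so it is unchanged when both are scaled by the same factor, whereas the KLD and RKLD scale with the spectra.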

Highlights

Models for the envelope of speech spectra [1] are important for various tasks that require speech analysis, such as speech coding, speech synthesis, automatic speech recognition and speech enhancement. Autoregressive models for the speech power spectral density $S(e^{j\omega})$ may be obtained by the application of the Wiener-Khinchin theorem to get the autocorrelation function [2] $R(m) = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(e^{j\omega})\, e^{j\omega m}\, d\omega$ (1) for $m = 0, 1, \dots, p$, in order to determine an autoregressive model of order $p$.
In order to assess the fidelity of the spectral envelope model in more neutral conditions, the cross-excitation analysis-synthesis assessment (CEASA) was used, which is depicted in Fig. 2 for the simple case involving two models, where two prediction vectors a1 and a2 are input from S3LM or from any other modeling system.
Using the perceptual evaluation of speech quality (PESQ) scores obtained with cross-excitation analysis-synthesis assessment (CEASA), the reverse KLD (RKLD) loss is found to improve the performance by 1.0%, 4.0% and 19.3% with respect to the open-loop analysis, the Kullback-Leibler divergence (KLD) and the log spectral distortion (LSD) models, respectively, while the corresponding improvements for the Itakura-Saito divergence (ISD) loss are 0.1%, 3.0% and 18.2%, and the RKLD models outperform the ISD models by 1.0% on average.
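The first highlight above describes going from a PSD to an autocorrelation sequence and then to an autoregressive model of order p. The sketch below illustrates that route under stated assumptions: the PSD is sampled at N uniformly spaced frequencies over the full circle, the integral is approximated by a Riemann sum, and the model is solved with the Levinson-Durbin recursion, which is a standard solver and not necessarily the one used in the paper. The function names and the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p are illustrative choices.

```python
import math


def autocorr_from_psd(psd, p):
    """Approximate R(m), m = 0..p, from a real, even PSD sampled at
    omega_k = 2*pi*k/N (k = 0..N-1) via a Riemann sum of the inverse transform."""
    n = len(psd)
    return [
        sum(s * math.cos(2.0 * math.pi * k * m / n) for k, s in enumerate(psd)) / n
        for m in range(p + 1)
    ]


def levinson_durbin(r, p):
    """Solve the order-p normal equations for A(z) = 1 + sum_j a[j] z^-j.
    Returns the coefficient list a (with a[0] = 1) and the prediction error power."""
    a = [1.0] + [0.0] * p
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient of stage i
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


if __name__ == "__main__":
    # Toy check: PSD of a first-order AR process with A(z) = 1 - 0.5 z^-1, unit gain.
    n_bins = 256
    psd = [1.0 / (1.25 - math.cos(2.0 * math.pi * k / n_bins)) for k in range(n_bins)]
    r = autocorr_from_psd(psd, 2)
    a, err = levinson_durbin(r, 2)
    print("autocorrelation:", r)
    print("AR coefficients:", a, "error power:", err)
```

Fitting an order-2 model to this first-order toy spectrum should recover a coefficient close to -0.5 followed by a second coefficient close to zero, since the extra stage has nothing left to predict.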

Summary

INTRODUCTION

Models for the envelope of speech spectra [1] are important for various tasks that require speech analysis, such as speech coding, speech synthesis, automatic speech recognition and speech enhancement. In connection with these applications, we propose an analysis-synthesis assessment method for the spectral models which is more suitable for evaluating their performance in action. In this context, this work intends to improve the open-loop analytical (OLA) model using a machine learning algorithm in conjunction with several differentiable loss functions that are applied to the reference and reconstructed power spectral densities. We use signal processing techniques such as autoregressive models, prediction and perfect reconstruction in analysis-synthesis systems, which are integrated with machine learning structures to come up with tied spectral weighting layers (TSWLs). These techniques are used both in the proposed learning machine, for the layers and the losses, and in the CEASA diagnostic tool, which includes analysis-synthesis techniques based on perfect reconstruction. Training and testing are simultaneous since S3LM is self-supervised.
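The perfect-reconstruction idea behind CEASA can be illustrated with a small sketch. An analysis filter A(z) turns a signal into a residual, and the matching synthesis filter 1/A(z) restores the signal exactly; feeding the residual of one model into the synthesis filter of another perturbs the output. The coefficient vectors `a1` and `a2` below are hypothetical, the sign convention A(z) = 1 + a[1] z^-1 + ... is an assumption, and the scoring step (PESQ and Itakura divergence in the paper) is not reproduced here.

```python
import math


def analysis(x, a):
    """FIR analysis filter A(z) = 1 + sum_j a[j] z^-j (a[0] assumed to be 1).
    Returns the prediction residual of x."""
    return [
        sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
        for n in range(len(x))
    ]


def synthesis(e, a):
    """All-pole synthesis filter 1/A(z) driven by the excitation e."""
    y = []
    for n in range(len(e)):
        feedback = sum(a[j] * y[n - j] for j in range(1, len(a)) if n - j >= 0)
        y.append(e[n] - feedback)
    return y


if __name__ == "__main__":
    # Hypothetical first-order models standing in for two fitted spectral models.
    a1 = [1.0, -0.9]
    a2 = [1.0, -0.5]
    x = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) for n in range(64)]

    residual_1 = analysis(x, a1)
    same = synthesis(residual_1, a1)   # same model: reconstructs x
    cross = synthesis(residual_1, a2)  # cross excitation: perturbed output

    print("max error, same model: ", max(abs(u - v) for u, v in zip(same, x)))
    print("max error, cross model:", max(abs(u - v) for u, v in zip(cross, x)))
```

In the same-model path the output matches the input up to floating-point rounding, while the cross path deviates by an amount that reflects how much the two models differ. In the assessment described here, that deviation would be scored perceptually, not by a raw sample error.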

MEASURES FOR SPECTRAL ANALYSIS

DIFFERENTIABLE LOSS COMPARISONS

ASSESSMENT RESULTS

Method

CONCLUSION

Journal: IEEE Access	Publication Date: Jan 1, 2022
Citations: 2	License type: CC BY 4.0

Similar Papers

Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis
Cédric Févotte ... Jean-Louis Durrieu
Neural Computation | VOL. 21
01 Mar 2009

Nonnegative factorization of sequences of speech and music spectra
Miguel Arjona Ramirez
01 Aug 2014

Source and Filter Estimation for Throat-Microphone Speech Enhancement
M A Tugtekin Turan ... Engin Erzin
IEEE/ACM Transactions on Audio, Speech, and Language Processing | VOL. 24
01 Feb 2016

Sensory Evaluation of Odor Approximation Using Nmf with Kullback-Leibler Divergence and Itakura-Saito Divergence in Mass Spectrum Space
Dani Prasetyawan ... Takamichi Nakamoto
Electrochemical Society Meeting Abstracts | VOL. MA2020-01
01 May 2020
