Abstract
Autoregressive models for the envelope of speech power spectral densities (PSDs) are refined by the self-supervised spectral learning machine (S3LM), which is provided with differentiable spectral objective functions. Among these, the Itakura-Saito divergence (ISD), the Kullback-Leibler divergence (KLD), the reverse KLD (RKLD) and the log spectral distortion (LSD) give the more significant results. To assess the models in a more perceptually relevant way, a method is proposed that is based on perturbations around perfect-reconstruction analysis-synthesis configurations. In the cross-excitation analysis-synthesis assessment (CEASA) method, the residual signals generated by the analysis filters of the spectral models are injected as excitation into the synthesis filters derived from the same and from other models. The outputs are evaluated by the perceptual evaluation of speech quality (PESQ) and the Itakura divergence (ID), which are averaged over a set of models obtained using the objective functions mentioned above. The results show superior performance when the RKLD is used as the loss function for estimating the spectral models, with the ISD ranking close behind. The emphasis that these divergences place on the spectral peaks is argued to be the most important factor behind this behavior. Specifically, using the PESQ scores obtained with CEASA, the RKLD loss is found to improve the performance by 1.0%, 4.0% and 19.3% with respect to the open-loop analysis, the KLD and the LSD models, respectively, while the corresponding improvements for the ISD loss are 0.1%, 3.0% and 18.2%, and the RKLD models outperform the ISD models by 1.0% on average. Even though the spectral measures alone cannot unequivocally determine which of the two is better, CEASA is shown to be sensitive enough to distinguish their performances.
In summary, the learning machine S3LM fits models for the short-term spectral envelope of speech, and the CEASA tool has been developed to evaluate its performance under several differentiable loss functions. In addition, CEASA may be used for other assessments connected with speech analysis and synthesis.
Highlights
Models for the envelope of speech spectra [1] are important for various tasks that require speech analysis, such as speech coding, speech synthesis, automatic speech recognition and speech enhancement. Autoregressive models for the speech power spectral density $S(e^{j\omega})$ may be obtained by applying the Wiener-Khinchin theorem to get the autocorrelation function [2]

$$R(m) = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(e^{j\omega})\, e^{j\omega m}\, d\omega \qquad (1)$$

for $m = 0, 1, \dots, p$, in order to determine an autoregressive model of order $p$.
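As a rough illustration of the step described above, the following Python sketch approximates the integral in Eq. (1) with an inverse FFT of a uniformly sampled PSD and then solves for the order-p predictor with the Levinson-Durbin recursion. The function names, the FFT grid size and the test model `a_true` are assumptions made for this example and do not come from the paper. The sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p is also an assumption.

```python
import numpy as np

def autocorr_from_psd(psd, p):
    """Approximate R(0..p) from a PSD sampled at N uniform frequencies
    on [0, 2*pi) in FFT ordering (discretized Wiener-Khinchin relation)."""
    r = np.fft.ifft(psd).real
    return r[:p + 1]

def levinson_durbin(r):
    """Solve the Yule-Walker equations for autocorrelation lags r[0..p].
    Returns a = [1, a_1, ..., a_p] and the prediction error power."""
    p = len(r) - 1
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        # Correlation between the current predictor and lag i
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

# Hypothetical check: build the PSD of a known stable AR(2) model
# with unit excitation power, then recover the model from that PSD.
N = 1024
a_true = np.array([1.0, -1.2, 0.8])
A = np.fft.fft(a_true, N)                   # A(e^{jw}) on the FFT grid
psd = 1.0 / np.abs(A) ** 2
r = autocorr_from_psd(psd, p=2)
a_hat, err = levinson_durbin(r)
```

Because the PSD here is exactly all-pole, the recovered coefficients should match `a_true` up to numerical error. For a measured speech PSD the result is only the best order-p fit in the linear-prediction sense.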
In order to assess the fidelity of the spectral envelope model under more neutral conditions, the cross-excitation analysis-synthesis assessment (CEASA) was used. It is depicted in Fig. 2 for the simple case involving two models, where two prediction vectors a1 and a2 are input from S3LM or from any other modeling system.
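The cross-excitation idea for two models can be sketched in Python as follows. This is a minimal, single-frame illustration under stated assumptions: the filters are time-invariant with zero initial state, the prediction vectors use the convention a = [1, a_1, ..., a_p], and the names `analysis_filter`, `synthesis_filter` and `cross_excite` as well as the coefficient values are hypothetical. The PESQ and ID scoring stages are not included.

```python
import numpy as np

def analysis_filter(x, a):
    """FIR prediction-error filter A(z) applied to x (zero initial state)."""
    return np.convolve(x, a)[:len(x)]

def synthesis_filter(e, a):
    """All-pole filter 1/A(z) driven by the excitation e (zero initial state)."""
    p = len(a) - 1
    y = np.zeros(len(e))
    for n in range(len(e)):
        acc = e[n]
        for k in range(1, min(p, n) + 1):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y

def cross_excite(x, models):
    """out[i][j] is the residual of model i driving the synthesis
    filter of model j; the diagonal reconstructs x exactly."""
    residuals = [analysis_filter(x, a) for a in models]
    return [[synthesis_filter(e, a_j) for a_j in models] for e in residuals]

# Hypothetical input frame and two stable prediction vectors
rng = np.random.default_rng(0)
x = rng.standard_normal(256)
a1 = np.array([1.0, -1.2, 0.8])
a2 = np.array([1.0, -1.1, 0.7])
out = cross_excite(x, [a1, a2])
```

The diagonal entries `out[0][0]` and `out[1][1]` correspond to the perfect-reconstruction configurations, while the off-diagonal entries are the perturbed outputs that a quality measure would then score against `x`.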
Using the perceptual evaluation of speech quality (PESQ) scores obtained with cross-excitation analysis-synthesis assessment (CEASA), the reverse KLD (RKLD) loss is found to improve the performance by 1.0%, 4.0% and 19.3% with respect to the open-loop analysis, the Kullback-Leibler divergence (KLD) and the log spectral distortion (LSD) models, respectively, while the corresponding improvements for the Itakura-Saito divergence (ISD) loss are 0.1%, 3.0% and 18.2%, and the RKLD models outperform the ISD models by 1.0% on average.
Summary
Models for the envelope of speech spectra [1] are important for various tasks that require speech analysis, such as speech coding, speech synthesis, automatic speech recognition and speech enhancement. In connection with these applications, we propose an analysis-synthesis assessment method for the spectral models that is better suited to evaluating their performance in action. In this context, this work intends to improve the open-loop analytical (OLA) model using a machine learning algorithm in conjunction with several differentiable loss functions that are applied to the reference and reconstructed power spectral densities. We use signal processing techniques such as autoregressive models, prediction and perfect reconstruction in analysis-synthesis systems, which are integrated with machine learning structures to come up with tied spectral weighting layers (TSWLs). These techniques are used both in the proposed learning machine, for the layers and the losses, and in the CEASA diagnostic tool, which includes analysis-synthesis techniques based on perfect reconstruction. Training and testing are simultaneous since S3LM is self-supervised.
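For orientation, the loss functions applied to the reference and reconstructed PSDs can be written in their common textbook forms, as in the Python sketch below. The exact normalization, the argument order and the direction labeled "reverse" in the paper are not given in this text, so the definitions here are assumptions. They are written in NumPy for readability only, and training would need an automatic-differentiation framework. The spectra `p_ref` and `p_hat` are made-up values.

```python
import numpy as np

def isd(p_ref, p_hat):
    """Itakura-Saito divergence between two strictly positive PSDs."""
    ratio = p_ref / p_hat
    return np.mean(ratio - np.log(ratio) - 1.0)

def kld(p_ref, p_hat):
    """Generalized Kullback-Leibler divergence for unnormalized spectra."""
    return np.mean(p_ref * np.log(p_ref / p_hat) - p_ref + p_hat)

def rkld(p_ref, p_hat):
    """Reverse KLD: the same divergence with the arguments swapped (assumed)."""
    return kld(p_hat, p_ref)

def lsd(p_ref, p_hat):
    """Log spectral distortion in dB (root mean square of the dB difference)."""
    diff_db = 10.0 * np.log10(p_ref / p_hat)
    return np.sqrt(np.mean(diff_db ** 2))

# Hypothetical reference and reconstructed PSD samples
p_ref = np.array([1.0, 4.0, 2.0, 0.5])
p_hat = np.array([1.5, 3.0, 2.0, 1.0])
losses = {f.__name__: f(p_ref, p_hat) for f in (isd, kld, rkld, lsd)}
```

All four measures are zero when the two spectra coincide and positive otherwise. The ISD and both KLD variants are asymmetric in their arguments, while the LSD as defined here is symmetric.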