Abstract
Speech coding has been shown to achieve good speech quality using either waveform matching or parametric reconstruction. For very low bit rate streams, recently developed generative speech models can reconstruct high-quality wideband speech from the bit streams of standard parametric encoders at less than 3 kb/s. Generative codecs produce high-quality speech by synthesising it with a DNN conditioned on the parametric input. Existing objective speech quality models (e.g., ViSQOL and POLQA) cannot accurately evaluate the quality of coded speech from generative models, because they penalise signal differences that are not apparent in subjective listening test results. This paper presents WARP-Q, a full-reference objective speech quality metric that uses a dynamic time warping cost for MFCC representations of the signals. It is robust to the perceptually small signal changes introduced by low bit rate neural vocoders. An evaluation using waveform matching, parametric, and generative neural vocoder-based codecs, as well as channel and environmental noise, shows that WARP-Q achieves better correlation and codec quality ranking for novel codecs than traditional metrics. It is also versatile enough to capture other types of degradation, such as additive noise and transmission channel degradations.
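As a rough illustration of the core idea, a dynamic time warping (DTW) cost between two sequences of feature frames can be sketched as below. This is a minimal sketch of classic full-sequence DTW with a Euclidean frame distance, not the WARP-Q algorithm itself. The function names `euclidean` and `dtw_cost` and the toy two-dimensional frames are assumptions made for illustration; in practice the frames would be MFCC vectors extracted from the reference and degraded signals.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_cost(ref, deg):
    """Accumulated cost of the best DTW alignment between two frame sequences.

    ref, deg: lists of feature frames (each frame is a list of floats).
    Classic DTW with steps (1,0), (0,1), (1,1); an illustrative assumption,
    not the exact recursion used by WARP-Q.
    """
    n, m = len(ref), len(deg)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    # acc[i][j] holds the minimum cost of aligning ref[:i] with deg[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(ref[i - 1], deg[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# Toy frames standing in for MFCC vectors (hypothetical values).
ref = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
stretched = [[0.0, 1.0], [1.0, 2.0], [1.0, 2.0], [2.0, 3.0]]  # same content, slower
noisy = [[0.5, 1.0], [1.0, 2.5], [2.5, 3.0]]                  # perturbed frames

print(dtw_cost(ref, stretched))
print(dtw_cost(ref, noisy))
```

In this toy case the time-stretched sequence aligns with the reference at no cost, while the perturbed sequence accumulates a positive cost. That is the property that makes a warping cost tolerant of small timing differences but still sensitive to changes in the features themselves.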