Speech quality assessment with WARP‐Q: From similarity to subsequence dynamic time warp cost

Wissam A Jassim, Michael Chinen, Andrew Hines, Jan Skoglund

doi:10.1049/sil2.12151




Abstract

Speech codecs have been shown to achieve good speech quality using either waveform matching or parametric reconstruction. At very low bit rates, recently developed generative speech models can reconstruct high-quality wideband speech from the bit streams of standard parametric encoders at less than 3 kb/s. These generative codecs produce high-quality speech by synthesising it using a deep neural network (DNN) and the parametric input. Existing objective speech quality models (e.g., ViSQOL and POLQA) cannot accurately evaluate the quality of coded speech from generative models, because they penalise signal differences that are not apparent in subjective listening test results. This paper presents WARP-Q, a full-reference objective speech quality metric that uses a dynamic time warping cost computed on mel-frequency cepstral coefficient (MFCC) representations of the signals. It is robust to the perceptually small signal changes introduced by low bit rate neural vocoders. An evaluation using waveform matching, parametric, and generative neural vocoder-based codecs, as well as channel and environmental noise, shows that WARP-Q achieves better correlation and codec quality ranking for novel codecs than traditional metrics, and that it is versatile enough to capture other types of degradation, such as additive noise and transmission channel degradations.
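
To make the idea of a subsequence dynamic time warping cost concrete, the following is a minimal sketch in Python. It is not the authors' implementation. It assumes that frame-level features (for example MFCCs) have already been extracted into arrays, and the function name `subsequence_dtw_cost`, the Euclidean frame distance, and the three-step pattern are all illustrative assumptions. The query may start and end anywhere in the reference, which is what distinguishes subsequence DTW from ordinary DTW.

```python
import numpy as np


def subsequence_dtw_cost(query, reference):
    """Minimum accumulated cost of aligning `query` to any contiguous
    stretch of `reference` (hypothetical helper, for illustration only).

    query:     array of shape (n, d), n feature frames of dimension d
    reference: array of shape (m, d), m feature frames of dimension d
    """
    query = np.asarray(query, dtype=float)
    reference = np.asarray(reference, dtype=float)
    n, m = len(query), len(reference)

    # Pairwise Euclidean distance between every query and reference frame
    # (assumed distance measure).
    dist = np.linalg.norm(query[:, None, :] - reference[None, :, :], axis=2)

    acc = np.full((n, m), np.inf)
    # Free start: the first query frame may match any reference frame.
    acc[0, :] = dist[0, :]
    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, m):
            acc[i, j] = dist[i, j] + min(
                acc[i - 1, j],      # query advances, reference stays
                acc[i, j - 1],      # reference advances, query stays
                acc[i - 1, j - 1],  # both advance
            )

    # Free end: the last query frame may match any reference frame.
    return float(acc[-1, :].min())


# Toy one-dimensional "features": an exact excerpt of the reference
# and a slightly perturbed copy of the same excerpt.
reference = np.arange(10, dtype=float).reshape(-1, 1)
exact = reference[3:6]
perturbed = exact + 0.5

cost_exact = subsequence_dtw_cost(exact, reference)
cost_perturbed = subsequence_dtw_cost(perturbed, reference)
```

In this toy case the exact excerpt aligns at zero cost, while the perturbed copy accumulates a cost that grows with how far its frames are from the best-matching reference frames. A lower cost therefore indicates a closer match between the two signals.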

Full Text