Abstract

Real-time monitoring of speech quality for VoIP calls is a significant challenge. This paper presents early work on a no-reference objective model for quantifying perceived speech quality in VoIP. The overall approach uses a modular design that will help pinpoint the reason for degradations as well as quantify their impact on speech quality. The model is being designed to work with narrowband and wideband signals. This initial work focuses on rating amplitude-clipped and chopped speech, both of which are common problems in VoIP. A model sensitive to each of these degradations is presented and then tested with both synthetic and real examples of chopped and clipped speech. The results were compared with predicted MOS outputs from four objective speech quality models: ViSQOL, PESQ, POLQA and P.563. The outputs of the model's clip and chop detection modules showed consistent relationships with the quality predictions from the other objective speech quality models. Further work is planned to widen the range of degradation types captured by the model, such as non-stationary background noise and speaker echo. While other components (e.g. a voice activity detector) would be necessary to deploy the model for stand-alone VoIP monitoring, the results show good potential for using the model in a real-time monitoring tool.

Highlights

  • As digital communication has become more pervasive, the variety of channels for human speech communication has grown

  • The level of clipping increases from left to right on the x-axis and the y-axis shows the model output score. The trends in both the quiet and additive pink noise conditions show that clipping begins to be detected at a clip level of around 0.55 times the peak amplitude. This is a 12 dB peak-to-average ratio, which was reported by Kates (1994) to be the level at which clipped speech is indistinguishable from unclipped speech

  • The clip and chop measurement models for speech quality presented in this paper show promising early results and compare favourably to the other no-reference objective speech quality model
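
The clip level and peak-to-average ratio mentioned in the highlights can be illustrated with a minimal sketch. This is not the paper's detection model: the function names (`clip`, `peak_to_average_db`, `clipped_fraction`), the amplitude-modulated test tone and the 8 kHz sample rate are all assumptions chosen for illustration. It only shows how hard clipping at a fraction of the peak amplitude lowers the peak-to-average ratio and piles samples up at the clip limit.

```python
import math


def clip(signal, level):
    """Hard-clip samples to +/- (level * peak amplitude); level is in (0, 1]."""
    peak = max(abs(s) for s in signal)
    limit = level * peak
    return [max(-limit, min(limit, s)) for s in signal]


def peak_to_average_db(signal):
    """Peak-to-RMS ratio of the signal, in dB."""
    peak = max(abs(s) for s in signal)
    rms = math.sqrt(sum(s * s for s in signal) / len(signal))
    return 20.0 * math.log10(peak / rms)


def clipped_fraction(signal, tol=1e-9):
    """Fraction of samples sitting at the maximum absolute value."""
    peak = max(abs(s) for s in signal)
    return sum(1 for s in signal if abs(s) >= peak - tol) / len(signal)


# Hypothetical stand-in for speech: a 200 Hz tone with a slow 3 Hz envelope,
# one second at an assumed narrowband sample rate of 8 kHz.
FS = 8000
original = [
    (0.1 + 0.9 * abs(math.sin(2 * math.pi * 3 * n / FS)))
    * math.sin(2 * math.pi * 200 * n / FS)
    for n in range(FS)
]

# Clip at 0.55 times the peak amplitude, the level quoted in the highlights.
clipped = clip(original, 0.55)

print("PAR original (dB):", round(peak_to_average_db(original), 2))
print("PAR clipped  (dB):", round(peak_to_average_db(clipped), 2))
print("Fraction at limit:", round(clipped_fraction(clipped), 4))
```

The numbers printed for this synthetic tone will not match those for real speech, whose peak-to-average ratio depends on the talker and material. The sketch is only meant to make the relationship between clip level and peak-to-average ratio concrete.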


Summary

Introduction

As digital communication has become more pervasive, the variety of channels for human speech communication has grown. Full-reference objective models, such as PESQ [2] and POLQA [3], predict speech quality by comparing a reference speech signal to a received signal and quantifying the difference between them. Such models can be applied to system optimisation but are constrained by the requirement to have access to the original signal, which is not always practical for real-time monitoring systems. The individual modules will be combined to produce an aggregate objective speech quality prediction score. The novelty of this approach over other NR models [4, 5, 6] is that each module provides a unidimensional quality index feeding into the overall metric but can also provide diagnostic information about the cause of the degradation for narrowband. The paper concludes with a description of the stages in the overall model development.

Amplitude Clipped Speech
Choppy Speech
Amplitude Clipped Speech Detection Model
Choppy Speech Detection Model
Stimuli
Model Comparison
Choppy Speech Detection Test
Conclusions and future work
