Perceptual non‐intrusive speech quality assessment using a self‐organizing map

Abdulhussain E Mahdi

doi:10.1108/17410390610645058

Abstract

Purpose: This paper proposes a new non‐intrusive method for assessing the speech quality of voice communication systems and evaluates its performance.

Design/methodology/approach: The method measures perception‐based objective auditory distances between the voiced parts of the output speech and appropriately matching references extracted from a pre‐formulated codebook. The codebook is formed by optimally clustering a large number of parametric speech vectors extracted from a database of clean speech records. The auditory distances are then mapped into equivalent subjective mean opinion scores (MOSs). The required clustering and matching are performed by an efficient data‐mining tool known as the self‐organizing map (SOM). The proposed method was examined using a wide range of distortions, including speech compression, wireless channel impairments, VoIP channel impairments, and signal modifications introduced by features such as automatic gain control (AGC).

Findings: The experimental results indicate that the proposed method predicts the actual subjective quality of the speech with a high level of accuracy. Specifically, the second version of the method, which is based on the use of Bark spectrum (BS) analysis, predicts the MOSs more accurately than its first and third versions (which are based on BS analysis and mel frequency cepstrum coefficients (MFCC), respectively), and outperforms the ITU‐T PESQ in a large number of test cases, particularly those related to distortion caused by channel impairments and signal level modifications.

Research limitations/implications: The developed prototype of the proposed objective speech quality measure is believed to be sufficiently accurate and robust against variations in speaker, utterance and distortion type. Nevertheless, there is still room for improvement. In general, three areas could be pursued: widening the coverage of speaker variations in the system's codebook; formulating and using a perceptual speech model that provides a truly speaker‐independent representation of speech; and implementing the proposed measure as a stand‐alone system, preferably for real‐time applications.

Practical implications: Being output‐based, the proposed method can be employed for monitoring and assessing telecommunications networks both under live traffic conditions and in off‐line evaluation.

Originality/value: The main contribution of this paper is the introduction of a new output‐based, non‐intrusive method for the assessment of speech quality that is sufficiently accurate and robust. To the best of the author's knowledge, no reliable output‐based objective speech quality assessment method has to date been reported or formally recognised.
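
As a rough illustration of the pipeline summarised under Design/methodology/approach (cluster clean‐speech feature vectors into a codebook with an SOM, measure the distance from each output frame to its best‐matching codebook entry, then map the average distance to a MOS), the following Python sketch may help. It is not the author's implementation. The one‐dimensional map topology, the Euclidean distance, the linear distance‐to‐MOS mapping with its `slope` and `intercept` values, and the synthetic three‐dimensional vectors standing in for real perceptual features are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(nodes, v):
    """Return (index, Euclidean distance) of the codebook node closest to v."""
    dists = [math.dist(w, v) for w in nodes]
    idx = min(range(len(nodes)), key=dists.__getitem__)
    return idx, dists[idx]


def train_som(vectors, n_nodes=4, epochs=50, seed=0):
    """Train a small 1-D SOM; its node weights act as the codebook."""
    rng = random.Random(seed)
    # Initialise every node from a randomly chosen training vector.
    nodes = [list(rng.choice(vectors)) for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = 0.5 * (1.0 - frac)                            # decaying learning rate
        sigma = max(0.5, (n_nodes / 2.0) * (1.0 - frac))   # shrinking neighbourhood
        for v in rng.sample(vectors, len(vectors)):        # shuffled pass over data
            bmu, _ = best_matching_unit(nodes, v)
            for i, w in enumerate(nodes):
                # Gaussian neighbourhood centred on the best-matching unit.
                h = math.exp(-((i - bmu) ** 2) / (2.0 * sigma ** 2))
                for k in range(len(w)):
                    w[k] += lr * h * (v[k] - w[k])
    return nodes


def mean_distance(nodes, frames):
    """Average distance from each frame to its best-matching codebook node."""
    return sum(best_matching_unit(nodes, f)[1] for f in frames) / len(frames)


def distance_to_mos(d, slope=-1.0, intercept=4.5):
    """Hypothetical linear mapping from distance to MOS, clipped to [1, 5]."""
    return min(5.0, max(1.0, intercept + slope * d))


# Synthetic stand-ins for feature vectors of voiced frames (assumption:
# real inputs would be perceptual features such as Bark spectrum or MFCC).
rng = random.Random(1)
clean = [[c + rng.gauss(0.0, 0.1) for _ in range(3)]
         for c in (0.0, 1.0) for _ in range(40)]
degraded = [[x + 3.0 for x in v] for v in clean[:10]]  # crude "distortion"

codebook = train_som(clean)
clean_mos = distance_to_mos(mean_distance(codebook, clean))
degraded_mos = distance_to_mos(mean_distance(codebook, degraded))
```

In this toy setting, frames that lie far from every codebook entry yield a larger average distance and therefore a lower predicted score. In practice, the mapping coefficients would have to be fitted against subjective MOS data rather than chosen by hand.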

Full Text