Abstract
Perceptual evaluation of the patient’s voice is the most commonly used method in everyday clinical practice. We propose an automatic approach for predicting the severity of some types of organic and functional dysphonia. By means of an unsupervised learning method, we demonstrate that acoustic parameters measured on different phonetic classes are suitable for modelling the four-grade assessments of the specialists (the subjective RBH scale from 0 to 3). In this study, the overall hoarseness (H) was examined. Four specialists were asked to rate the severity of dysphonia. A k-means cluster analysis was performed on each specialist’s decisions separately; the average accuracy of the four-grade classification was 0.46. The four-grade classification was surprisingly close to the subjective judgements. Moreover, automatic estimation of the severity of dysphonia was also performed. Linear regression and RBF kernel regression models were compared, with the average rating of the four specialists used as the target in the experiments. Low RMSE and high correlation values were obtained between the automatically predicted severity and the perceptual assessments. The best RMSE for H was 0.45 with the RBF kernel model; however, a simpler linear model provided the highest correlation value of 0.85, using only eight acoustic parameters.
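The regression comparison described above can be sketched as follows. This is an illustrative outline, not the authors’ implementation: the data are synthetic placeholders (eight random “acoustic parameters” and a severity target on the 0–3 RBH scale), and the model hyperparameters are assumptions.

```python
# Sketch: comparing linear regression and RBF kernel regression for
# severity prediction, scored by RMSE and Pearson correlation as in the
# study. All data here are synthetic; results will not match the paper.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_samples, n_features = 120, 8            # eight acoustic parameters
X = rng.normal(size=(n_samples, n_features))
# Synthetic severity target clipped to the RBH 0-3 range.
y = np.clip(X @ rng.normal(size=n_features) * 0.4 + 1.5
            + rng.normal(scale=0.3, size=n_samples), 0.0, 3.0)

for name, model in [("linear", LinearRegression()),
                    ("RBF kernel", KernelRidge(kernel="rbf", alpha=1.0, gamma=0.1))]:
    pred = cross_val_predict(model, X, y, cv=5)   # out-of-fold predictions
    rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
    corr = float(np.corrcoef(pred, y)[0, 1])
    print(f"{name}: RMSE={rmse:.2f}, r={corr:.2f}")
```

Cross-validated predictions are used so the RMSE and correlation reflect generalization rather than training fit, mirroring how such models are typically evaluated.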
Highlights
Dysphonia refers to the dysfunction in the ability to produce voice
In Tulics and Vicsi (2017) we demonstrated that these parameters, as well as the Soft Phonation Index (SPI) and Empirical Mode Decomposition (EMD) based frequency band ratios measured on different phonetic classes, correlate with the severity of dysphonia
The Soft Phonation Index (SPI) and Empirical Mode Decomposition (EMD) based frequency band ratios were measured on the voiced parts of speech, and the measured parameters were grouped into different phonetic classes
Summary
Dysphonia refers to the dysfunction in the ability to produce voice. Perceptually, dysphonia can be characterized by hoarse, breathy, harsh or rough vocal qualities, but some kind of phonation remains (Hirschberg et al. 2013). Acoustic measures are traditionally derived from sustained vowel samples; however, the analysis of continuous speech has several advantages over that of sustained vowels. Continuous speech contains variation of the fundamental frequency, pauses and phonation onsets, and it offers the opportunity to examine different variations of speech sounds. The most widely used acoustic parameters regarding dysphonia include jitter, shimmer and the Harmonics-to-Noise Ratio (HNR). Zhang and his colleagues in Zhang and Jiang (2008) found that jitter and shimmer statistically differentiate between normal and pathological sustained vowels but did not show such a significant difference between normal and pathological continuous speech. Our previous research has confirmed that acoustic parameters like jitter, shimmer, HNR and the first component (c1) of the mel-frequency cepstral coefficients (referred to as ‘mfcc01’) are useful in the automatic classification of healthy and pathological voices using continuous speech (Vicsi et al. 2011; Kazinczi et al. 2015; Grygiel et al. 2012).
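To make the jitter and shimmer parameters mentioned above concrete, the following is a minimal sketch of their common “local” definitions: the mean absolute difference between consecutive pitch periods (or peak amplitudes), normalized by the mean period (or amplitude). The input arrays are hypothetical per-cycle measurements that a pitch tracker would extract from a voiced segment; this is not the paper’s extraction pipeline.

```python
# Sketch of local jitter and shimmer on hypothetical per-cycle data.
import numpy as np

def local_jitter(periods):
    """Mean absolute difference of consecutive pitch periods,
    relative to the mean period."""
    p = np.asarray(periods, dtype=float)
    return float(np.mean(np.abs(np.diff(p))) / np.mean(p))

def local_shimmer(amplitudes):
    """Mean absolute difference of consecutive peak amplitudes,
    relative to the mean amplitude."""
    a = np.asarray(amplitudes, dtype=float)
    return float(np.mean(np.abs(np.diff(a))) / np.mean(a))

periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]   # ~100 Hz voice
amps    = [0.80, 0.78, 0.82, 0.79, 0.81]
print(f"jitter:  {local_jitter(periods):.4f}")
print(f"shimmer: {local_shimmer(amps):.4f}")
```

Larger values indicate greater cycle-to-cycle instability, which is why these measures are associated with perceived roughness and hoarseness.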