Abstract

Automatic voice pathology detection enables objective assessment of pathologies that affect the voice production mechanism. Detection systems have been developed using the traditional pipeline approach (consisting of the feature extraction part and the detection part) and using the modern deep learning-based end-to-end approach. Due to the lack of vast amounts of training data in the study area of pathological voice, the former approach is still a valid choice. In the existing detection systems based on the traditional pipeline approach, the mel-frequency cepstral coefficient (MFCC) features can be regarded as the de facto standard feature set. In this study, automatic voice pathology detection is investigated by comparing the performance of various MFCC variants derived by considering two factors: the input and the filterbank in the cepstrum computation. For the first factor, three inputs (the voice signal, the glottal source and the vocal tract) are compared. The glottal source and the vocal tract are estimated using the quasi-closed phase glottal inverse filtering method. For the second factor, the mel-frequency and linear-frequency filterbanks are compared. Experiments were conducted separately using six databases consisting of voices produced by speakers suffering from one of four disorders (dysphonia, Parkinson’s disease, laryngitis, or heart failure) and by healthy speakers. A support vector machine (SVM) was used as the classifier. The results show that by combining mel- and linear-frequency cepstral coefficients derived from the glottal source and vocal tract, better overall detection accuracy was obtained compared to the de facto MFCC features derived from the voice signal. Furthermore, this combination provided comparable or better performance than four existing cepstral feature extraction techniques in clean and high signal-to-noise ratio (SNR) conditions.
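The second factor in the abstract, the choice between a mel-frequency and a linear-frequency filterbank, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the frame length, the number of filters and coefficients, and the synthetic test frame are all assumptions chosen for illustration, and the quasi-closed phase inverse filtering that would supply the glottal source and vocal tract inputs is not shown. Only the filterbank spacing differs between the two variants.

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale formula (an assumption; the source does not state which one is used).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_filterbank(n_filters, n_fft, sr, scale="mel"):
    """Triangular filters whose edges are evenly spaced on the mel or the linear (Hz) scale."""
    nyquist = sr / 2.0
    if scale == "mel":
        edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(nyquist), n_filters + 2))
    elif scale == "linear":
        edges_hz = np.linspace(0.0, nyquist, n_filters + 2)
    else:
        raise ValueError("scale must be 'mel' or 'linear'")
    bin_freqs = np.linspace(0.0, nyquist, n_fft // 2 + 1)
    fb = np.zeros((n_filters, bin_freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i:i + 3]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def cepstral_coefficients(frame, sr, scale="mel", n_filters=26, n_ceps=13, n_fft=512):
    """Cepstral coefficients of one frame: power spectrum, filterbank, log, DCT-II."""
    windowed = frame * np.hamming(frame.size)
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    energies = triangular_filterbank(n_filters, n_fft, sr, scale) @ power
    log_energies = np.log(energies + 1e-10)  # small offset avoids log(0)
    n = np.arange(n_filters)
    k = np.arange(n_ceps)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))  # unnormalised DCT-II
    return dct_basis @ log_energies

# Hypothetical 25 ms frame at 16 kHz standing in for a voice, glottal source,
# or vocal tract signal.
sr = 16000
t = np.arange(400) / sr
frame = np.sin(2 * np.pi * 150 * t) + 0.5 * np.sin(2 * np.pi * 300 * t)

mfcc = cepstral_coefficients(frame, sr, scale="mel")
lfcc = cepstral_coefficients(frame, sr, scale="linear")
combined = np.concatenate([mfcc, lfcc])  # one possible way to combine the two variants
```

In a pipeline of the kind the abstract describes, per-frame vectors like `combined` would be summarised per recording and passed to an SVM classifier; how the study aggregates frames is not stated in this excerpt.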

Highlights

  • Voice pathologies arise either due to physical changes in the voice production mechanism [1], [2] or due to improper vocal use when the physical structure of the mechanism is normal [3]–[5]

  • In the automatic detection of voice pathology, traditional pipeline systems with separate feature extraction and classification stages are still a valid system architecture, despite the fact that modern end-to-end systems provide excellent detection accuracy

  • Many studies have been published on the automatic detection of voice pathologies using support vector machine (SVM)-based traditional pipeline systems [13], [16], [17], [58], [59]


Summary

Introduction

Voice pathologies arise either due to physical changes in the voice production mechanism (e.g., in the respiratory system, vocal folds, and vocal tract) [1], [2] or due to improper vocal use when the physical structure of the mechanism is normal (e.g., vocal fatigue or ventricular phonation) [3]–[5]. Voice pathology may indicate early neurodegenerative disease such as Parkinson’s disease (PD) [10]–[12], [14]. Existing voice pathology detection systems can be divided into two categories: traditional pipeline systems and modern end-to-end systems [15]. The traditional pipeline system consists of two components [15], [16]: the feature extraction part and the detection part. The feature extraction part tries to capture discriminative information from acoustic voice signal waveforms by representing this information in compressed forms using a set of pre-defined features. The feature sets reported in the literature for voice pathology detection can be grouped into four categories: (1) perturbation measures (such as jitter and shimmer); (2) spectral and cepstral measures (such as mel-frequency cepstral coefficients (MFCC), linear predic-

