Abstract

Diplophonia is a type of disordered voice in which two simultaneous pitches are perceived. Most commonly in diplophonic voices, the vocal folds are divided into two parts that vibrate at different frequencies. The glottal area is the projected area of the space between the vocal folds. The glottal area as a function of time is referred to as the glottal area waveform (GAW). The GAW is modeled for diplophonic voice by superimposing two partial GAWs (pGAWs) that are trains of single-peak pulses with different pulse frequencies, i.e., fundamental frequencies ($f_o$s). In current kinematic models of diplophonic vocal fold vibration, the pGAWs are assumed to be quasiperiodic. This assumption is relaxed here by modulating pulse-to-pulse cycle length and amplitude. Both random and deterministic modulations are considered. Deterministic modulations depend on the difference of the pGAWs' instantaneous phases. Model GAWs are fitted to input GAWs using an analysis-by-synthesis approach which we refer to as 'modulated pulse trains decomposition' (MPD). MPD is shown to be applicable to diplophonic as well as to nondiplophonic types of dysphonia, which include multi-pulse patterns, random timing behaviours, and chaos. It is mostly robust against modulations but degraded by large random modulations. MPD is compared to a deep autoencoder neural network and the WaveGlow neural network. In terms of time-domain fitting errors, MPD outperforms the other two approaches unless random modulations are large. MPD outperforms the best of the other two approaches by up to approximately 5 dB. For large random modulations, the deep autoencoder network achieves the smallest fitting errors. In terms of magnitude spectrum fitting errors, WaveGlow is superior except for natural input GAWs containing only nondiplophonic types of dysphonia. MPD is also shown to be advantageous in terms of pulse timing errors.
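
The superposition of two modulated pulse trains described above may be illustrated by the following minimal sketch. The raised-cosine pulse shape, the sampling rate, the modulation sizes, and the names `pulse_train` and `diplophonic_gaw` are assumptions made for illustration only and are not taken from the paper. Only random modulation of cycle length and amplitude is shown; the deterministic, phase-difference-dependent modulation is omitted.

```python
import numpy as np

def pulse_train(fo, fs, n_samples, jitter=0.0, shimmer=0.0, rng=None):
    """Train of single-peak pulses (assumed raised-cosine shape) at pulse frequency fo.

    jitter  : std. dev. of the random cycle-length perturbation, relative to 1/fo
    shimmer : std. dev. of the random pulse-height perturbation, relative to 1
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.zeros(n_samples)
    period = fs / fo                      # nominal cycle length in samples
    start = 0.0
    while start < n_samples:
        # Random pulse-to-pulse modulation of cycle length and amplitude.
        length = max(period * (1.0 + jitter * rng.standard_normal()), 2.0)
        height = max(1.0 + shimmer * rng.standard_normal(), 0.0)
        first = int(np.ceil(start))
        last = min(int(np.ceil(start + length)), n_samples)
        idx = np.arange(first, last)
        phase = (idx - start) / length    # position within the cycle, 0 <= phase < 1
        out[idx] = height * 0.5 * (1.0 - np.cos(2.0 * np.pi * phase))
        start += length
    return out

def diplophonic_gaw(fo1, fo2, fs=4000, duration=0.5, **kwargs):
    """Model GAW as the sum of two partial GAWs with different pulse frequencies."""
    n = int(fs * duration)
    return pulse_train(fo1, fs, n, **kwargs) + pulse_train(fo2, fs, n, **kwargs)
```

For example, `diplophonic_gaw(100, 150, jitter=0.02, shimmer=0.05)` would yield a waveform whose two components drift against each other in phase, which is the situation the abstract associates with two perceived pitches.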

Highlights

  • Disorders of the human voice associated with degraded voice quality and disability to talk normally may result in reduced job opportunities, loss of quality of life, and even social [...]. (Manuscript received August 28, 2020; revised November 23, 2020 and January 14, 2021; accepted January 17, 2021.)

  • Up to two candidate partial GAWs (pGAWs) are added together and compared to the input glottal area waveform (GAW). The pGAW candidates selected are those that make up the model GAW that is optimal in a least-squares sense

  • For synthetic input GAWs, Le is reported with regard to the sizes of deterministic and random modulations
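
The candidate selection mentioned in the highlights can be sketched as a search over sums of up to two candidate pGAWs, keeping the combination with the smallest squared error against the input GAW. The exhaustive search and the name `select_candidates` are assumptions for illustration; how the candidates are generated and searched in the paper is not stated here.

```python
from itertools import combinations

import numpy as np

def select_candidates(input_gaw, candidates, max_parts=2):
    """Return (indices, squared_error) of the best-fitting sum of up to max_parts candidates."""
    input_gaw = np.asarray(input_gaw, dtype=float)
    best_idx, best_err = None, np.inf
    for k in range(1, max_parts + 1):
        for idx in combinations(range(len(candidates)), k):
            # Model GAW: superposition of the chosen candidate pGAWs.
            model = sum(np.asarray(candidates[i], dtype=float) for i in idx)
            err = float(np.sum((input_gaw - model) ** 2))
            if err < best_err:
                best_idx, best_err = idx, err
    return best_idx, best_err
```

With `n` candidates this evaluates `n + n(n-1)/2` model GAWs, which is tractable for small candidate sets; a larger set would call for some pruning strategy.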


Summary

INTRODUCTION

Disorders of the human voice associated with degraded voice quality and disability to talk normally may result in reduced job opportunities, loss of quality of life, and even social [...]. Diplophonia is a type of voice in which two simultaneous pitches are auditorily perceived. It may be a sign of a voice disorder and is most commonly caused by the spatial splitting of the vocal folds into two parts, each of which vibrates at a different fundamental frequency ($f_o$). Modulation analysis is included in a separate step that refines the model GAW, where the time instances and magnitudes of the pGAW pulses' maxima are optimized systematically pulse-by-pulse. The optimal time instances and magnitudes of the maxima are found by minimizing the time-domain fitting error.
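
The pulse-by-pulse refinement described above can be sketched as a coordinate-wise grid search: each pulse maximum is shifted in time and scaled in magnitude in turn, and the setting with the smallest time-domain squared error is kept. The raised-cosine pulse shape, the fixed pulse width, the search grids, and the names `render` and `refine` are assumptions for illustration and do not reproduce the paper's optimizer.

```python
import numpy as np

def render(times, heights, width, n):
    """Sum of raised-cosine pulses with maxima at `times` (in samples)."""
    t = np.arange(n)[:, None]
    x = (t - np.asarray(times, dtype=float)[None, :]) / width
    shape = np.where(np.abs(x) < 0.5, 0.5 * (1.0 + np.cos(2.0 * np.pi * x)), 0.0)
    return shape @ np.asarray(heights, dtype=float)

def refine(input_gaw, times, heights, width, shifts, scales):
    """Refine pulse maxima one pulse at a time by minimizing the squared error."""
    input_gaw = np.asarray(input_gaw, dtype=float)
    times = [float(v) for v in times]
    heights = [float(v) for v in heights]
    n = len(input_gaw)
    for k in range(len(times)):
        t0, h0 = times[k], heights[k]
        best_err = np.sum((input_gaw - render(times, heights, width, n)) ** 2)
        best_t, best_h = t0, h0
        for dt in shifts:            # candidate time shifts of the maximum
            for s in scales:         # candidate scalings of the magnitude
                times[k], heights[k] = t0 + dt, h0 * s
                err = np.sum((input_gaw - render(times, heights, width, n)) ** 2)
                if err < best_err:
                    best_err, best_t, best_h = err, times[k], heights[k]
        times[k], heights[k] = best_t, best_h   # keep the best setting for pulse k
    return times, heights
```

Because each pulse is optimized with the others held fixed, a single pass is exact only when pulses do not overlap; overlapping pulses from the two pGAWs would likely need repeated passes.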

RELATED WORK ON VOCAL FOLD MODELING AND SPEECH SYNTHESIS
CORPORA
Synthesis
ANALYSIS-BY-SYNTHESIS
Candidate Selection
Modulation of pGAW Pulse Times and Heights
Deep Autoencoder
WaveGlow
RESULTS
DISCUSSION AND CONCLUSION