Abstract

This paper presents an enhancement system for early-stage Spanish Esophageal Speech (ES) vowels. The system decomposes the input ES into neoglottal waveform and vocal tract filter components using Iterative Adaptive Inverse Filtering (IAIF). The neoglottal waveform is further decomposed into fundamental frequency (F0), Harmonic-to-Noise Ratio (HNR), and neoglottal source spectrum. The enhanced neoglottal source signal is constructed using a natural glottal flow pulse computed from real speech, and the F0 and HNR are replaced with those of natural speech. The vocal tract formant frequencies (spectral peaks) and bandwidths are smoothed, the formants are shifted downward using a second-order frequency warping polynomial, and the bandwidths are increased to bring them closer to those of natural speech. The system is evaluated using subjective listening tests on the Spanish ES vowels /a/, /e/, /i/, /o/, /u/. The Mean Opinion Score (MOS) shows a significant improvement in the overall quality (naturalness and intelligibility) of the vowels.

Index Terms: speech enhancement, glottal flow, analysis synthesis, vocal tract, spectral sharpening, warping

Highlights

  • The removal of the larynx after a Total Laryngectomy (TL) changes the speech production mechanism

  • In order to deal with these deficiencies, this paper proposes an Esophageal Speech (ES) enhancement method based on the GlottHMM single pulse synthesis [15, 16, 17]

  • The vocal tract spectrum of ES has the following characteristics: (i) higher frequencies are emphasized more than lower frequencies, (ii) spectral resonances are moved to higher frequencies, and (iii) resonance bandwidths are reduced in comparison to normal speech vowels


Summary

Introduction

The removal of the larynx after a Total Laryngectomy (TL) changes the speech production mechanism. Compared to the production of normal speech according to the source-filter model [1], the voicing source in ES is severely altered and does not have any fundamental frequency or harmonic components. ES can be enhanced by transforming the source and filter components to those of normal speech using signal processing algorithms. In [7], the source and filter components were modified by replacing the source with the LF model and increasing the bandwidths of the filter formants to improve speech quality. The vocal tract formants are typically considered to be the same as in normal speech signals. The spectral peaks of the vocal tract filter are moved to lower frequencies in order to compensate for the upward shift of the formants in ES.
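
As an illustration of the vocal tract modification described above, the following minimal sketch shifts a single formant downward with a second-order warping polynomial and widens its bandwidth, working on one pole of an all-pole vocal tract filter. It is not the paper's implementation: the polynomial form g(f) = f - alpha * f * (1 - f / f_nyq), the parameters alpha and bw_scale, and all function names are assumptions chosen for illustration, since this summary does not give the actual polynomial coefficients or the amount of bandwidth increase.

```python
import cmath
import math


def warp_frequency(f_hz, fs_hz, alpha):
    """Hypothetical second-order warping polynomial.

    g(f) = f - alpha * f * (1 - f / f_nyq) keeps 0 Hz and the Nyquist
    frequency fixed. For 0 < alpha < 1 it is monotonic and moves every
    frequency in between downward.
    """
    f_nyq = fs_hz / 2.0
    return f_hz - alpha * f_hz * (1.0 - f_hz / f_nyq)


def pole_from_formant(freq_hz, bw_hz, fs_hz):
    """Map a formant (center frequency, bandwidth) to a z-plane pole."""
    radius = math.exp(-math.pi * bw_hz / fs_hz)
    angle = 2.0 * math.pi * freq_hz / fs_hz
    return cmath.rect(radius, angle)


def formant_from_pole(pole, fs_hz):
    """Inverse of pole_from_formant: return (frequency, bandwidth) in Hz."""
    freq_hz = abs(cmath.phase(pole)) * fs_hz / (2.0 * math.pi)
    bw_hz = -math.log(abs(pole)) * fs_hz / math.pi
    return freq_hz, bw_hz


def modify_pole(pole, fs_hz, alpha, bw_scale):
    """Shift the pole's formant downward and scale its bandwidth.

    alpha and bw_scale are illustrative tuning parameters (assumptions),
    not values taken from the paper.
    """
    freq_hz, bw_hz = formant_from_pole(pole, fs_hz)
    new_freq = warp_frequency(freq_hz, fs_hz, alpha)
    new_bw = bw_hz * bw_scale
    new_pole = pole_from_formant(new_freq, new_bw, fs_hz)
    # Preserve the sign of the original angle so that the two poles of a
    # complex-conjugate pair remain conjugates after modification.
    if cmath.phase(pole) < 0:
        new_pole = new_pole.conjugate()
    return new_pole
```

For example, with a sampling rate of 16 kHz, calling modify_pole on the pole of a 900 Hz formant with a 60 Hz bandwidth, using alpha = 0.2 and bw_scale = 1.5, yields a pole whose formant sits lower in frequency and has a wider bandwidth. Because bw_scale > 1 only shrinks the pole radius, a stable filter stays stable under this modification.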

System Description
GlottHMM based analysis
Neoglottal source signal enhancement
Vocal tract modification by nonlinear frequency warping
Synthesis of enhanced speech
System Evaluation
Original
Conclusion