Abstract

Speech production can be regarded as a process where a time-varying vocal tract system (filter) is excited by a time-varying excitation. In addition to its linguistic message, the speech signal also carries information about, for example, the gender and age of the speaker. Moreover, the speech signal includes acoustical cues about several speaker traits, such as the emotional state and the state of health of the speaker. In order to understand the production of these acoustical cues by the human speech production mechanism and utilize this information in speech technology, it is necessary to extract features describing both the excitation and the filter of the human speech production mechanism. While the methods to estimate and parameterize the vocal tract system are well established, the excitation appears less studied. This article provides a review of signal processing approaches used for the extraction of excitation information from speech. This article highlights the importance of excitation information in the analysis and classification of phonation type and vocal emotions, in the analysis of nonverbal laughter sounds, and in studying pathological voices. Furthermore, recent developments of deep learning techniques in the context of extraction and utilization of the excitation information are discussed.

Highlights

  • Speech is the most sophisticated means of communication among people

  • The values of F0, glottal closure instant (GCI), and glottal opening instant (GOI) extracted from the differentiated electroglottography (dEGG) signal are used as the ground truth in evaluating the corresponding features extracted from the acoustic speech signal

  • Closed phase (CP) analysis computes the vocal tract transfer function with linear prediction (LP) using the covariance criterion, which is evaluated on speech samples in the CP of the glottal cycle
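
As a rough illustration of the CP analysis mentioned in the last highlight, the sketch below fits LP coefficients by the covariance criterion (a least-squares fit over the frame, with no windowing) and then inverse filters the signal. It is a minimal sketch, not the article's implementation: the function names `covariance_lp` and `inverse_filter` are hypothetical, the closed-phase sample range is assumed to be known already (for example from GCI and GOI estimates), and the synthetic two-pole signal stands in for real speech.

```python
import numpy as np

def covariance_lp(frame, order):
    """Estimate LP coefficients a_1..a_p with the covariance criterion.

    Minimizes sum_{n=p}^{N-1} (s[n] - sum_k a_k s[n-k])^2 over the frame,
    which is assumed here to contain closed-phase samples only.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    if n <= 2 * order:
        raise ValueError("frame too short for the requested LP order")
    # Row i holds the `order` past samples that predict frame[order + i].
    past = np.column_stack(
        [frame[order - k : n - k] for k in range(1, order + 1)]
    )
    target = frame[order:]
    coeffs, *_ = np.linalg.lstsq(past, target, rcond=None)
    return coeffs

def inverse_filter(signal, coeffs):
    """Apply A(z) = 1 - sum_k a_k z^-k to obtain the prediction residual."""
    a_poly = np.concatenate(([1.0], -np.asarray(coeffs, dtype=float)))
    return np.convolve(signal, a_poly)[: len(signal)]

# Synthetic stand-in for one glottal cycle: a two-pole "vocal tract"
# excited by a single impulse at n = 0 and left unexcited afterwards.
true_coeffs = np.array([1.3, -0.6])
s = np.zeros(40)
s[0] = 1.0
s[1] = true_coeffs[0] * s[0]
for i in range(2, len(s)):
    s[i] = true_coeffs[0] * s[i - 1] + true_coeffs[1] * s[i - 2]

closed_phase = s[2:30]            # assumed closed-phase samples
estimated = covariance_lp(closed_phase, order=2)
residual = inverse_filter(s, estimated)
```

Because the covariance criterion does not window the frame, a short stretch of samples free of excitation is enough to recover the filter in this idealized case, and the residual collapses to the single exciting impulse. With real speech the closed phase is short and its boundaries are uncertain, so the estimate is sensitive to where the frame is placed.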

Summary

INTRODUCTION

Speech is the most sophisticated means of communication among people. The carrier of speech is the acoustic speech pressure signal. Studies have shown that understanding the excitation component helps in generating acoustical cues of different voice qualities [11]–[13] and vocal emotions [14]–[20], as well as in the production of different paralinguistic and nonverbal sounds [21]–[23]. In addition to F0, one of the most important features is the strong impulse-like component that is present in each cycle of the glottal flow waveform in the production of voiced speech. This impulse-like component is caused by the sudden deceleration of the air flow in the vicinity of the GCI due to adduction of the vocal folds. The list of abbreviations used in this article is given in Nomenclature.

HUMAN SPEECH PRODUCTION MECHANISM
EXTRACTION OF EXCITATION INFORMATION FROM SPEECH SIGNALS
Extraction of Excitation Information Using GIF
Extraction of F0
Extraction of GCI
Extraction of GOI
UTILIZATION OF EXCITATION INFORMATION IN DIFFERENT AREAS OF SPEECH RESEARCH
Study of Phonation Types
Study of Vocal Emotions
Study of Laughter Sounds
Study of Pathological Voices
RECENT TRENDS IN EXTRACTION AND UTILIZATION OF EXCITATION INFORMATION
Deep Learning for GIF and for Extraction of F0 and GCI