Psychological Stress Detection in Speech Using Return-to-opening Phase Ratios in Glottis

Miroslav Stanek, Milan Sigmund

doi:10.5755/j01.eie.21.5.13336

Abstract

This paper investigates psychological stress in the speech signal using the shapes of normalised glottal pulses. The pulses were estimated by two algorithms: Direct Inverse Filtering and Iterative and Adaptive Inverse Filtering. Normalised glottal pulses are divided into an opening and a return phase, and a feature vector characterizing each glottal pulse is calculated for a series of n-percent intervals in the time domain. Each feature vector consists of parameters describing the return-to-opening phase ratio of the pulse, namely the chosen intervals, kurtosis, skewness, and area. Psychological stress is then detected using the feature vector and four different classifiers. Experimental results show that the best accuracy, approaching 95 %, is reached with the Gaussian Mixture Models classifier. All of the best results were obtained using only the interval of 5 % of both phase durations, i.e. before and after the pulse peak, where the most significant differences between normal and stressed speech occur in the feature vector. The presented experiments were performed on our own speech database containing both real stressed speech and normal speech. DOI: http://dx.doi.org/10.5755/j01.eee.21.5.13336

Highlights

The first application of glottal pulses can be found in speech synthesis, where a precise understanding of glottal pulses and their estimation leads to high-quality synthetic speech
Another method of speech synthesis was introduced by Raitio et al. [2], where Hidden Markov Models (HMM) and Iterative and Adaptive Inverse Filtering (IAIF) are used to produce subjectively highly natural synthetic speech
A similar HMM-based speech synthesizer, based on the Liljencrants-Fant (LF) model of the glottal flow, was published by Cabral et al. [3]

Summary

INTRODUCTION

The first application of glottal pulses can be found in speech synthesis, where a precise understanding of glottal pulses and their estimation leads to high-quality synthetic speech. High-quality speech can be produced by the GSS method using a suitable combination of a mixed excitation model and noise components. Another application field of glottal pulses is so-called expressive speech processing, used for expressing emotions, dynamic and varying voice quality and articulation during. In 1980, Laver [5] described dynamic changes that depend on the phonation type, specifically on the glottal source signal. The use of a suitable combination of prosodic and glottal features for emotion recognition is described in [7], where Support Vector Machine (SVM), Artificial Neural Network and Gaussian Mixture Models (GMM) classifiers were applied to the Berlin emotional speech database. Glottal features have also been tested experimentally for speaker recognition, for instance Glottal Flow Cepstrum Coefficients [13] and a vocal source model [14]. A survey of glottal source processing and its applications was written by Drugman et al. [20].

MINING THE GLOTTAL PULSES

GLOTTAL FEATURE EXTRACTION
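
The abstract describes the feature vector only in words: each normalised pulse is split at its peak into an opening and a return phase, and return-to-opening ratios of interval, kurtosis, skewness, and area are taken over n percent of each phase next to the peak. The following Python sketch illustrates one possible reading of that description. The function name `return_to_opening_features`, the helper names, the use of sample moments of the amplitude values for kurtosis and skewness, the trapezoidal area, and the interval measured in samples are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def _moments(x):
    """Return (skewness, kurtosis) of the amplitude values in x (population form)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    if m2 == 0.0:
        return 0.0, 0.0  # flat segment: moments are undefined, report zeros
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


def _area(x):
    """Trapezoidal area under x, assuming unit spacing between samples."""
    return sum((a + b) / 2.0 for a, b in zip(x, x[1:]))


def _ratio(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den != 0 else float("nan")


def return_to_opening_features(pulse, percent=5.0):
    """Hypothetical return-to-opening ratios for one normalised glottal pulse.

    pulse   -- list of amplitude samples covering exactly one pulse
    percent -- share of each phase duration taken next to the peak
    """
    peak = max(range(len(pulse)), key=pulse.__getitem__)
    if peak == 0 or peak == len(pulse) - 1:
        raise ValueError("pulse needs samples on both sides of its peak")

    opening = pulse[: peak + 1]  # pulse start up to and including the peak
    returning = pulse[peak:]     # peak to the end of the pulse

    # Samples covering `percent` of each phase (assumed minimum of 2).
    n_open = max(2, math.ceil(len(opening) * percent / 100.0))
    n_ret = max(2, math.ceil(len(returning) * percent / 100.0))
    seg_open = opening[-n_open:]  # interval just before the peak
    seg_ret = returning[:n_ret]   # interval just after the peak

    skew_open, kurt_open = _moments(seg_open)
    skew_ret, kurt_ret = _moments(seg_ret)
    return {
        "interval": _ratio(n_ret, n_open),
        "kurtosis": _ratio(kurt_ret, kurt_open),
        "skewness": _ratio(skew_ret, skew_open),
        "area": _ratio(_area(seg_ret), _area(seg_open)),
    }
```

Calling `return_to_opening_features(pulse, percent=5.0)` on a list of samples for a single pulse returns a dictionary with the four ratios. For short pulses a 5 % interval spans only a few samples, so the higher-order moments become unreliable and the skewness ratio in particular may be NaN or numerically unstable; this is a limitation of the sketch, not a statement about the authors' method.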

REAL STRESS DATABASE

EXPERIMENTAL RESULTS

CONCLUSIONS