Whispered Speech Detection Using Glottal Flow-Based Features

Khomdet Phapatanaburi, Prawit Buayai, Longbiao Wang, Peerapong Uthansakul, Wongsathon Pathonsuwan, Talit Jumphoo, Patikorn Anchuen, Monthippa Uthansakul

doi:10.3390/sym14040777

Abstract

Recent studies have reported that Automatic Speech Recognition (ASR) systems designed for normal speech perform notably worse when they are evaluated on whispered speech. Detecting whispered speech is therefore useful for reducing the mismatch between training and testing conditions. This paper proposes two new Glottal Flow (GF)-based features for whispered speech detection: the GF-based Mel-Frequency Cepstral Coefficient (GF-MFCC), a magnitude-based feature, and the GF-based relative phase (GF-RP), a phase-based feature. The main contribution of the proposed features is that they extract magnitude and phase information from the GF signal. For GF-MFCC, Mel-frequency cepstral coefficient (MFCC) extraction is modified so that its input is the GF signal estimated by iterative adaptive inverse filtering (IAIF) rather than the raw speech signal. Similarly, GF-RP modifies relative phase (RP) feature extraction by using the GF signal instead of the raw speech signal. Because whispered speech is produced with a lower-amplitude glottal source than normal speech, its Discrete Fourier Transform (DFT) yields lower magnitude and phase information, which distinguishes it from normal speech. We therefore hypothesize that both proposed features are useful for whispered speech detection. In addition, feature-level and score-level combinations built on the individual GF-MFCC and GF-RP features are proposed to further improve detection performance. The proposed features and combinations are evaluated on the CHAIN corpus. GF-MFCC outperforms MFCC, and GF-RP outperforms RP. Further improvements are obtained with the feature-level combinations of MFCC and GF-MFCC (MFCC&GF-MFCC) and of RP and GF-RP (RP&GF-RP) compared with using either feature alone. Finally, the score-level combination of MFCC&GF-MFCC and RP&GF-RP gives the best results, with a frame-level accuracy of 95.01% and an utterance-level accuracy of 100%.
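
The abstract gives no implementation details, so the following is only a minimal sketch of the pipeline it describes, not the authors' code. Several assumptions apply. The glottal flow is estimated with a simplified single-pass LPC inverse filter followed by a leaky integrator; this is a stand-in for full IAIF, which iterates this kind of procedure with separate glottal and vocal-tract models. The relative phase uses a commonly cited normalization, θ̃(ω) = θ(ω) − (ω/ω_b)·θ(ω_b), mapped to (cos, sin), with an assumed base frequency of 1000 Hz and 12 low-frequency bins. Frame length, hop, LPC order, filterbank size and the fusion weight `alpha` are hypothetical. The per-frame scores passed to the fusion step are assumed to come from separately trained classifiers, which are not shown. All function names are illustrative.

```python
import numpy as np


def lpc(x, order):
    """LPC polynomial [1, a1, ..., ap] from the autocorrelation normal equations."""
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    # Toeplitz autocorrelation matrix with a tiny ridge for numerical stability.
    R = r[idx] + (1e-9 * r[0] + 1e-12) * np.eye(order)
    return np.concatenate(([1.0], np.linalg.solve(R, -r[1 : order + 1])))


def estimate_glottal_flow(frame, fs, rho=0.99):
    """Simplified single-pass inverse filtering: a stand-in for full IAIF."""
    order = 2 + int(fs) // 1000  # rule-of-thumb vocal-tract LPC order (assumption)
    pre = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])  # pre-emphasis
    a = lpc(pre * np.hamming(len(pre)), order)
    residual = np.convolve(frame, a)[: len(frame)]  # A(z) removes vocal-tract resonances
    gf, acc = np.empty(len(frame)), 0.0
    for n, e in enumerate(residual):  # leaky integrator undoes lip-radiation differentiation
        acc = e + rho * acc
        gf[n] = acc
    return gf


def mel_filterbank(n_filters, nfft, fs):
    """Triangular filters equally spaced on the mel scale."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz2mel(0.0), hz2mel(fs / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel2hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fb[i - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    return fb


def mfcc(frame, fs, nfft=512, n_filters=26, n_ceps=13):
    """Hamming window -> power spectrum -> log mel energies -> DCT-II."""
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft)) ** 2
    log_mel = np.log(mel_filterbank(n_filters, nfft, fs) @ power + 1e-10)
    m = np.arange(n_filters)
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None] * (m + 0.5) / n_filters)
    return dct @ log_mel


def relative_phase(frame, fs, nfft=512, base_hz=1000.0, n_bins=12):
    """Phase normalized to the base-frequency phase, mapped to (cos, sin)."""
    theta = np.angle(np.fft.rfft(frame, nfft))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    b = int(np.argmin(np.abs(freqs - base_hz)))
    k = np.arange(1, n_bins + 1)  # low-frequency bins used as features (assumption)
    rel = theta[k] - (freqs[k] / freqs[b]) * theta[b]
    return np.concatenate([np.cos(rel), np.sin(rel)])


def combined_features(signal, fs, frame_len=400, hop=160):
    """Per-frame feature-level combinations MFCC&GF-MFCC and RP&GF-RP (concatenation)."""
    mag, phase = [], []
    for s in range(0, len(signal) - frame_len + 1, hop):
        raw = signal[s : s + frame_len]
        gf = estimate_glottal_flow(raw, fs)
        mag.append(np.concatenate([mfcc(raw, fs), mfcc(gf, fs)]))
        phase.append(np.concatenate([relative_phase(raw, fs), relative_phase(gf, fs)]))
    return np.array(mag), np.array(phase)


def fuse_scores(s_mag, s_phase, alpha=0.5):
    """Linear score-level fusion of per-frame scores; alpha is a hypothetical weight."""
    return alpha * np.asarray(s_mag, float) + (1.0 - alpha) * np.asarray(s_phase, float)


def utterance_is_whispered(frame_scores, threshold=0.0):
    """Hypothetical rule: average fused frame scores; above threshold means whispered."""
    return bool(np.mean(frame_scores) > threshold)
```

In this sketch, the `mag` and `phase` matrices would be fed to two separately trained classifiers to obtain the per-frame scores that `fuse_scores` combines. The abstract does not name the classifier. Averaging the fused frame scores is just one simple way to reach an utterance-level decision; the abstract does not say how the authors aggregate frames.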

Journal: Symmetry
Publication Date: Apr 8, 2022
Citations: 5
License type: CC BY 4.0

Similar Papers

Significance of relative phase features for shouted and normal speech classification
Khomdet Phapatanaburi ... Peerapong Uthansakul
EURASIP Journal on Audio, Speech, and Music Processing | VOL. 2024
06 Jan 2024

Power Spectrum Difference Teager Energy Features for Speech Recognition in Noisy Environment
N S Nehe ... R.S Holambe
01 Dec 2008

Whispered Speech Conversion Based on the Inversion of Mel Frequency Cepstral Coefficient Features
Qiang Zhu ... Yunfeng Dou
Algorithms | VOL. 15
20 Feb 2022

Exploiting Variable length Teager Energy Operator in melcepstral features for person recognition from humming
Maulik C Madhavi ... Hemant A Patil
01 Sep 2014