Significance of parametric spectral ratio methods in detection and recognition of whispered speech

Arpit Mathur, Rajesh M Hegde, Shankar M Reddy

doi:10.1186/1687-6180-2012-157

Open Access

Abstract

In this article, the significance of a new parametric spectral ratio method that can be used to detect whispered speech segments within normally phonated speech is described. Adaptation methods based on maximum likelihood linear regression (MLLR) are then used to realize a mismatched train-test style speech recognition system. The proposed parametric spectral ratio method computes a ratio spectrum of the linear prediction (LP) and the minimum variance distortion-less response (MVDR) methods. The smoothed ratio spectrum is then used to detect whispered segments of speech within neutral speech segments effectively. The proposed LP-MVDR ratio method exhibits robustness at different SNRs, as indicated by the whisper diarization experiments conducted on the CHAINS corpus and the cell phone whispered speech corpus. The proposed method also performs reasonably better than the conventional methods for whisper detection. In order to integrate the proposed whisper detection method into a conventional speech recognition engine with minimal changes, adaptation methods based on MLLR are used herein. The hidden Markov models corresponding to neutral mode speech are adapted to the whispered mode speech data in the whispered regions as detected by the proposed ratio method. The performance of this method is first evaluated on whispered speech data from the CHAINS corpus. The second set of experiments is conducted on the cell phone corpus of whispered speech. This corpus is collected using a setup that is used commercially for handling public transactions. The proposed whispered speech recognition system exhibits reasonably better performance when compared to several conventional methods. The results shown indicate the possibility of a whispered speech recognition system for cell phone based transactions.
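The ratio spectrum mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the direction of the ratio (LP over MVDR), the model order, the Hamming window, the moving-average smoother, and all function names below are assumptions made for illustration. The LP spectrum is taken as the prediction error power divided by |A(e^{jw})|^2, and the MVDR spectrum as 1 / (v^H R^{-1} v), both computed from the same autocorrelation lags.

```python
import numpy as np

def autocorrelation(frame, order):
    """Biased autocorrelation lags r[0..order] of one frame."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    return np.array([np.dot(frame[: n - k], frame[k:]) / n
                     for k in range(order + 1)])

def toeplitz_matrix(r):
    """Symmetric Toeplitz matrix whose first column is r."""
    r = np.asarray(r, dtype=float)
    idx = np.abs(np.subtract.outer(np.arange(len(r)), np.arange(len(r))))
    return r[idx]

def lp_spectrum(r, n_freq=256):
    """All-pole LP power spectrum on n_freq points in [0, pi]."""
    r = np.asarray(r, dtype=float)
    p = len(r) - 1
    # Normal equations for A(z) = 1 + a_1 z^-1 + ... + a_p z^-p.
    a = np.linalg.solve(toeplitz_matrix(r[:p]), -r[1:])
    err = r[0] + np.dot(a, r[1:])          # prediction error power
    omega = np.linspace(0.0, np.pi, n_freq)
    basis = np.exp(-1j * np.outer(omega, np.arange(p + 1)))
    A = basis @ np.concatenate(([1.0], a))
    return err / np.abs(A) ** 2

def mvdr_spectrum(r, n_freq=256):
    """MVDR power spectrum 1 / (v^H R^-1 v) on n_freq points in [0, pi]."""
    r = np.asarray(r, dtype=float)
    m = len(r) - 1
    omega = np.linspace(0.0, np.pi, n_freq)
    V = np.exp(1j * np.outer(np.arange(m + 1), omega))  # steering vectors
    RinvV = np.linalg.solve(toeplitz_matrix(r), V)
    return 1.0 / np.real(np.sum(np.conj(V) * RinvV, axis=0))

def lp_mvdr_ratio(frame, order=12, n_freq=256, smooth=5):
    """Smoothed LP/MVDR ratio spectrum of one frame (assumed definition)."""
    frame = np.asarray(frame, dtype=float)
    r = autocorrelation(frame * np.hamming(len(frame)), order)
    ratio = lp_spectrum(r, n_freq) / mvdr_spectrum(r, n_freq)
    kernel = np.ones(smooth) / smooth      # moving average, an assumption
    return np.convolve(ratio, kernel, mode="same")
```

Calling `lp_mvdr_ratio(frame)` on a single speech frame returns one smoothed ratio value per frequency bin. How those values are turned into a whisper/neutral decision is not specified in this excerpt, so the sketch stops at the ratio spectrum.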

Highlights

Speech has been one of the most primitive modes of communication between all higher forms of life.
To calculate the whisper diarization error rate (WDER), an individual threshold is set for each method such that all methods give an equal true positive rate (TPR); the false positive rate (FPR) at this TPR is then calculated from the receiver operating characteristic (ROC) curve.
The linear prediction (LP)-minimum variance distortion-less response (MVDR) ratio method shows a sharp rise in TPR with little rise in FPR in all cases.
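The threshold-matching step described in the highlights (fix a common TPR, then read off each method's FPR from its ROC curve) can be sketched as follows. This is an illustrative sketch only: the function name `fpr_at_tpr`, the per-frame score and label inputs, and the rule of picking the highest threshold that reaches the target TPR are assumptions, and the WDER formula itself is not given in this excerpt.

```python
def fpr_at_tpr(scores, labels, target_tpr):
    """Return (threshold, tpr, fpr) at the highest threshold whose TPR
    reaches target_tpr. A frame is called 'whisper' when score >= threshold.

    scores: detector outputs, one per frame (higher means more whisper-like)
    labels: ground truth, truthy for whispered frames
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative frame")

    # Sweep candidate thresholds from strict to lenient, as on an ROC curve.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= thr)
        tpr = tp / positives
        if tpr >= target_tpr:
            fp = sum(1 for s, y in zip(scores, labels) if not y and s >= thr)
            return thr, tpr, fp / negatives
    raise ValueError("target_tpr must be at most 1.0")
```

Running the same function with the same `target_tpr` on the scores of each detection method gives FPR values that are comparable across methods, which is the comparison the highlight describes.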

Summary

Introduction

Speech has been one of the most primitive modes of communication between all higher forms of life. It is interesting to note that even though the basic organs that regulate our speech are the same, speech varies with the speaker. This difference is accounted for by prosody, which is defined as the science of pitch, loudness, tempo, rhythm and intonation of speech. By and large models based on a large collection of regional databases have to whispered speech. This form is known as esophageal speech.
