Detecting hesitations in the automatic recognition of spontaneous speech

Douglas O’Shaughnessy, Hesham Tolba, Weiying Li, Rachid El Meliani, Zhong-Hua Wang, Clark Z. Lee

doi:10.1121/1.423590

Abstract

Practical speech recognizers must accept normal conversational voice input (including hesitations). However, most automatic speech recognition work has concentrated on read speech, whose acoustic aspects differ significantly from those of speech found in actual dialogues. Hesitations, filled pauses, and restarts (after aborted utterances) are common in natural speech, yet few recognition systems handle such disfluencies with any degree of success. Among other problems, filled pauses (e.g., “uhh,” “umm”), unlike silences, resemble phones that occur as part of words in continuous speech. The work reported here further develops techniques to allow identification of filled pauses. The problem of finding and correcting restarts is also examined, i.e., not just determining where the speech interruption occurs, but also estimating which words are undesired. The Switchboard database (of natural telephone conversations, yielding relatively poor recognition rates to date) provided data for the study. While most automatic recognition methods rely entirely on the spectral envelope (e.g., low-order cepstral coefficients), identifying hesitation phenomena requires using a combination of spectra, fundamental frequency, and duration.
