Abstract

Practical speech recognizers must accept normal conversational voice input, including hesitations. However, most automatic speech recognition work has involved read speech, whose acoustic aspects differ significantly from those of speech found in actual dialogues. Hesitations, filled pauses, and restarts (after aborted utterances) are common in natural speech, yet few recognition systems handle such disfluencies with any degree of success. Among other problems, filled pauses (e.g., "uhh," "umm"), unlike silences, resemble phones that occur within words in continuous speech. The work reported here further develops techniques for identifying filled pauses. A distinction is made between disfluencies in actual dialogues (e.g., in the Switchboard database of natural telephone conversations, for which recognition rates have so far been poor) and simulated ones (e.g., the ATIS Wizard-of-Oz-style database of airline travel inquiries). Speaking with actual people appears to influence disfluencies: for example, filled pauses tend to be shorter and more variable in pitch pattern, although unfilled pauses adjacent to filled ones remain important in both styles. While most automatic recognition methods rely entirely on the spectral envelope (e.g., low-order cepstral coefficients), identifying hesitation phenomena seems to require the use of fundamental frequency and duration in addition to such spectral parameters.

