Abstract
Although automatic speech recognition has become proficient at transcribing well-prepared, fluent speech, transcribing speech that contains many disfluencies, such as spontaneous conversational and lecture speech, remains problematic. Filled pauses (FPs) are the most frequently occurring disfluencies in this type of speech. Recent studies suggest that FPs increase the error rates of state-of-the-art speech transcription systems, primarily because most FPs are not well annotated or provided in training data transcriptions and because FPs are acoustically similar to some common non-content words. To improve speech transcription, we propose a new automatic refinement approach to detect FPs in British English lecture speech transcription. This approach combines the pronunciation probabilities for each word in the dictionary and acoustic language model scores for FP refinement through a modified speech recognition forced-alignment framework. We evaluate the proposed approach on the Reith Lectures speech transcription task, in which only imperfect training transcriptions are available. Successful results are achieved for both the development and evaluation datasets. We also investigate how acoustic models trained on different speech genres affect FP refinement. To further validate the effectiveness of the proposed approach, we compare the speech transcription performance of systems built on training data transcriptions with and without FP refinement.
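As a rough illustration of the score combination described in the abstract, the following minimal Python sketch picks, for one aligned position in a transcript, the candidate whose combined log score is highest. It is not the authors' implementation: the `Candidate` class, the `combined_score` and `refine_slot` functions, the weights, and all numeric values are hypothetical, and the sketch assumes that "acoustic language model scores" can be read as separate acoustic and language model log scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One hypothesis for an aligned position (all fields are illustrative)."""
    label: str              # word or filled-pause token, e.g. "a" or "uh"
    is_fp: bool             # True if the candidate is a filled pause
    pron_logprob: float     # log pronunciation probability from the dictionary
    acoustic_loglik: float  # acoustic log-likelihood from forced alignment
    lm_logprob: float       # language model log-probability in context


def combined_score(c: Candidate, pron_weight: float = 1.0,
                   lm_weight: float = 1.0) -> float:
    """Weighted sum of log scores; the weights are assumed tuning parameters."""
    return (c.acoustic_loglik
            + pron_weight * c.pron_logprob
            + lm_weight * c.lm_logprob)


def refine_slot(candidates, pron_weight: float = 1.0,
                lm_weight: float = 1.0) -> Candidate:
    """Return the candidate with the highest combined score."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(
        candidates,
        key=lambda c: combined_score(c, pron_weight, lm_weight),
    )


# Made-up scores: the transcript says "a", but the audio fits "uh" better.
candidates = [
    Candidate("a", False, pron_logprob=-0.5, acoustic_loglik=-42.0, lm_logprob=-2.0),
    Candidate("uh", True, pron_logprob=-1.2, acoustic_loglik=-38.5, lm_logprob=-3.0),
]
best = refine_slot(candidates)
print(best.label, best.is_fp)
```

With equal weights the better acoustic fit of the filled pause outweighs its lower pronunciation and language model scores; raising the assumed `lm_weight` shifts the decision back toward the transcribed word.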
Highlights
Speech disfluencies are common phenomena in spontaneous and lecture speech [1]
To examine the quality of transcriptions derived from the lightly supervised decoding system with acoustic models (AMs) trained on different speech genres, Table 1 presents the results for the bbc.dev dataset using SWB-FP.AM and BN-FP.AM (SWB: Switchboard-I corpus; BN: Broadcast News; FP: filled pause), which were the same AMs used in the SWB-FP and BN-FP systems, respectively
After closely analyzing the deleted and inserted words, we found that the additional deletions and insertions produced by BN-FP.AM derive primarily from confusion between FPs and ordinary words
Summary
Speech disfluencies (e.g., filled pauses, repetitions, and repairs) are common phenomena in spontaneous and lecture speech [1]. The most frequently occurring disfluencies are filled pauses (FPs), especially when the topic is unfamiliar and when speakers are uncertain or need to make decisions. FPs are an integral part of how humans speak, can provide valuable information about the speaker’s cognitive state, and can be critical for successful turn-taking [2]. For automatic speech transcription systems, FPs have been shown to be problematic because they can be confused with and recognized as short function words, usually resulting in fragment-like structures that increase transcription error rates [3,4,5,6]. Handling FPs properly is therefore indispensable to developing robust speech transcription.