Abstract

The focus of automatic speech recognition (ASR) research has gradually shifted from read speech to spontaneous speech. ASR systems can reach above 90% accuracy when evaluated on read speech, but accuracy on spontaneous speech is much lower. The higher error rate is due in part to poor modeling of pronunciations in spontaneous speech. An analysis of pronunciation variations at the acoustic level reveals that they include both complete changes and partial changes. A complete change is the replacement of a canonical phoneme by an alternative phone, such as ‘b’ being pronounced as ‘p’. Partial changes are variations within a phoneme, marked by diacritics for effects such as nasalization, centralization, devoicing, and voicing. Most current work in pronunciation modeling attempts to represent pronunciation variations either by alternative phonetic representations or by the concatenation of subphone units at the state level. In this dissertation, we show that partial changes are far less clear-cut than previously assumed and cannot be modeled merely by alternative phones or by the concatenation of phone units. When a partial change occurs, a phone is not completely substituted, deleted, or inserted, and the acoustic representation at the phone level is often ambiguous. We suggest that, in addition to phonetic representations of pronunciation variations, the ambiguity of acoustic representations caused by partial changes should be taken into account. The acoustic model for spontaneous speech should differ from that for read and planned speech; it should have a strong ability to cover partial changes. We propose modeling partial changes by combining the pronunciation model with the acoustic model at the state level. Based on this pronunciation model, we reconstruct the acoustic model to improve its resolution without sacrificing the model's identity, with the goal of accommodating pronunciation variations.
The effectiveness of this approach was evaluated on the Hub4NE Mandarin Broadcast News Corpus, which contains different styles of speech. The results show that the new pronunciation modeling approach does not help much for pre-planned speech, but provides a significant gain for spontaneous speech. To the best of our knowledge, this dissertation is the first to systematically investigate both complete changes and partial changes in spontaneous Mandarin speech. The results reported in this dissertation demonstrate that our approaches are both efficient and effective.

