Abstract

The literature on improving ASR performance has focused on three major areas: enhanced acoustic modeling, the use of new acoustic features, and contributions to language modeling. An aspect that is less frequently considered is the effect of incorrect transcriptions. The objective of this paper is to address this issue and correct transcriptions during training. The phonetic transcriptions delivered with a corpus are often hand-labeled and thus suffer from human error owing to the short duration of phonemes. Alternatively, the phonetic sequence can be generated by force-aligning, for each word, the lexicon pronunciation that best matches the acoustic sequence. In either case, the resulting pronunciation may not match the actual acoustic sequence. An attempt is made to increase the likelihood of the transcript given the acoustic features by systematically removing vowels from the transcription that are not articulated. To identify vowels that appear in the transcript but are missing in the utterance, a group delay (GD)-based boundary detection technique is employed. Group delay is a signal-processing-based vowel detector and is independent of the transcription. Viterbi forced alignment (VFA) is also used to obtain acoustic syllable boundaries from the phonetic transcription. Deviations between the syllable boundaries obtained from GD and VFA are further confirmed by a silence–vowel–consonant classifier. The corrected transcription thus generated is found to increase the log likelihood of the transcript with respect to the acoustic features, leading to a relative improvement of 2.8% in phone error rate on the TIMIT corpus.
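
To make the correction step concrete, the sketch below illustrates one plausible way to combine GD-detected vowel evidence with VFA segments and a silence–vowel–consonant classifier to drop unarticulated vowels. This is a minimal illustration, not the authors' implementation: all function names, data structures, thresholds, and the classifier interface are assumptions.

```python
# Illustrative sketch (assumed, not from the paper): drop transcript vowels
# that have no group-delay (GD) vowel evidence and are confirmed as
# non-vowel by a silence/vowel/consonant classifier.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    label: str    # phone label from the transcription, e.g. "aa", "s"
    start: float  # start time in seconds (from Viterbi forced alignment)
    end: float    # end time in seconds


# Illustrative subset of vowel labels; a real system would use the full phone set.
VOWELS = {"aa", "ae", "ah", "ao", "eh", "ih", "iy", "uh", "uw"}


def has_gd_support(seg: Segment, gd_peaks: List[float], tol: float = 0.03) -> bool:
    """True if any GD-detected vowel peak lies within `tol` seconds of the
    VFA segment, i.e. the vowel appears to be articulated (tolerance assumed)."""
    return any(seg.start - tol <= t <= seg.end + tol for t in gd_peaks)


def correct_transcription(
    vfa_segments: List[Segment],
    gd_peaks: List[float],
    svc_classify: Callable[[float, float], str],  # assumed interface: returns "sil", "vow", or "con"
) -> List[Segment]:
    """Remove transcript vowels that (a) have no GD vowel evidence and
    (b) are confirmed as non-vowel by the silence/vowel/consonant classifier."""
    corrected = []
    for seg in vfa_segments:
        if seg.label in VOWELS and not has_gd_support(seg, gd_peaks):
            if svc_classify(seg.start, seg.end) != "vow":
                # Vowel is in the transcript but not articulated: drop it.
                continue
        corrected.append(seg)
    return corrected
```

The corrected segment list would then replace the original transcription before re-estimating the acoustic models; the actual boundary-comparison and confirmation criteria used in the paper may differ from this sketch.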
