Substitution Error Analysis for Improving the Word Accuracy in Telugu Language Automatic Speech Recognition System

M Nagamani

doi:10.9790/0661-0340710

Abstract

The use of natural language for communicating with computers is a current research topic, and speech recognition plays a central role in communicating with a computer by means of speech. Speech recognition is the process of converting an analog signal into the symbolic form known as text. An Automatic Speech Recognition (ASR) system converts an input speech signal, which may be any word spoken by a human, into text. Over the last four decades, research has aimed to bring the word recognition accuracy of such systems closer to human perception. Many factors degrade system performance. General factors include the language and its pronunciation variants, which arise from the accent, gender, and dialect of the speaker. Specific factors include the system environment, articulatory phonetics, acoustic phonetics (the acoustic system), the lexical model (pronunciation dictionary), and the language model, as well as the mood of the speaker. Acoustic modeling can in most cases be done using signal processing, whereas the lexical model requires substantial human intervention because it is based on the language and on human perception. Once sufficient primary data has been built, automatic processing with different modeling techniques can derive more data for proper training of the speech recognition system. Linguistics therefore plays a major role in building a robust system, and language plays a major role in lexical model design. Each language has its own rhythm in its speech and language aspects. Based on this rhythm, the world's languages are classified as stress-timed or syllable-timed. Most ASR systems use a lexical model built for stress-timed languages. All Indian languages, Telugu among them, are syllable-timed. This paper analyzes the decoding results of an ASR system under two different lexical model environments.
One is the CMU lexicon, which is based on a stress-timed language because the tool uses American-accented English phonemes; the other is the UOH lexicon, a handcrafted lexicon for Telugu, which is a syllable-timed language. The paper further studies how gender and accent (pronunciation variant factors) affect substitution errors in the ASR system. Confusion matrices for vowels alone and for consonants alone are analyzed in both cases, and also for isolated word recognition, where the confusion matrix shows the most commonly substituted phonemes. In all cases the UOH lexicon based ASR system improves word accuracy by around 20 to 30%.

Speech is a process used to communicate from a speaker to a listener. Pronunciation relates to speech, and humans have an intuitive feel for pronunciation. For instance, people chuckle when words are mispronounced and notice when a foreign accent colors a speaker's pronunciation (1). If words were always pronounced in the same way, ASR would be relatively easy. However, for various reasons words are almost always pronounced differently, varying from one speaker to another and from one situation to another. This variability is due to co-articulation, regional accents, speaking rate, speaking style, and so on.
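To make the substitution analysis mentioned in the abstract concrete, the following is a minimal sketch of how substitution pairs and word accuracy can be counted from a reference transcript and a decoder hypothesis. It is not the scoring tool used in the paper: the function names `align`, `substitution_counts`, and `word_accuracy` are hypothetical, the alignment is a plain Levenshtein alignment with unit costs, and the phoneme symbols in the usage lines are illustrative only. Word accuracy is taken in its usual form, (N - S - D - I) / N, where N is the number of reference tokens and S, D, I are the substitution, deletion, and insertion counts.

```python
from collections import Counter


def align(ref, hyp):
    """Levenshtein-align two token lists (unit costs, an assumption).

    Returns a list of (op, ref_token, hyp_token) with op in
    {"ok", "sub", "del", "ins"}.
    """
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def substitution_counts(ops):
    """Count (reference, hypothesis) pairs; these fill a confusion matrix."""
    return Counter((r, h) for op, r, h in ops if op == "sub")


def word_accuracy(ops):
    """(N - S - D - I) / N over the reference tokens."""
    n_ref = sum(1 for op, _, _ in ops if op != "ins")
    errors = sum(1 for op, _, _ in ops if op != "ok")
    return (n_ref - errors) / n_ref


# Illustrative phoneme-level example (symbols are made up for the sketch).
phone_ops = align(["k", "a", "m", "a", "l", "a"], ["g", "a", "m", "a", "l", "a"])
confusions = substitution_counts(phone_ops)

# Illustrative word-level example: one substitution and one deletion.
word_ops = align(["one", "two", "three", "four"], ["one", "too", "three"])
accuracy = word_accuracy(word_ops)
```

Accumulating `substitution_counts` over all test utterances, separately for vowels and consonants, gives the kind of confusion matrix the abstract refers to; running the same test set through decoders built on each lexicon would allow the two word accuracies to be compared.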

Full Text