Abstract

This paper presents methods for handling pronunciation variation in Standard Malay (SM) speech recognition. Pronunciation variation can be handled by explicitly modifying the knowledge sources or by improving the decoding method. Two types of pronunciation variation are defined: complete (phone) changes and partial (sound) changes. A complete or phone change means that one phoneme is realized as another phoneme, whereas a partial or sound change occurs when the acoustic realization is ambiguous between two phonemes. Complete changes can be handled by constructing a pronunciation variation dictionary that includes alternative pronunciations at the lexical level, or by dynamically expanding the search space to include those pronunciation variants. Partial changes can be handled by adjusting the acoustic models through sharing or adaptation of the Gaussian mixture components. Experimental results show that the pronunciation variation dictionary and dynamic search space expansion substantially improve recognition performance, whereas the acoustic model refinement methods were relatively less effective in our experiments.
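As a minimal sketch of the lexical-level approach, the dictionary below maps each word to its canonical pronunciation plus listed variants; the words, phone symbols and helper function are illustrative assumptions rather than entries from the paper's actual lexicon.

```python
# Hypothetical pronunciation variation dictionary: each word maps to its
# canonical phone sequence followed by observed surface variants.
# The entries and phone symbols are illustrative, not the paper's data.
lexicon = {
    "saya": [
        ("s", "a", "y", "a"),        # canonical pronunciation
        ("s", "a", "y", "e"),        # variant: final /a/ realized as /e/
    ],
    "makan": [
        ("m", "a", "k", "a", "n"),   # canonical pronunciation
        ("m", "a", "k", "a", "ng"),  # variant: final /n/ realized as /ng/
    ],
}

def pronunciations(word):
    """Return every pronunciation listed for a word, so that the
    decoder's search space covers the canonical form and all variants."""
    return lexicon.get(word, [])
```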

Highlights

  • Pronunciation changes are very common in spontaneous speech

  • It is clear that in going from isolated words to conversational or spontaneous speech, the amount of pronunciation variation increases. This is because spontaneous speech contains far more phone-change and sound-change phenomena, owing to variable speaking rates, moods, emotions, prosody, co-articulation and so on, even when the speaker intends to utter canonical pronunciations (Greenberg, 1999)

  • The acoustic models are a set of hidden Markov models (HMMs) that characterize the statistical variations of the input speech; a minimal topology sketch follows this list
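As a minimal sketch (assuming the 3-state left-to-right topology described in the summary below), the transition structure of such an HMM can be written as follows; the probabilities are illustrative placeholders, not values trained on the SM corpus.

```python
import numpy as np

# Illustrative transition matrix for a 3-emitting-state, left-to-right
# HMM: each state either repeats (self-loop) or advances one step.
A = np.array([
    [0.6, 0.4, 0.0],  # state 1: stay, or advance to state 2
    [0.0, 0.6, 0.4],  # state 2: stay, or advance to state 3
    [0.0, 0.0, 0.6],  # state 3: stay; the remaining 0.4 exits the model
])

# In the paper's setup, each state's emission density would be a
# 4-component Gaussian mixture over 39-dimensional feature vectors.
```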

Summary

Development of speech database

In building the Standard Malay speech database, the utterances were selected from Buletin Utama TV3 broadcast news, comprising about 550 utterances drawn from four hours of news. We trained the recognition model on syllables formed by concatenating three types of phonological units: the Initial, the Middle, and the Final, each represented as a sequence of SM characters as shown in Table 1 and Table 2. We extract 12 mel-frequency cepstral coefficients (MFCCs) with logarithmic energy for every 10 ms analysis frame, and concatenate their first and second derivatives to obtain a 39-dimensional feature vector. We apply cepstral mean normalization and energy normalization to the feature vectors. The training procedure is divided into two stages: monophone training followed by triphone training. The acoustic models are 3-state left-to-right, context-dependent, 4-mixture, cross-word triphone models, trained using the HTK toolkit (Young et al., 2001)
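As a minimal sketch of this front end (using librosa rather than HTK's HCopy, and assuming a 25 ms analysis window and a 16 kHz sampling rate, neither of which the summary states), the 39-dimensional features could be computed as follows; treating the 0th cepstral coefficient as the log-energy term is a common approximation.

```python
import numpy as np
import librosa

# Load an utterance; 16 kHz is an assumption, not given in the summary.
y, sr = librosa.load("utterance.wav", sr=16000)

hop = int(0.010 * sr)   # 10 ms frame shift, as stated in the summary
win = int(0.025 * sr)   # 25 ms analysis window (assumed)

# 13 coefficients: c1..c12 plus c0, which stands in for log energy here.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            hop_length=hop, n_fft=win)

# First and second time derivatives (delta and delta-delta).
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)

# Stack into one 39-dimensional feature vector per frame.
feats = np.vstack([mfcc, d1, d2])       # shape: (39, num_frames)

# Per-utterance cepstral mean normalization (energy is normalized
# together with the cepstral coefficients in this simplified sketch).
feats -= feats.mean(axis=1, keepdims=True)
```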

Recognition modeling and implementation
Results and discussions
Conclusion