Abstract
Most state-of-the-art large-vocabulary continuous speech recognition systems employ context-dependent (CD) phone units. However, CD phone units are not efficient at capturing the long-term spectral dependencies of tone in most tone languages. Standard Yorùbá (SY) is a language composed of syllables with tones and requires a different method of acoustic modeling. In this paper, a context-dependent tone acoustic model was developed. The tone unit is assumed to be the syllable. The average magnitude difference function (AMDF) was used to derive the utterance-wide F0 contour, followed by automatic syllabification and tri-syllable forced alignment with the Speech Phonetization Alignment and Syllabification (SPPAS) tool. For classification of the CD tone, the slope and intercept of the F0 values were extracted from each segmented unit. A supervised clustering scheme was used to partition the CD tri-tones by category, and the features were normalized based on some statistics to derive the acoustic feature vectors. A multi-class support vector machine (MSVM) was used for tri-tone training. The experimental results showed that the word recognition accuracy obtained from the MSVM tri-tone system, based on dynamic-programming tone-embedded features, was comparable with that of phone features. The best parameter tuning was obtained with 10-fold cross-validation, and the overall accuracy was 97.5678%. In terms of word error rate (WER), the MSVM CD tri-tone system outperforms the hidden Markov model tri-phone system, with a WER of 44.47%.
Keywords: Syllabification, Standard Yorùbá, Context Dependent Tone, Tri-tone Recognition
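The two feature-extraction steps named in the abstract, AMDF-based F0 estimation and the slope and intercept of F0 over a segmented unit, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the 75 to 400 Hz search range, the octave-error tolerance, and the use of an ordinary least-squares line fit are all assumptions made for illustration.

```python
import math


def amdf(frame, lag):
    """Average magnitude difference between a frame and itself shifted by `lag` samples."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n


def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=400.0, tolerance=0.05):
    """Estimate F0 (Hz) for one voiced frame from the deepest AMDF valley.

    Assumption: among lags whose AMDF lies within `tolerance` of the global
    minimum (relative to the AMDF peak), the shortest lag is chosen, which
    guards against picking a multiple of the true pitch period.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    if len(frame) <= lag_max:
        raise ValueError("frame is too short for the requested f0_min")
    values = {lag: amdf(frame, lag) for lag in range(lag_min, lag_max + 1)}
    floor = min(values.values())
    margin = tolerance * max(values.values())
    best_lag = min(lag for lag, v in values.items() if v <= floor + margin)
    return sample_rate / best_lag


def slope_intercept(times, f0_values):
    """Least-squares line through (time, F0) points; returns (slope, intercept)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_f = sum(f0_values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (f - mean_f) for t, f in zip(times, f0_values))
    slope = sxy / sxx
    return slope, mean_f - slope * mean_t


# Synthetic example: a 200 Hz sine sampled at 8 kHz stands in for a voiced frame.
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 200 * i / SAMPLE_RATE) for i in range(400)]
f0 = estimate_f0(frame, SAMPLE_RATE)

# Hypothetical F0 track (Hz) over one syllable, sampled every 10 ms.
slope, intercept = slope_intercept([0.00, 0.01, 0.02, 0.03], [120.0, 125.0, 130.0, 135.0])
```

In this sketch a rising tone yields a positive slope and a falling tone a negative one, while the intercept reflects the pitch height at the start of the unit.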
Highlights
In recent times, automatic speech recognition (ASR) has been of special interest to researchers. Its application domain has expanded from the simplest digit-recognition systems to portable cross-language spontaneous dialogue systems. This development is mainly due to improvements in computational power and in modeling approaches for representing the speech signal.
The results show that the speech recognizer built upon the HMM/SVM segmentation outperforms the one built upon the generalized learning segmentation in terms of word error rate (WER) by about 0.05% on noisy data.
The multi-class support vector machine (MSVM) approach to context-dependent tone recognition is suitable for the current study.
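The word error rate quoted in these results is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. A minimal sketch of that standard definition follows. The function name and the placeholder word strings are illustrative assumptions, and the scoring tool actually used in the study is not stated in the source.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens: one substitution (w2 -> wX) and one deletion (w4).
wer = word_error_rate("w1 w2 w3 w4", "w1 wX w3")
```

Because insertions are counted against the reference length, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.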
Summary
In recent times, automatic speech recognition (ASR) has been of special interest to researchers. Its application domain has expanded from the simplest digit-recognition systems to portable cross-language spontaneous dialogue systems. This development is mainly due to improvements in computational power and in modeling approaches for representing the speech signal. Tone languages make up a large proportion of the spoken languages of the world, and yet lexical tone is an understudied feature. This is attributed to unsettled questions about how to build the vocabulary, what should constitute the sub-word units, and how structures over these units are parameterized, modeled, and trained. Several models have been proposed for tone language ASR. These techniques can be categorized into two main classes: (i) rule-based and (ii) data-based approaches. A drawback of the rule-based scheme is the generation, organization, and representation of the interdependency of the rule set, as well as the unavailability of domain experts. These setbacks inspired the use of data-driven techniques for ASR (Kumalalo et al., 2010). The number of CD tri-tones is limited, which reduces model confusability compared with CD tri-phones, which require many hours of segmented and labelled speech units. The objective of this paper is to develop a tri-tone acoustic model and explore the use of sub-segmental features for SY CD tone identification.
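The claim that CD tri-tones are few in number can be made concrete with a short sketch. It assumes the three level tones of Standard Yorùbá (High, Mid, Low), which gives 3 × 3 × 3 = 27 possible tri-tone contexts. The `left-centre+right` label format and the `#` boundary symbol are borrowed from common tri-phone notation as illustrative assumptions and are not taken from the paper.

```python
from itertools import product

TONES = ("H", "M", "L")  # High, Mid, Low level tones


def all_tritones(tones=TONES):
    """Enumerate every left-centre+right tone context."""
    return [f"{left}-{centre}+{right}" for left, centre, right in product(tones, repeat=3)]


def tritone_labels(tone_sequence, boundary="#"):
    """Label each syllable tone with its neighbours, padding utterance edges."""
    padded = [boundary, *tone_sequence, boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


inventory = all_tritones()                 # 27 within-utterance contexts
labels = tritone_labels(["L", "H", "M"])   # one label per syllable
```

The inventory grows with the cube of the unit count, so a tone set of three stays small, whereas a phone set of several dozen units produces tens of thousands of possible tri-phone contexts.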
More From: Journal of Applied Sciences and Environmental Management