Abstract
Text to Speech (TTS) plays a vital role in imparting information to people who have difficulty reading text but can understand spoken language. In Bhutan, many people fall into this category with respect to the national language, Dzongkha, and a system of this kind would benefit the community. It would also advance the language's digital development and help narrow the digital gap, and it is even more important for people with visual impairment. Text to speech systems are widely used in applications ranging from talking bots to news readers and announcement systems. This paper presents an attempt at developing a working model of a Text to Speech system for the Dzongkha language. It also presents the development of a transcription (grapheme) table for phonetic transcription from Dzongkha text to its equivalent phone set. The transcription tables for both consonants and vowels have been prepared in a way that facilitates better compatibility in computing. A total of 3000 sentences have been manually transcribed and recorded with a single male voice. The speech synthesis is based on a statistical method with concatenative speech generation on the FESTIVAL platform. The model is generated using two variants, CLUSTERGEN and CLUNITS, of the FESTIVAL speech tools FESTVOX. This system prototype is the first of its kind for the Dzongkha language.
Keywords: Natural language processing (NLP), Dzongkha, Text to speech (TTS) system, Statistical speech synthesis, phoneme, corpus, transcription
DOI: 10.7176/CEIS/12-1-04
Publication date: January 31st 2021
Highlights
Natural language processing is key to understanding human language in depth: it facilitates better analysis and can process a large amount of data in very little time
A statistical method built on the speech tools of the FESTIVAL platform is applied to a large corpus of Dzongkha text and speech to analyse and extract features and to synthesize speech
With the parameters generated from the text and speech, a phoneme concatenative method is used to produce the desired string of spoken words and sentences
Summary
Natural language processing is key to understanding human language in depth: it facilitates better analysis and can process a large amount of data in very little time. TTS is broadly divided into two parts, text processing and speech generation. Both can be performed by one of two methods: a large-database method (Black & Hunt 1996) or a statistical modelling method (Black 2007; Zen 2007). The database method requires more computation and a larger data set but gives better results, although it breaks down when given new words or sentences (Jamtsho & Muneesawang 2020). The statistical method requires less data to create a model of the language and adapts dynamically to new words and sentences. A good TTS system employs a combination of the two methods. With the parameters generated from the text and speech, a phoneme concatenative method is used to produce the desired string of spoken words and sentences. The system can be used in applications such as safe driving, e-learning and educational toys for children.
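The two stages described above, table-based transcription followed by phoneme concatenation, can be illustrated with a minimal Python sketch. It is not the paper's FESTIVAL/FESTVOX pipeline: the table entries, the phone names and the tiny waveform inventory are hypothetical placeholders (Latin letters rather than Dzongkha script), and the greedy longest-match lookup is an assumed strategy chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical transcription table: placeholder graphemes mapped to
# placeholder phones. These are NOT the paper's Dzongkha entries.
GRAPHEME_TABLE: Dict[str, List[str]] = {
    "ka": ["k", "a"],
    "k": ["k"],
    "a": ["a"],
    "m": ["m"],
}

# Hypothetical unit inventory: one short waveform (list of samples) per phone.
UNIT_INVENTORY: Dict[str, List[float]] = {
    "k": [0.0, 0.1],
    "a": [0.3, 0.2, 0.1],
    "m": [0.05],
}


def transcribe(text: str, table: Dict[str, List[str]]) -> List[str]:
    """Map text to a phone sequence by greedy longest-match table lookup."""
    phones: List[str] = []
    max_len = max(len(grapheme) for grapheme in table)
    i = 0
    while i < len(text):
        # Try the longest candidate grapheme first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                phones.extend(table[chunk])
                i += size
                break
        else:
            raise KeyError(f"no table entry for {text[i]!r} at position {i}")
    return phones


def concatenate(phones: List[str], inventory: Dict[str, List[float]]) -> List[float]:
    """Join the stored waveform of each phone into one sample sequence."""
    samples: List[float] = []
    for phone in phones:
        samples.extend(inventory[phone])
    return samples


if __name__ == "__main__":
    phone_sequence = transcribe("kam", GRAPHEME_TABLE)
    waveform = concatenate(phone_sequence, UNIT_INVENTORY)
    print(phone_sequence, len(waveform))
```

A real concatenative system would select among many candidate units per phone and smooth the joins; the sketch only shows how a transcription table feeds the concatenation step.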