Abstract

Text-to-Speech (TTS) systems convert written text into spoken language and play an important role in accessibility and human-computer interaction. Many systems synthesize speech from phonetic or phonemic transcriptions; an alternative approach concatenates pre-recorded units drawn from a database, with unit sizes ranging from diphones to whole phrases. Although this approach offers broad coverage, it has limitations, particularly when high-quality output requires storing entire words or phrases. Synthesizers can also model human articulation and the behavior of the vocal tract to produce voices. Arabic poses particular challenges for TTS development because of its complex morphology, semantic nuances, and many dialects. These dialects often diverge substantially from standard Arabic and do not follow formal orthographic rules, so unedited Arabic text frequently contains spelling and grammatical errors. In this study, we present and evaluate an end-to-end Tacotron model designed specifically for Arabic TTS synthesis. The model exploits the rich acoustic information in audio recordings, such as frequency and pitch, to generate naturalistic speech that closely resembles human speech. We also compare its performance with that of a pre-trained Tacotron model applied to Arabic text, providing insight into how well Arabic TTS systems perform and where they could be improved.
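To make the acoustic side of this concrete, the sketch below shows one common way to extract the kind of frequency and pitch information the abstract refers to: a mel spectrogram, the usual training target for Tacotron-style models. The file name and parameter values are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of mel-spectrogram extraction, the acoustic representation
# Tacotron-style models typically learn to predict from text.
import numpy as np
import librosa

# Load an Arabic speech recording (the path is hypothetical).
waveform, sample_rate = librosa.load("arabic_utterance.wav", sr=22050)

# Compute an 80-band mel spectrogram; 80 mel bands and a 256-sample hop are
# common Tacotron settings, assumed here rather than quoted from the paper.
mel = librosa.feature.melspectrogram(
    y=waveform,
    sr=sample_rate,
    n_fft=1024,
    hop_length=256,
    n_mels=80,
)

# Convert power to decibels for a numerically stable training target.
mel_db = librosa.power_to_db(mel, ref=np.max)

print(mel_db.shape)  # (80, n_frames): one 80-dimensional acoustic frame per hop
```

Each column of `mel_db` is one acoustic frame; a Tacotron-style decoder predicts such frames autoregressively from character or phoneme embeddings, and a vocoder then inverts the predicted frames back into a waveform.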
