TTS Engines Research Articles

Object detection is an important computer vision technique that has increasingly attracted the attention of researchers in recent years. The literature to date in the field has introduced a range of object detection models. However, these models have largely been English-language-based, and there is only a limited number of published studies that have addressed how object detection can be implemented for the Arabic language. As far as we are aware, the generation of an Arabic text-to-speech engine to utter objects’ names and their positions in images to help Arabic-speaking visually impaired people has not been investigated previously. Therefore, in this study, we propose an object detection and segmentation model based on the Mask R-CNN algorithm that is capable of identifying and locating different objects in images, then uttering their names and positions in Arabic. The proposed model was trained on the Pascal VOC 2007 and 2012 datasets and evaluated on the Pascal VOC 2007 testing set. We believe that this is one of a few studies that uses these datasets to train and test the Mask R-CNN model. The performance of the proposed object detection model was evaluated and compared with previous object detection models in the literature, and the results demonstrated its superiority and ability to achieve an accuracy of 83.9%. Moreover, experiments were conducted to evaluate the performance of the incorporated translator and TTS engines, and the results showed that the proposed model could be effective in helping Arabic-speaking visually impaired people understand the content of digital images.

Read full abstract

Recent trends in voicebot application development have enabled utilization of both speech-to-text and text-to-speech (TTS) generation techniques. In order to generate a voice response to a given speech, one needs to use a TTS engine. The recently developed TTS engines are shifting towards end-to-end approaches utilizing models such as Tacotron, Tacotron-2, WaveNet, and WaveGlow. The reason is that it enables a TTS service provider to focus on developing training and validating datasets comprising of labelled texts and recorded speeches instead of designing an entirely new model that outperforms the others which is time-consuming and costly. In this context, this work introduces the first Vietnamese FPT Open Speech Data (FOSD)-Tacotron-2-based TTS model dataset. This dataset comprises of a configuration file in *.json format; training and validating text input files (in *.csv format); a 225,000-step checkpoint of the trained model; and several sample generated audios. The published dataset is extremely worth for serving as a model for benchmarking with other newly developed TTS models / engines. In addition, it opens an entirely new TTS research optimization problem to be addressed: How to effectively generate speech from text given: a black box TTS (trained) model and its training and validation input texts.

Read full abstract

TTS Engines Research Articles

Related Topics

Articles published on TTS Engines

Object Recognition System for the Visually Impaired: A Deep Learning Approach using Arabic Annotation

The First Vietnamese FOSD-Tacotron-2-based Text-to-Speech Model Dataset.

On the intelligibility of fast synthesized speech for people who are blind: A cross-system comparison.

Method and system for delivering text-to-speech in a real time telephony environment

Assessment of the intelligibility and perceived quality of speech produced by text-to-speech engines

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

TTS Engines Research Articles

Related Topics

Articles published on TTS Engines

Object Recognition System for the Visually Impaired: A Deep Learning Approach using Arabic Annotation

The First Vietnamese FOSD-Tacotron-2-based Text-to-Speech Model Dataset.

On the intelligibility of fast synthesized speech for people who are blind: A cross-system comparison.

Method and system for delivering text-to-speech in a real time telephony environment

Assessment of the intelligibility and perceived quality of speech produced by text-to-speech engines