Abstract
We present a technique for real time deep learning based scene image detection and segmentation and neural text-to-speech (TTS) synthesis; to detect, classify and segment images in real time views and generate their corresponding speeches. In this work, we show improvement to the existing convolutional neural network approach for a single-model neural text to speech synthesis with an extension to object segmentation features in a given scene. This model, built on top of a high effective and efficient building block of a trained neural network model (masked R-CNN), generates as output, high precision images with bounding boxes and a significant audio signal quality improvement on the corresponding images detected in real time views. We show that a convolutional neural network model combined with neural TTS system can detect, classify and segment multiple objects in a single scene with their various bounding boxes, unique voices and display them in real life time. We applied transfer learning technique on the base model for the image detection, classification, and segmentation tasks. This work introduces a powerful image-to-speech tracking system with instant object segmentation which could be valuable for pixel level image to image measurement in a real time view for easy navigation.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.