Abstract

Image-to-speech systems convert visual information, such as images or videos, into auditory output. These systems use complex algorithms and machine-learning techniques to recognize and describe visual content, allowing individuals who are blind or visually impaired to access information that would otherwise be inaccessible to them. Image-to-speech systems are becoming increasingly sophisticated and can be integrated into a variety of devices, from smartphones to smart glasses. This article presents an approach to improving the accuracy of an image-to-speech system by combining several techniques. The proposed system first uses Tesseract, an optical character recognition (OCR) engine, to extract textual information from images. OCR output, however, is often imperfect, and its errors can degrade the accuracy of image-to-speech models. To address this issue, the Text-Davinci-002 engine was applied to post-process the OCR output, correcting errors and improving the accuracy of the extracted text. Finally, the Microsoft Speech API was employed to generate speech from the corrected text. By integrating these three techniques, the accuracy of the image-to-speech system was significantly improved. An evaluation on a generated synthetic dataset showed that the proposed techniques improve image-to-speech accuracy at both the word and character levels and also correct punctuation errors. This approach can be useful in various applications, including reading text from images, converting written text to speech, and assisting people with visual impairments.
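
To make the three-stage pipeline concrete, the following is a minimal sketch of how the stages could be wired together in Python, assuming the pytesseract wrapper for Tesseract, the legacy (pre-1.0) openai SDK that exposed the text-davinci-002 completion endpoint, and the Azure Cognitive Services Speech SDK for the Microsoft Speech API. The prompt wording, API keys, region, and file name are illustrative assumptions, not details taken from the paper.

    # Sketch of the OCR -> LLM post-processing -> TTS pipeline described above.
    # Keys, region, prompt text, and file names are placeholders (assumptions).
    import pytesseract
    from PIL import Image
    import openai  # legacy (pre-1.0) SDK interface
    import azure.cognitiveservices.speech as speechsdk

    openai.api_key = "YOUR_OPENAI_KEY"  # placeholder

    def ocr_extract(image_path):
        """Stage 1: extract raw text from an image with Tesseract."""
        return pytesseract.image_to_string(Image.open(image_path))

    def llm_postprocess(raw_text):
        """Stage 2: correct OCR errors with text-davinci-002.
        The prompt wording is an assumption; the paper does not specify it."""
        response = openai.Completion.create(
            engine="text-davinci-002",
            prompt="Correct the OCR errors, including punctuation, in the "
                   "following text:\n\n" + raw_text,
            max_tokens=512,
            temperature=0,
        )
        return response.choices[0].text.strip()

    def synthesize_speech(text):
        """Stage 3: speak the corrected text via the Microsoft Speech API
        (Azure Cognitive Services Speech SDK, default speaker output)."""
        config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY",
                                        region="YOUR_REGION")
        synthesizer = speechsdk.SpeechSynthesizer(speech_config=config)
        synthesizer.speak_text_async(text).get()

    if __name__ == "__main__":
        corrected = llm_postprocess(ocr_extract("sample_image.png"))
        synthesize_speech(corrected)

One design point worth noting: running the language model between OCR and synthesis means the text-to-speech stage never sees raw OCR noise, which is why the reported gains appear at both the character and word levels as well as in punctuation.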
