Abstract

In today's digital age, the internet is saturated with images that often convey messages and emotions more effectively than words alone. Individuals with visual impairments, who cannot perceive these images, face significant obstacles in this visually oriented online environment. With millions of visually impaired people around the globe, it is essential to close this accessibility gap and enable them to interact with online visual content. We propose a novel model for neural image caption generation with visual attention to address this pressing issue. Our model uses a combination of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to convert the content of images into spoken descriptions, making them accessible to the visually impaired. The primary objective of our project is to generate captions that accurately and effectively describe the visual elements of an image. The proposed model operates in two phases. First, the attention-based CNN-RNN model converts the image's content into a textual description. A text-to-speech API then converts this textual description into audio, allowing visually impaired individuals to perceive visual information through sound. Through extensive experimentation and evaluation, we aim to achieve a high level of accuracy and descriptiveness in our image captioning system. We will evaluate the model's performance through comprehensive qualitative and quantitative assessments, comparing its generated captions to ground-truth captions annotated by humans. By enabling visually impaired individuals to access and comprehend online images, our research promotes digital inclusion and equality. It has the potential to improve the online experience for millions of visually impaired people, enabling them to interact with visual content and enriching their lives through meaningful image-based interactions.
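
The two-phase pipeline described above can be sketched in code. The sketch below assumes a PyTorch implementation with a ResNet-50 encoder, a soft-attention LSTM decoder, and gTTS as the text-to-speech API; these specific choices are illustrative assumptions, not the exact configuration of the proposed system.

```python
# Minimal sketch of the two-phase pipeline: (1) CNN + attention-RNN captioning,
# (2) text-to-speech. Framework, backbone, and TTS library are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """Phase 1a: extract a grid of spatial features from the image with a CNN."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop pooling and classification layers to keep per-region features.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):                        # images: (B, 3, H, W)
        feats = self.backbone(images)                 # (B, 2048, h, w)
        return feats.flatten(2).permute(0, 2, 1)      # (B, h*w, 2048) regions


class AttentionDecoder(nn.Module):
    """Phase 1b: an RNN that attends over image regions while emitting words."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)
        self.rnn = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):               # feats: (B, N, feat_dim)
        B, N, _ = feats.shape
        h = feats.new_zeros(B, self.hidden_dim)
        c = feats.new_zeros(B, self.hidden_dim)
        outputs = []
        for t in range(captions.size(1)):
            # Soft attention: score each region against the current hidden state.
            scores = self.attn(torch.cat(
                [h.unsqueeze(1).expand(-1, N, -1), feats], dim=2))
            alpha = torch.softmax(scores, dim=1)      # attention weights (B, N, 1)
            context = (alpha * feats).sum(dim=1)      # weighted image context
            word = self.embed(captions[:, t])
            h, c = self.rnn(torch.cat([word, context], dim=1), (h, c))
            outputs.append(self.fc(h))
        return torch.stack(outputs, dim=1)            # (B, T, vocab_size) logits


def caption_to_speech(caption_text, out_path="caption.mp3"):
    """Phase 2: convert the generated caption to audio with a TTS API
    (gTTS is used here only as an example of such a service)."""
    from gtts import gTTS
    gTTS(text=caption_text, lang="en").save(out_path)
```

At inference time the decoder would be run autoregressively, feeding back its own predicted words, and the resulting sentence passed to `caption_to_speech`; the teacher-forced forward pass shown above is one common training setup rather than the only option.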
