Abstract

In recent times, with advances in Artificial Intelligence, multidomain problems spanning Computer Vision (CV) and Natural Language Processing (NLP), such as image caption generation and audio generation, have piqued the attention of researchers across the globe due to their applications in medicine, business, and technology. Image caption generation entails automatically producing text describing the contents of an image. In this paper we present a novel caption generation and audio generation framework. We use deep neural networks, namely a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network, together with transfer learning techniques, to perform this task. The model has two stages: 1) generate a caption for any given image; 2) use the gTTS (Google Text-to-Speech) generator to produce audio for the generated caption. This framework is extremely beneficial to visually impaired people, since it allows them to comprehend visuals. The Flickr8K dataset was used to train and test the model. A total of 6000 photos were utilised to train the model, with an additional 1000 images used for validation and testing.
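The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_caption` is a hypothetical placeholder standing in for the trained CNN-LSTM captioner, and stage 2 uses the real `gTTS` class from the `gtts` package (skipped gracefully if the package or network is unavailable).

```python
def generate_caption(image_path: str) -> str:
    """Hypothetical stand-in for the paper's CNN-LSTM captioner.

    A real implementation would encode the image with a pretrained CNN
    (transfer learning) and decode a word sequence with an LSTM.
    """
    return "a dog runs through the grass"


def caption_to_audio(image_path: str, out_path: str = "caption.mp3") -> str:
    # Stage 1: generate a caption for the given image.
    caption = generate_caption(image_path)

    # Stage 2: synthesise audio for the caption with gTTS.
    try:
        from gtts import gTTS  # third-party package: pip install gTTS
        gTTS(text=caption, lang="en").save(out_path)
    except Exception:
        # gTTS not installed or no network; the caption is still returned.
        pass

    return caption


print(caption_to_audio("example.jpg"))
```

The caption string is returned as well as spoken, so the same pipeline can serve both sighted users (text) and visually impaired users (audio).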
