Enhancing Cross-Linguistic Image Caption Generation with Indian Multilingual Voice Interfaces using Deep Learning Techniques

Vijay A Sangolgi, Mithun B Patil, Shubham S Vidap, Satyam S Doijode, Swayam Y Mulmane, Aditya S Vadaje

doi:10.1016/j.procs.2024.03.244

Abstract

The Multilingual Voice-Based Image Caption Generator (MVBICG) is a versatile tool with applications spanning communication, cultural preservation, business, and technology. Image caption generation combines computer vision and natural language processing (NLP): the system must understand the details and context of an image and describe them in natural language. Such descriptions are an invaluable aid for visually impaired individuals. The MVBICG is designed to provide real-time image descriptions as voice output in multiple languages, according to user requirements. Producing voice output in multiple languages with the help of the Google Translate API is often referred to as “multilingual voice conversion” or “multilingual speech synthesis.” The system leverages recent advances in deep learning, in particular convolutional neural networks (CNNs) for image feature extraction and recurrent neural networks (RNNs) with attention mechanisms for natural language generation. The MVBICG achieves BLEU scores of 0.483601 for BLEU-1 and 0.320112 for BLEU-2, indicating that it generates contextually relevant image captions. These scores support its value in bridging language barriers and enhancing accessibility, and point to its potential for broader societal impact. The system's training progress is illustrated by a loss plot showing the convergence of the model over time. As image processing continues to advance, it is expected to become a critical research domain dedicated to the preservation and protection of human lives, and the MVBICG is presented as a contribution in that direction.
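
For readers unfamiliar with the metric, the BLEU-1 and BLEU-2 figures reported in the abstract measure n-gram overlap between a generated caption and one or more reference captions. The following is a minimal pure-Python sketch of sentence-level cumulative BLEU (clipped n-gram precision, geometric mean, brevity penalty). It is an illustration only: the abstract does not state how the scores were computed, and the function names and example captions below are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand = ngram_counts(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


def brevity_penalty(candidate, references):
    """Penalise candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties are broken toward the shorter reference.
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n):
    """Cumulative BLEU-max_n with uniform weights (no smoothing)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


# Hypothetical example captions (not taken from the paper's data).
candidate = "a dog runs on the grass".split()
references = [
    "a dog is running on the grass".split(),
    "the dog runs across the field".split(),
]
bleu_1 = bleu(candidate, references, 1)
bleu_2 = bleu(candidate, references, 2)
print(f"BLEU-1 = {bleu_1:.4f}, BLEU-2 = {bleu_2:.4f}")
```

Scores reported for a whole test set are usually corpus-level, meaning n-gram counts are pooled over all captions before the precision is taken, so they are not simply the average of per-caption values computed this way.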

Full Text