Abstract

Understanding content in a foreign language can be challenging for people living in India's diverse linguistic landscape. We propose a system that addresses this problem with machine translation combined with speech recognition and speech synthesis. It converts online video resources into Indian languages by integrating open-source technologies such as speech-to-text (STT) and text-to-speech (TTS) systems, together with the FFmpeg library for separating and combining audio and video. We use the Whisper model, which accepts audio in up to 60 different languages as input and transcribes it into text, split into sentence-level segments with timestamps. The sentence-level transcription generated by Whisper is then translated into the desired language using Google Cloud translate_v2. Next, each timestamped segment is individually converted into audio using the Google Cloud Text-to-Speech service, ensuring that the audio fits within the duration of its respective timestamp. The individual audio segments are then joined to generate the final audio track in the desired language. Finally, this audio is attached to the original video so that video and audio remain synchronized. The accuracy of the translation was verified by comparing the naturalness of the generated audio with general spoken-language standards. The application benefits visually impaired individuals and those who cannot read text, providing them with a means to acquire knowledge in their native languages.
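The fitting and attachment steps can be illustrated with a minimal sketch. The abstract does not state how a synthesized clip is made to fit its timestamp, so the strategy here (speed up a clip that runs long, pad a short one with silence) is an assumption, as are the names `Segment`, `plan_segment`, and `mux_command`. The FFmpeg options shown (`-map`, `-c:v copy`) are standard, but the exact command used by the authors is not given.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One transcribed sentence with its timestamps (hypothetical structure)."""
    start: float  # segment start in seconds
    end: float    # segment end in seconds
    text: str     # translated sentence for this segment


def plan_segment(seg: Segment, synth_duration: float) -> Tuple[float, float]:
    """Return (tempo, pad_seconds) so the synthesized clip fills the slot.

    Assumed strategy: if the clip is longer than the slot, play it faster
    by the factor `tempo`; if it is shorter, keep normal speed and append
    `pad_seconds` of silence.
    """
    slot = seg.end - seg.start
    if slot <= 0:
        raise ValueError("segment must have positive duration")
    if synth_duration > slot:
        return synth_duration / slot, 0.0
    return 1.0, slot - synth_duration


def mux_command(video: str, dubbed_audio: str, output: str) -> List[str]:
    """Build an FFmpeg command that keeps the original video stream
    unchanged and replaces its audio with the dubbed track."""
    return [
        "ffmpeg", "-y",
        "-i", video,          # input 0: original video
        "-i", dubbed_audio,   # input 1: joined translated audio
        "-map", "0:v:0",      # take video from input 0
        "-map", "1:a:0",      # take audio from input 1
        "-c:v", "copy",       # do not re-encode the video
        output,
    ]
```

For example, a 3-second clip assigned to a 2-second slot would be played at 1.5 times normal speed, while a 1.5-second clip in the same slot would be followed by 0.5 seconds of silence. The returned command list could be passed to `subprocess.run` once the joined audio file exists.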
