This study addresses language barriers in computer-mediated communication by developing an application that integrates OpenAI’s Whisper ASR model with Google Translate machine translation to enable real-time, continuous speech transcription and translation, as well as the processing of video and audio files. The application was developed using an experimental method that incorporated defined standards for testing and evaluation. The integration expanded language coverage to 133 languages and improved translation accuracy, while efficiency was enhanced through greedy decoding parameters and the Faster Whisper model. Questionnaire-based usability evaluations showed the application to be efficient, effective, and user-friendly, though minor issues with user satisfaction were noted. Overall, the Speech Translate application shows potential for facilitating transcription and translation of video content, especially for language learners and individuals with disabilities.

Additionally, this study introduces an Arabic learning game built on an Artificial Neural Network, specifically the CNN algorithm. Focusing on the “Speaking” skill, the game applies voice and image extraction techniques and achieves a high accuracy rate of 95.52%. The game offers an engaging, interactive method for learning Arabic, a language often considered challenging, and the neural network enhances the effectiveness of the learning experience. By combining voice and image extraction techniques, the game provides a comprehensive approach to improving Arabic speaking skills in an enjoyable way.
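As a hedged illustration of the transcribe-then-translate pipeline summarized above, the following Python sketch chains greedy decoding in Faster Whisper with a Google Translate call. The faster-whisper and deep-translator packages, the model size, the file name, and the target language are all assumptions chosen for illustration; the study’s actual implementation is not specified here.

```python
# Minimal sketch of a transcribe-then-translate pipeline.
# Assumptions (not confirmed by the study): the faster-whisper and
# deep-translator packages stand in for the Whisper ASR and Google
# Translate components; model size, file name, and target language
# are illustrative only.
from faster_whisper import WhisperModel
from deep_translator import GoogleTranslator

# beam_size=1 selects greedy decoding, trading a little accuracy for speed.
model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("lecture.mp4", beam_size=1)

translator = GoogleTranslator(source="auto", target="en")
for segment in segments:
    # Each segment carries start/end timestamps and the transcribed text.
    print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text}")
    print("  ->", translator.translate(segment.text))
```

Setting `beam_size=1` is what greedy decoding amounts to in practice: it keeps only the single most likely token at each step, which is the efficiency choice the abstract attributes to greedy parameters.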
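For the Arabic learning game, the abstract names a CNN but not its architecture. The sketch below is a minimal, assumed spectrogram classifier in Keras for recognizing spoken target words; the layer sizes, input shape, and ten-word vocabulary are illustrative assumptions, not the study’s design.

```python
# Hypothetical minimal sketch of a CNN classifier for the "Speaking" skill.
# All architectural choices here are assumptions for illustration.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_WORDS = 10  # assumed size of the Arabic vocabulary being practiced

# Input: a log-mel spectrogram of a short utterance (assumed 64x64, 1 channel)
model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_WORDS, activation="softmax"),  # one class per target word
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

In a game of this kind, the predicted class for a learner’s utterance would be compared against the prompted word to give immediate feedback on pronunciation.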