Abstract

Detecting variations in dialects within a language can be challenging, particularly in regions with rich linguistic diversity such as Saudi Arabia. To our knowledge, no prior attempt has been made to develop a multimodal, audio–textual framework for Saudi dialect detection. Existing approaches typically detect dialects from audio or textual data alone, failing to capture the complex relationship between the two modalities. In this paper, we propose a novel multimodal framework, called MF-Saudi, for Saudi dialect detection. The framework consists of three main components: (1) a pretrained BERT encoder for extracting and encoding textual information; (2) an acoustic model for representing audio signals and fusing them with the textual information via a fusion layer; and (3) an alignment learning module that learns meaningful representations capturing the complexities of audio–text relationships, resulting in improved dialect detection. We conduct empirical evaluations on a real-world dataset, demonstrating that our solution outperforms several state-of-the-art baseline methods. The code for our experiments is available at: https://github.com/raed19/MF-Saudi.
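To make the three components concrete, the following is a minimal PyTorch sketch of an MF-Saudi-style model. The specific acoustic encoder (a small CNN over log-mel features), fusion by concatenation plus a linear layer, the symmetric InfoNCE alignment loss, the BERT checkpoint, the number of dialect classes, and all dimensions are illustrative assumptions for this sketch, not the authors' exact design.

```python
# Hedged sketch of an MF-Saudi-style architecture. Component choices,
# dimensions, and the contrastive alignment loss are assumptions made
# for illustration; the abstract does not specify them.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel


class AcousticEncoder(nn.Module):
    """Assumed acoustic model: a small CNN over log-mel spectrogram features."""

    def __init__(self, n_mels=80, dim=768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, dim, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, mel):          # mel: (batch, n_mels, time)
        h = self.conv(mel)           # (batch, dim, time)
        return h.mean(dim=-1)        # mean-pool over time -> (batch, dim)


class MFSaudiSketch(nn.Module):
    def __init__(self, n_dialects=5, dim=768, temperature=0.07):
        super().__init__()
        # Pretrained BERT text encoder; the checkpoint is an assumption.
        self.text_encoder = BertModel.from_pretrained("bert-base-multilingual-cased")
        self.audio_encoder = AcousticEncoder(dim=dim)
        # Assumed fusion layer: concatenate modalities, project back to dim.
        self.fusion = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, n_dialects)
        self.temperature = temperature

    def forward(self, input_ids, attention_mask, mel):
        t = self.text_encoder(input_ids=input_ids,
                              attention_mask=attention_mask).pooler_output
        a = self.audio_encoder(mel)
        fused = torch.relu(self.fusion(torch.cat([t, a], dim=-1)))
        logits = self.classifier(fused)   # dialect prediction scores

        # Assumed alignment objective: symmetric InfoNCE that pulls matched
        # audio-text pairs together in a shared embedding space.
        t_n, a_n = F.normalize(t, dim=-1), F.normalize(a, dim=-1)
        sim = t_n @ a_n.t() / self.temperature          # (batch, batch)
        targets = torch.arange(sim.size(0), device=sim.device)
        align_loss = (F.cross_entropy(sim, targets) +
                      F.cross_entropy(sim.t(), targets)) / 2
        return logits, align_loss
```

In a setup like this, training would combine a standard classification loss on `logits` with the alignment loss (e.g., a weighted sum), so that the fused representation is shaped by both the dialect labels and the audio–text correspondence.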
