Abstract

Speaker demographic recognition and segmentation analytics play a key role in delivering personalized experiences across automated industries and services. This paper develops a multi-label demographic recognition system for Arabic speakers from audio and associated textual modalities. The system detects age groups, genders, and dialects, and can be easily extended to incorporate further demographic traits. The proposed method relies on deep learning for both feature learning and recognition. Representations of the audio modality are learned from 3D spectrograms through an AlexNet-based CNN architecture, while an AraBERT transformer is employed to learn representations of the textual modality. A method for fusing the audio and textual representations is also provided. The effectiveness of the proposed method is evaluated on the Saudi Audio Dataset for Arabic (SADA), a recently published database of TV-show audio recordings in different Arabic dialects. The experimental findings show that, for single-modality multi-label demographic classification, the textual modality represented with AraBERT outperformed the audio modality represented with 3D spectrograms and the AlexNet-based CNN. Furthermore, combining the audio and textual modalities yielded significant improvements across all demographic traits.
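For illustration, below is a minimal PyTorch sketch of the kind of late-fusion architecture the abstract describes: an AlexNet-style encoder over a 3D spectrogram, a text embedding, concatenation fusion, and per-trait classification heads. The class name, projection sizes, class counts, and the treatment of the 3D spectrogram as a 3-channel image (e.g., static, delta, and delta-delta features stacked) are assumptions for this sketch, not details taken from the paper; in practice the text vector would come from AraBERT's pooled output (768-d for bert-base models).

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet

class MultiModalDemographicNet(nn.Module):
    """Hypothetical late-fusion network: AlexNet spectrogram encoder +
    precomputed text embedding -> per-trait demographic heads."""

    def __init__(self, text_dim=768, n_age=4, n_gender=2, n_dialect=10):
        super().__init__()
        # Audio branch: AlexNet backbone over a 3-channel spectrogram
        # (assumed channel layout: static + delta + delta-delta).
        backbone = alexnet(weights=None)
        backbone.classifier[-1] = nn.Identity()  # expose 4096-d features
        self.audio_encoder = backbone
        # Project both modalities into a shared space before fusing.
        self.audio_proj = nn.Linear(4096, 512)
        self.text_proj = nn.Linear(text_dim, 512)
        # One classification head per demographic trait.
        self.age_head = nn.Linear(1024, n_age)
        self.gender_head = nn.Linear(1024, n_gender)
        self.dialect_head = nn.Linear(1024, n_dialect)

    def forward(self, spectrogram, text_embedding):
        a = torch.relu(self.audio_proj(self.audio_encoder(spectrogram)))
        t = torch.relu(self.text_proj(text_embedding))
        fused = torch.cat([a, t], dim=-1)  # simple concatenation fusion
        return (self.age_head(fused),
                self.gender_head(fused),
                self.dialect_head(fused))

# Usage with random stand-in tensors (batch of 2 utterances):
model = MultiModalDemographicNet()
spec = torch.randn(2, 3, 224, 224)   # 3-channel spectrogram "image"
txt = torch.randn(2, 768)            # e.g., AraBERT pooled output
age_logits, gender_logits, dialect_logits = model(spec, txt)
```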
