BEYOND WORDS: HARNESSING SPEECH SOUND FOR SPEAKER AGE AND GENDER DETECTION USING 1D CNN ARCHITECTURE WITH SELF-ATTENTION MECHANISM

Alia Abdulhassan, Umniah Jaid

doi:10.5455/jjcit.71-1703265368

Abstract

Beyond its immediate content, speech carries rich information about a speaker's demographics, including age and gender. Estimating a speaker's age and gender has a wide range of applications, from voice forensic analysis to personalized advertising, healthcare monitoring, and human-computer interaction. However, estimating age precisely remains difficult because of age ambiguity: utterances from speakers of adjacent ages are frequently indistinguishable. To address this, we propose a novel end-to-end approach that uses Mozilla's Common Voice dataset and transforms raw audio into high-quality feature representations using Wav2Vec2.0 embeddings. These embeddings are then fed into our self-attention-based convolutional neural network (CNN) model. To mitigate age ambiguity, we evaluate the effects of different loss functions, such as focal loss and Kullback-Leibler (KL) divergence loss. We also evaluate estimation accuracy for different speech durations. Experimental results on the Common Voice dataset demonstrate the efficacy of our approach: an accuracy of 87% for male speakers, 91% for female speakers, and 89% overall, and an accuracy of 99.1% for gender prediction.
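
The two loss functions named in the abstract can be illustrated with a minimal sketch. The abstract does not give their exact formulation, so the code below assumes the standard focal loss and a KL divergence computed against a soft target that spreads probability mass over adjacent age classes. The function names, the six age classes, `gamma`, and `sigma` are all illustrative assumptions, not the authors' settings.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(probs, target, gamma=2.0):
    """Standard focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    gamma = 0 reduces to ordinary cross-entropy; gamma = 2.0 is an assumed value.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def soft_age_target(target, num_classes, sigma=1.0):
    """Hypothetical soft label: Gaussian mass centred on the true age class."""
    weights = [math.exp(-((k - target) ** 2) / (2 * sigma ** 2))
               for k in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(target_dist, probs, eps=1e-12):
    """KL(target || predicted) for one sample."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target_dist, probs) if t > 0)


# Example with six hypothetical age classes; the true class is index 2.
probs = softmax([0.1, 1.2, 2.0, 1.1, 0.0, -0.5])
print(focal_loss(probs, target=2))
print(kl_divergence(soft_age_target(2, 6), probs))
```

With a soft target of this kind, a prediction that lands on a neighbouring age class is penalised less than one that lands far away, which is one plausible way a KL-based loss could account for age ambiguity.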

Journal: Jordanian Journal of Computers and Information Technology