Abstract

In this paper, we study the application of transfer learning to speaker age and gender recognition. Speech analysis systems that take images of log Mel-spectrograms or MFCCs as classifier input have recently been gaining popularity. We therefore used pretrained models that performed well on the ImageNet task: AlexNet, VGG-16, ResNet18, ResNet34, and ResNet50, as well as the state-of-the-art EfficientNet-B4 from Google. Additionally, we trained 1D CNN and TDNN models for speaker age and gender recognition. We compared the performance of these models on age (4 classes), gender (3 classes), and joint age and gender (7 classes) recognition. Despite the high performance of the pretrained models on the ImageNet task, our TDNN models achieved a better unweighted average recall (UAR) in all tasks presented in this study: age (UAR = 51.719%), gender (UAR = 81.746%), and joint age and gender (UAR = 48.969%) recognition.
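For readers unfamiliar with the metric, UAR (unweighted average recall) is the mean of the per-class recalls, so every class contributes equally regardless of how many samples it has. The following is a minimal sketch of that definition, not the evaluation code used in this study; the function name and the toy labels are hypothetical.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (UAR).

    Hypothetical helper for illustration; classes are taken from y_true.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    # Recall of class c = correct[c] / total[c]; UAR averages these equally.
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: class "a" is four times larger than class "b".
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "a", "b", "a"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

In this toy case, the recall is 1.0 for class "a" and 0.5 for class "b", so the UAR is 0.75 while plain accuracy is 5/6. The gap shows why UAR is a common choice when class sizes are imbalanced.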
