Abstract
Speaker age and gender classification is one of the most challenging problems in speech signal processing. With recent technological advances, identifying speaker age and gender has become a necessity for speaker verification and identification systems, with applications such as identifying suspects in criminal cases, improving human–machine interaction, and adapting music for people waiting in a queue. Despite intensive studies aimed at extracting descriptive and distinctive features, classification accuracies are still not satisfactory. In this work, a model for generating bottleneck features from a deep neural network (DNN) and a Gaussian Mixture Model–Universal Background Model (GMM–UBM) classifier are proposed for the speaker age and gender classification problem. A deep neural network with a bottleneck layer is first trained in an unsupervised manner to calculate the initial weights between layers. It is then trained and tuned in a supervised manner to generate transformed mel-frequency cepstral coefficients (T-MFCCs). The GMM–UBM is used to build a GMM for each class, and these models are used to classify speaker age and gender. The age-annotated database of German telephone speech (aGender) is used to evaluate the proposed classification system. The newly generated T-MFCCs show potential to achieve significant improvements in speaker age and gender classification with the GMM–UBM classifier. The proposed classification system achieved an overall accuracy of 57.63%. The highest accuracy, 72.97%, was obtained for adult female speakers.
Highlights
This study focuses on generating an efficient and robust feature set by using a deep bottleneck feature (DBF) extractor from a DNN and designs a Gaussian Mixture Model–Universal Background Model (GMM–UBM) classifier for speaker age and gender classification
A model for generating bottleneck features from a deep neural network and a GMM–UBM classifier are proposed for the speaker age and gender classification problem
Summary
This study focuses on generating an efficient and robust feature set by using a deep bottleneck feature (DBF) extractor from a DNN and designs a GMM–UBM classifier for speaker age and gender classification. Previous studies have used spectral and temporal features in age and gender identification and classification systems, including MFCCs [2, 8, 9, 10, 11, 12, 50], formant frequencies, fundamental frequency (F0) [2, 9], energy, relative spectral transform (RASTA) [2], RASTA–perceptual linear prediction (RASTA-PLP) [2], jitter and shimmer [8, 9, 10, 13], speech rate [13], harmony, pitch range (PR) [2, 14], and zero-crossing rates. One earlier system combined several glottal, spectral, and prosodic feature sets and achieved an overall accuracy of 42.2% with a GMM–UBM classifier.
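The GMM–UBM classification step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes diagonal-covariance GMMs, a UBM trained with plain EM on pooled data, Reynolds-style MAP adaptation of weights and means with a relevance factor of 16, and synthetic random vectors standing in for T-MFCC frames. All function names, class labels, and parameter values are hypothetical.

```python
import numpy as np

def log_gauss(X, means, variances):
    # Log density of each frame (N, D) under each diagonal Gaussian (K, D) -> (N, K).
    diff = X[:, None, :] - means[None, :, :]
    return -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                   + np.sum(np.log(2 * np.pi * variances), axis=1))

def posteriors(X, weights, means, variances):
    # Component responsibilities for each frame, computed stably in the log domain.
    lp = log_gauss(X, means, variances) + np.log(weights)
    lp -= lp.max(axis=1, keepdims=True)
    p = np.exp(lp)
    return p / p.sum(axis=1, keepdims=True)

def avg_log_likelihood(X, weights, means, variances):
    # Mean per-frame log-likelihood of X under the GMM (log-sum-exp over components).
    lp = log_gauss(X, means, variances) + np.log(weights)
    m = lp.max(axis=1, keepdims=True)
    return float(np.mean(m[:, 0] + np.log(np.exp(lp - m).sum(axis=1))))

def train_ubm(X, n_components, n_iter=20, seed=0):
    # Plain EM for a diagonal-covariance GMM trained on data pooled over all classes.
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), n_components, replace=False)]
    variances = np.tile(X.var(axis=0), (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        r = posteriors(X, weights, means, variances)
        nk = r.sum(axis=0) + 1e-10
        weights = nk / nk.sum()
        means = (r.T @ X) / nk[:, None]
        variances = np.maximum((r.T @ X ** 2) / nk[:, None] - means ** 2, 1e-6)
    return weights, means, variances

def map_adapt(X, ubm, relevance=16.0):
    # Assumed adaptation scheme: MAP-adapt weights and means, keep UBM variances.
    weights, means, variances = ubm
    r = posteriors(X, weights, means, variances)
    nk = r.sum(axis=0)
    alpha = nk / (nk + relevance)
    ex = (r.T @ X) / np.maximum(nk, 1e-10)[:, None]
    new_means = alpha[:, None] * ex + (1 - alpha[:, None]) * means
    new_weights = alpha * nk / len(X) + (1 - alpha) * weights
    return new_weights / new_weights.sum(), new_means, variances

def classify(X, class_models):
    # Pick the class whose adapted GMM gives the highest average log-likelihood.
    scores = {c: avg_log_likelihood(X, *m) for c, m in class_models.items()}
    return max(scores, key=scores.get)

def demo(seed=0):
    # Synthetic stand-in for T-MFCC frames: two hypothetical classes, 3-D features.
    rng = np.random.default_rng(seed)
    centers = {"class_a": 0.0, "class_b": 3.0}
    train = {c: rng.normal(mu, 1.0, size=(400, 3)) for c, mu in centers.items()}
    ubm = train_ubm(np.vstack(list(train.values())), n_components=4)
    models = {c: map_adapt(X, ubm) for c, X in train.items()}
    tests = {c: rng.normal(mu, 1.0, size=(200, 3)) for c, mu in centers.items()}
    return {c: classify(X, models) for c, X in tests.items()}
```

Calling `demo()` returns a dictionary mapping each true class label to the predicted label for one synthetic test utterance. In a real system the synthetic arrays would be replaced by per-frame feature matrices, with one adapted model per age and gender class.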