Abstract

Speaker identification (SI) is an emerging research area in digital speech processing: the task of classifying speakers based on features extracted from their speech utterances. Since the recent advances in deep learning (DL), deep convolutional neural networks (DCNNs) have been widely applied to SI tasks. A CNN model consists of two main parts: deep convolutional feature extraction and classification. However, training a DCNN model is computationally expensive and time consuming. Therefore, to reduce the computational cost of training a DCNN model, an ensemble of machine learning (ML) models is proposed for the speaker identification task. The ensemble classifies speakers based on deep convolutional features extracted from the input speech representations. In this work, the deep convolutional features are extracted from mel-spectrograms as flatten vectors (FVs) using a pre-trained DCNN (the VGG-16 model). The ML models combined in the hard-voting ensemble are random forest (RF), extreme gradient boosting (XGBoost), and support vector machine (SVM). The models are trained and tested on the voice conversion challenge (VCC) 2016 mono-lingual speech data, the VCC 2020 multi-lingual speech data, and the multi-lingual emotional speech data (ESD). Moreover, three data augmentation techniques, namely pitch scaling, amplitude scaling, and polarity inversion, are used to increase the number of speech samples. The accuracy obtained with the proposed approach is significantly higher than that of the individual ML models.
