Abstract

The speech signal is rich in features that are used for biometric recognition and for other applications such as gender and emotion recognition. Channel conditions, manifested as background noise and reverberation, are the main challenge, causing feature shifts between the test and training data. In this paper, a hybrid speaker identification model is developed for consistent speech features and high recognition accuracy. Mel-frequency cepstral coefficient (MFCC) features are improved by incorporating a pitch frequency coefficient obtained from time-domain analysis of the speech. To enhance noise immunity, we propose a single-hidden-layer feed-forward neural network (FFNN) tuned by an optimized particle swarm optimization (OPSO) algorithm. The proposed model is tested using 10-fold cross-validation over different levels of additive white Gaussian noise (AWGN) (0-50 dB). A recognition accuracy of 97.83% is obtained with the proposed model in a clean voice environment. Under noisy channel conditions, the proposed model is affected less than baseline classifiers such as the plain FFNN, random forest (RF), K-nearest neighbour (KNN), and support vector machine (SVM).
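
As a rough illustration of the pipeline described above, the sketch below extracts per-frame MFCCs, appends a per-frame pitch estimate, and corrupts an utterance with AWGN at a chosen SNR. It assumes librosa for MFCC extraction and YIN pitch tracking; the coefficient count, frame settings, pitch tracker, and file name are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch: hybrid per-frame features (MFCCs + pitch) and AWGN corruption.
# Frame sizes, coefficient counts, and the pitch tracker are illustrative only.
import numpy as np
import librosa

def add_awgn(signal, snr_db):
    """Add white Gaussian noise so the result has the requested SNR in dB."""
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10.0))
    return signal + np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)

def hybrid_features(y, sr, n_mfcc=13, hop=512):
    """Per-frame feature matrix: n_mfcc MFCCs plus one pitch value (Hz)."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)  # (n_mfcc, T)
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr, hop_length=hop)           # (T,)
    T = min(mfcc.shape[1], f0.shape[0])                                     # align frame counts
    return np.vstack([mfcc[:, :T], f0[None, :T]]).T                         # (T, n_mfcc + 1)

# Hypothetical usage on one utterance, clean and at 20 dB SNR.
y, sr = librosa.load("speaker_utterance.wav", sr=16000)  # hypothetical file
feats_clean = hybrid_features(y, sr)
feats_noisy = hybrid_features(add_awgn(y, snr_db=20), sr)
```

In the evaluation described above, such feature matrices would be computed for every utterance at each AWGN level (0-50 dB) and passed to the classifier under 10-fold cross-validation.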

Highlights

  • Voice is the oldest method of communication reported in human history

  • In [14], bottleneck features are extracted from speech signals by a deep neural network and concatenated with Mel-frequency cepstral coefficient (MFCC) features to improve speaker identification (see the sketch after these highlights)

  • As discussed in the previous sections, the feed-forward neural network is evaluated with several performance metrics in order to identify the best model for predicting the speaker identity
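
The bottleneck-feature idea cited from [14] in the highlight above can be sketched as follows: a small deep network with a narrow "bottleneck" layer is trained on frame-level features, the bottleneck activations are read out, and they are concatenated with the MFCC frames. The layer sizes, training setup, and variable names below are illustrative assumptions, not the configuration used in [14].

```python
# Rough sketch of bottleneck-feature extraction and fusion with MFCC frames.
import numpy as np
import tensorflow as tf

n_mfcc, bottleneck_dim, n_speakers = 13, 8, 10            # illustrative sizes

frames = tf.keras.Input(shape=(n_mfcc,))
h = tf.keras.layers.Dense(64, activation="relu")(frames)
b = tf.keras.layers.Dense(bottleneck_dim, activation="relu", name="bottleneck")(h)
out = tf.keras.layers.Dense(n_speakers, activation="softmax")(b)
dnn = tf.keras.Model(frames, out)
dnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# After training `dnn` on labelled frames, read out the bottleneck activations
# and stack them next to the original MFCCs.
bottleneck_model = tf.keras.Model(dnn.input, dnn.get_layer("bottleneck").output)
mfcc_frames = np.random.randn(100, n_mfcc).astype("float32")   # stand-in for real frames
bottleneck_feats = bottleneck_model.predict(mfcc_frames, verbose=0)
fused = np.concatenate([mfcc_frames, bottleneck_feats], axis=1)  # (100, n_mfcc + 8)
```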

Summary

Introduction

Voice is the oldest method of communication reported in human history. Its use was driven by the fact that humans continuously need to share their feelings and requirements in order to survive. Text-dependent speaker recognition can be implemented using time-domain analysis; the drawback of this method is that it requires an exact match between the test and training data, which is practically impossible [7, 8]. Feature selection from the fused feature sets is performed using kernel-based learning (e.g., a support vector machine (SVM)), and the reduced features are used by the SR model [15]. Another approach, illustrated in [16], enhances SR performance by constructing the acoustic features from speech data recorded over different channels. A feed-forward neural network in an optimized form is used to serve the required recognition purpose.
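
The optimized feed-forward network mentioned above can be illustrated, in spirit only, by a plain particle swarm search over a single-hidden-layer network's hyperparameters. The sketch below uses standard PSO with scikit-learn's MLPClassifier rather than the paper's OPSO variant; the search bounds, swarm size, inertia and acceleration constants, and the 5-fold cross-validation fitness are all assumptions.

```python
# Sketch: plain PSO over (hidden-layer size, learning rate) for an FFNN.
# Not the paper's OPSO variant; all constants below are illustrative.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def fitness(params, X, y):
    """Mean 5-fold CV accuracy of an FFNN with the given hyperparameters."""
    hidden = int(round(params[0]))
    lr = 10 ** params[1]                                  # learning rate searched in log space
    clf = MLPClassifier(hidden_layer_sizes=(hidden,), learning_rate_init=lr,
                        max_iter=300, random_state=0)
    return cross_val_score(clf, X, y, cv=5).mean()

def pso_tune(X, y, n_particles=10, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    lo = np.array([8.0, -4.0])                            # min hidden units, min log10(lr)
    hi = np.array([128.0, -1.0])                          # max hidden units, max log10(lr)
    pos = rng.uniform(lo, hi, size=(n_particles, 2))
    vel = np.zeros_like(pos)
    pbest, pbest_fit = pos.copy(), np.array([fitness(p, X, y) for p in pos])
    gbest = pbest[pbest_fit.argmax()].copy()
    w, c1, c2 = 0.7, 1.5, 1.5                             # inertia, cognitive, social weights
    for _ in range(n_iter):
        r1, r2 = rng.random((n_particles, 2)), rng.random((n_particles, 2))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        fit = np.array([fitness(p, X, y) for p in pos])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest, pbest_fit.max()
```

With hypothetical arrays `X_train` (hybrid feature vectors) and `y_train` (speaker labels), `pso_tune(X_train, y_train)` returns the best hyperparameter pair and its cross-validated accuracy; the final FFNN would then be retrained with that setting.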

Voice Processor
Hybrid Speech Features
Feature Mapping
Model Optimizer
Results and Discussion
Conclusions