Abstract

Speaker identification is the process of recognizing a human voice using artificial intelligence techniques. Speaker identification technologies are widely applied in voice authentication, security and surveillance, electronic voice eavesdropping, and identity verification. In the speaker identification process, extracting discriminative and salient features from speaker utterances is essential for accurately identifying speakers. Various features for speaker identification have recently been proposed. Most studies have relied on short-time features, such as perceptual linear predictive (PLP) coefficients and Mel frequency cepstral coefficients (MFCC), because they efficiently capture the repetitive nature of speech signals. Numerous studies have demonstrated the effectiveness of MFCC features in correctly identifying speakers. However, the performance of these features degrades on complex speech datasets, and they consequently fail to capture speaker characteristics accurately. To address this problem, this study proposes a novel fusion of MFCC and time-based features (MFCCT), which combines the strengths of MFCC and time-domain features to improve the accuracy of text-independent speaker identification (SI) systems. The extracted MFCCT features were fed as input to a deep neural network (DNN) to construct the speaker identification model. Results showed that the proposed MFCCT features coupled with the DNN outperformed the baseline MFCC and time-domain features on the LibriSpeech dataset. In addition, the DNN obtained better classification results than five machine learning algorithms recently utilized in speaker recognition. Moreover, this study evaluated the effectiveness of one-level and two-level classification methods for speaker identification; the experimental results showed that two-level classification outperformed one-level classification. The proposed features and classification model can be widely applied to different types of speaker datasets.
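
As a rough illustration of the feature fusion described above, the following sketch computes per-frame MFCCs alongside two simple time-domain descriptors and stacks them into a single feature matrix. It assumes the librosa library for audio loading and MFCC extraction; the choice of zero-crossing rate and RMS energy as the time-based features is a hypothetical stand-in, not necessarily the exact recipe used in the study.

```python
# Minimal sketch of MFCC + time-domain feature fusion (MFCCT-style).
# Assumes librosa is installed; zero-crossing rate and RMS energy are
# illustrative "time-based" features, not the paper's exact choices.
import numpy as np
import librosa

def extract_mfcct(path, sr=16000, n_mfcc=13, frame_length=400, hop_length=160):
    y, sr = librosa.load(path, sr=sr)
    # Short-time cepstral features: one n_mfcc-dimensional vector per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length, hop_length=hop_length)
    # Simple time-domain descriptors computed over matching frames.
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                             hop_length=hop_length)
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)
    # Fuse by stacking along the feature axis: (n_mfcc + 2, n_frames).
    n = min(mfcc.shape[1], zcr.shape[1], rms.shape[1])
    return np.vstack([mfcc[:, :n], zcr[:, :n], rms[:, :n]]).T
```

Each row of the returned matrix is one frame's fused feature vector, ready to be fed to a classifier such as the DNN described in the abstract.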

Highlights

  • Automatic speaker identification (ASI) is the process of using a machine to extract the identity of a speaker from a set of known speech signals

  • The performance of the proposed MFCC and time-based (MFCCT) features was compared with that of baseline features

  • The experimental results of this study show that the proposed MFCCT features and deep neural network (DNN) can classify speaker utterances with an overall accuracy between 83.5% and 92.9%


Introduction

Automatic speaker identification (ASI) is the process of using a machine to extract the identity of a speaker from a set of known speech signals. Speech signals are a powerful medium of communication that conveys rich and useful information, such as the emotion, gender, accent, and other unique characteristics of a speaker. These unique characteristics enable researchers to distinguish among speakers even when calls are conducted over the phone and the speakers are not physically present. Through such characteristics, machines can become familiar with the utterances of speakers, much as humans do. Speaker utterances from the collected dataset are used to train machine learning algorithms, and speakers are then identified from test utterances.
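
The train-then-identify pipeline this paragraph describes can be sketched as follows. The small scikit-learn MLP here is a hypothetical stand-in for the paper's DNN, whose architecture is not specified in this excerpt, and the frame-level majority vote is one common way to turn per-frame predictions into an utterance-level decision.

```python
# Illustrative enroll/identify loop; the MLPClassifier stands in for the
# paper's DNN, which is not fully specified in this excerpt.
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_identifier(features_by_speaker):
    """features_by_speaker: dict mapping speaker id -> (n_frames, n_feats) array."""
    X = np.vstack(list(features_by_speaker.values()))
    y = np.concatenate([np.full(len(feats), spk)
                        for spk, feats in features_by_speaker.items()])
    clf = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=200)
    clf.fit(X, y)
    return clf

def identify(clf, utterance_features):
    """Majority vote over per-frame predictions for one test utterance."""
    frame_labels = clf.predict(utterance_features)
    labels, counts = np.unique(frame_labels, return_counts=True)
    return labels[np.argmax(counts)]
```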
