Abstract

Automatic speaker identification has become a challenging research problem due to its wide range of applications. Neural networks and audio-visual identification systems can be very powerful, but they are limited by the number of speakers: performance drops gradually as more users are registered with the system. This paper proposes a scalable algorithm for real-time text-independent speaker identification based on vowel recognition. Vowel formants are unique across speakers and reflect the vocal tract characteristics of a particular speaker. The contribution of this paper is the design of a scalable system based on vowel formant filters, together with a scoring scheme for classifying an unseen instance. Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) are both analysed and compared as means of extracting formants from the windowed signal. The extracted formants are filtered against known vowel formant frequencies to isolate the vowel formants for further processing. The formant frequencies of each speaker are collected during the training phase. A test signal is processed in the same way to find its vowel formants, which are compared with the saved vowel formants to identify the speaker of the current signal. A score-based scheme assigns the current signal to the speaker with the highest number of matching formants. The model requires less than 100 bytes of stored data per enrolled speaker and can identify the speaker within a second. Tests conducted on multiple databases show that this score-based scheme outperforms the back propagation neural network and Gaussian mixture models. In general, the longer the speech files, the greater the improvement in accuracy.
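To make the enrolment-and-scoring idea concrete, below is a minimal Python sketch of a score-based vowel formant matcher. It is an illustration under stated assumptions, not the authors' implementation: the vowel F1/F2 ranges, the 50 Hz matching tolerance, and the names (VOWEL_RANGES, keep_vowel_formants, score, identify) are all hypothetical choices.

```python
import numpy as np

# Illustrative F1/F2 ranges (Hz) used to keep only vowel-like formants.
# These bounds are assumptions, not values taken from the paper.
VOWEL_RANGES = {"F1": (250.0, 1000.0), "F2": (600.0, 3000.0)}

def keep_vowel_formants(formants_hz):
    """Filter raw (F1, F2) candidates to those inside the vowel ranges."""
    lo1, hi1 = VOWEL_RANGES["F1"]
    lo2, hi2 = VOWEL_RANGES["F2"]
    return [(f1, f2) for f1, f2 in formants_hz
            if lo1 <= f1 <= hi1 and lo2 <= f2 <= hi2]

def score(test_formants, saved_formants, tolerance_hz=50.0):
    """Count test formant pairs that match some saved pair within tolerance."""
    hits = 0
    for f1, f2 in test_formants:
        for s1, s2 in saved_formants:
            if abs(f1 - s1) <= tolerance_hz and abs(f2 - s2) <= tolerance_hz:
                hits += 1
                break
    return hits

def identify(test_formants, enrolled):
    """Return the enrolled speaker with the highest matching score."""
    test = keep_vowel_formants(test_formants)
    return max(enrolled, key=lambda spk: score(test, enrolled[spk]))

# Usage: each speaker is stored as a handful of (F1, F2) pairs.
enrolled = {"alice": [(310.0, 2200.0), (700.0, 1100.0)],
            "bob":   [(400.0, 1900.0), (650.0, 950.0)]}
print(identify([(305.0, 2210.0)], enrolled))  # -> "alice"
```

Because each speaker is stored only as a small set of formant pairs, the per-speaker footprint stays tiny, which is consistent with the sub-100-byte storage claim above.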

Highlights

  • Speaker Recognition [1] comprises Speaker Identification – identifying the speaker of the current utterance – and Speaker Verification – verifying from the utterance whether the speaker is who he claims to be

  • We compare the score-based scheme proposed in this paper against the Back Propagation Neural Network (BPNN) and the Gaussian Mixture Model (GMM) on four databases: YOHO, NIST, TI_digits1 and TI_digits2

  • We present the results for all four databases and compare our proposed scheme to both BPNN and the GMM-universal background model (GMM-UBM)

Introduction

The term Speaker Recognition [1] consists of Speaker Identification – the identification of the speaker speaking the current utterance – and Speaker Verification – the verification from the utterance of whether the speaker is who he claims to be. The current approach is aimed at Text-independent Speaker Identification. The analogue speech signal is first sampled, and each sample is represented by one or more bytes (e.g. one byte for a 256-level quantisation). The resulting digitised discrete-time signal contains the frequency content of the audio. It must be pre-processed to extract feature vectors that capture speaker-specific information regardless of the content of the speech itself. A learning algorithm generalises these feature vectors across speakers during training and determines the speaker identity of a test signal during the test phase.
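As a sketch of such a front end (assuming the librosa library, a 16 kHz sampling rate and a 25 ms analysis window; the function name frame_features and all parameter values are illustrative, not taken from the paper):

```python
import numpy as np
import librosa  # assumed dependency; any MFCC/LPC implementation would do

def frame_features(path, n_mfcc=13, lpc_order=12):
    """Digitise a recording, window it, and extract MFCC and LPC-based formants."""
    y, sr = librosa.load(path, sr=16000)  # discrete-time signal at 16 kHz
    # MFCC features over the whole signal (librosa windows it internally).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # LPC analysis on a single 25 ms Hamming window.
    n = int(0.025 * sr)
    frame = y[:n] * np.hamming(n)
    a = librosa.lpc(frame, order=lpc_order)
    # Formant candidates are the frequencies of the LPC polynomial roots
    # lying in the upper half-plane.
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    formants_hz = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)
    return mfcc, formants_hz
```

Estimating formants from the angles of the LPC polynomial roots is a standard technique; the filtering step described in the abstract would then keep only those frequencies that fall inside known vowel formant ranges.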
