Abstract

In this study, a speaker identification system is considered, consisting of a feature extraction stage that utilizes both power normalized cepstral coefficients (PNCCs) and Mel frequency cepstral coefficients (MFCCs). Normalization is applied by employing cepstral mean and variance normalization (CMVN) and feature warping (FW), together with acoustic modeling using a Gaussian mixture model-universal background model (GMM-UBM). The main contributions are comprehensive evaluations of the effect of both additive white Gaussian noise (AWGN) and non-stationary noise (NSN), with and without a G.712 type handset, upon identification performance. In particular, three NSN types with varying signal-to-noise ratios (SNRs) were tested, corresponding to street traffic, a bus interior, and a crowded talking environment. The performance evaluation also considered the effect of late fusion techniques based on score fusion, namely mean, maximum, and linear weighted sum fusion. The databases employed were TIMIT, SITW, and NIST 2008; 120 speakers were selected from each database to yield 3600 speech utterances. The study recommends mean fusion, which yields the best overall speaker identification accuracy (SIA) with noisy speech, whereas linear weighted sum fusion is best overall for the original database recordings.
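The CMVN step mentioned in the abstract can be sketched in a few lines: each cepstral coefficient dimension is shifted to zero mean and scaled to unit variance across the utterance, which suppresses slowly varying convolutive channel effects such as handset distortion. This is a minimal illustrative sketch, not the authors' implementation; the function name and array shapes are assumptions.

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Cepstral mean and variance normalization (sketch).

    features: (n_frames, n_coeffs) matrix of cepstral features
    (e.g. MFCCs or PNCCs) for one utterance. Each coefficient
    dimension is normalized to zero mean and unit variance
    across the frames of the utterance.
    """
    mu = features.mean(axis=0)          # per-coefficient mean
    sigma = features.std(axis=0)        # per-coefficient std. dev.
    return (features - mu) / (sigma + eps)
```

Feature warping (FW), the other normalization the study uses, goes further by mapping each coefficient's short-term distribution onto a standard normal, but the zero-mean/unit-variance property above is the core idea shared by both.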

Highlights

  • Speaker identification is one important application of biometrics and forensics to identify speakers based on their unique voice pattern [1,2,3]

  • For NIST 2008 without handset and noise, Table 2 shows the relationship between speaker identification accuracy (SIA) and the number of Gaussian mixture components (GMCs) for the three databases, according to feature combinations based on Mel frequency cepstral coefficients (MFCC) and power normalized cepstral coefficients (PNCCs), with various fusion schemes considered

  • The National Institute of Standards and Technology (NIST) 2008 database had the lowest SIA among all databases at 30 dB, with 26.67%; this was reduced to 3.33% at 10 dB. As such, all databases were affected by stationary noise, which has a constant spectral profile

Summary

Introduction

Speaker identification is an important application of biometrics and forensics, identifying speakers based on their unique voice patterns [1,2,3]. In [16], both the NIST 2008 and TIMIT databases were employed to achieve robust speaker identification and mitigate room reverberation and additive noise, but again handset effects were ignored. Various neural network-based approaches were proposed in [18], without considering different noise and handset conditions; increasing the number of speakers reduced the recognition rate, and there was no testing under realistic noise and channel distortion conditions. In this work we extend our previous work in [28,29] with four combinations of features and their score fusion methods for the original recordings, and with AWGN and three types of NSN (street traffic, bus interior, and crowd talk), with and without the G.712 type handset at 16 kHz, to provide a wide range of environmental noise conditions.
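The three late (score-level) fusion methods evaluated in the study can be sketched as follows: each feature stream (MFCC-based and PNCC-based) produces one score per enrolled speaker, the two score vectors are combined by mean, maximum, or linear weighted sum, and the speaker with the highest fused score is selected. This is an illustrative sketch under assumed names and shapes, not the authors' code, and the weight parameter `w` is a hypothetical placeholder.

```python
import numpy as np

def fuse_scores(mfcc_scores, pncc_scores, method="mean", w=0.5):
    """Late fusion of per-speaker scores from two feature streams.

    mfcc_scores, pncc_scores: length-n_speakers arrays, one score per
    enrolled speaker from each stream. Returns the index of the
    speaker with the highest fused score.
    """
    s1 = np.asarray(mfcc_scores, dtype=float)
    s2 = np.asarray(pncc_scores, dtype=float)
    if method == "mean":
        fused = (s1 + s2) / 2.0                 # mean fusion
    elif method == "max":
        fused = np.maximum(s1, s2)              # maximum fusion
    elif method == "weighted":
        fused = w * s1 + (1.0 - w) * s2         # linear weighted sum
    else:
        raise ValueError(f"unknown fusion method: {method}")
    return int(np.argmax(fused))
```

For example, with MFCC scores `[0.2, 0.9, 0.1]` and PNCC scores `[0.3, 0.4, 0.8]`, mean and maximum fusion both pick speaker 1, while a weighted sum that favors the PNCC stream (`w=0.2`) picks speaker 2, illustrating how the choice of fusion rule can change the decision.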

An overview of a robust biometric speaker identification system
Fusion strategies
Databases and simulation setups
Methods
Quantitative perspective for noise and handset effects in part B
Related works based on the proposed speaker identification system
Conclusions