Abstract
Environmental noise and reverberation severely degrade the performance of forensic speaker verification, so robust feature extraction plays an important role in improving it. This paper investigates the effectiveness of combining mel frequency cepstral coefficients (MFCCs) with MFCCs extracted from the discrete wavelet transform (DWT) of the speech (DWT-MFCC), with and without feature warping, for improving modern identity-vector (i-vector)-based speaker verification in the presence of noise and reverberation. The performance of i-vector speaker verification was evaluated using six feature extraction techniques: MFCC, feature-warped MFCC, DWT-MFCC, feature-warped DWT-MFCC, a fusion of DWT-MFCC and MFCC features, and a fusion of feature-warped DWT-MFCC and feature-warped MFCC features. The evaluation used the Australian Forensic Voice Comparison and QUT-NOISE databases under three conditions: noise only, reverberation only, and combined noise and reverberation. Our results indicate that the fusion of feature-warped DWT-MFCC and feature-warped MFCC outperforms the other feature extraction techniques in the presence of environmental noise at the majority of signal-to-noise ratios (SNRs), under reverberation, and under combined noise and reverberation. At 0-dB SNR, the fusion of feature-warped DWT-MFCC and feature-warped MFCC achieves a reduction in average equal error rate of 21.33%, 20.00%, and 13.28% over feature-warped MFCC in the presence of various types of environmental noise only, reverberation, and combined noise and reverberation, respectively. The approach can be used to improve the performance of forensic speaker verification, and it may be utilized for preparing legal evidence in court.
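To make the two operations named in the abstract more concrete, the following is a minimal sketch of feature warping and feature fusion. It rests on assumptions that the abstract does not state: feature warping is taken to be the usual rank-based mapping of each coefficient to a standard normal distribution over a sliding window, fusion is taken to be frame-wise concatenation of the two feature streams, and the function names `feature_warp` and `fuse` as well as the 301-frame default window are hypothetical choices for illustration only.

```python
from statistics import NormalDist

import numpy as np


def feature_warp(features, window=301):
    """Warp each coefficient toward a standard normal over a sliding window.

    features: array of shape (num_frames, num_coeffs), e.g. MFCC frames.
    window:   assumed window length in frames (hypothetical default).
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    half = window // 2
    inv_cdf = NormalDist().inv_cdf  # inverse CDF of the standard normal
    warped = np.empty_like(features)
    for t in range(num_frames):
        # Window centred on frame t, truncated at the utterance edges.
        lo = max(0, t - half)
        hi = min(num_frames, t + half + 1)
        block = features[lo:hi]
        n = block.shape[0]
        # Ascending rank (1 = smallest) of frame t within the window,
        # computed separately for every coefficient.
        ranks = (block < features[t]).sum(axis=0) + 1
        # Mid-point probabilities stay strictly inside (0, 1).
        probs = (ranks - 0.5) / n
        warped[t] = [inv_cdf(p) for p in probs]
    return warped


def fuse(mfcc, dwt_mfcc):
    """Assumed fusion: concatenate the two feature streams frame by frame."""
    return np.concatenate([mfcc, dwt_mfcc], axis=1)
```

With two warped streams of shape `(num_frames, k)` each, `fuse` returns an array of shape `(num_frames, 2 * k)`. The paper's actual window length, DWT settings, and fusion rule may differ from this sketch.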
Highlights
The goal of speaker verification is to accept or reject the identity claim of a speaker by analyzing their speech samples [1], [2]
This section describes the effect of fusing mel frequency cepstral coefficient (MFCC) and discrete wavelet transform (DWT)-MFCC features, with and without feature warping, on speaker verification performance under noisy, reverberant, and combined noisy and reverberant conditions
The effect of the decomposition level and utterance duration on the performance of speaker verification systems based on the fusion of feature-warped MFCC and DWT-MFCC will be described
Summary
The goal of speaker verification is to accept or reject the identity claim of a speaker by analyzing their speech samples [1], [2]. Speaker verification can be used in many applications, such as security, access control, and forensics [3]. Lawyers, judges, and law enforcement agencies have sought to use forensic speaker verification when investigating a suspect or confirming a judgment of guilt or innocence [4]. Forensic speaker verification compares speech samples from a suspect (speech trace) with a database of speech samples of known criminals to prepare legal evidence for the court [5]. In real forensic applications, the speech traces provided to the system are often corrupted by various types of environmental noise, such as car and street noise [5]. The performance of speaker verification systems degrades dramatically in the presence of high levels of noise [6], [7].