Abstract

Automatic speaker verification (ASV) has achieved significant progress in recent years. However, it remains very challenging to generalize ASV technologies to new, unknown, and spoofing conditions. Most previous studies focused on extracting speaker information from natural speech. This paper addresses speaker verification from another perspective: speaker identity information is exploited from singing speech. We first designed and released a new corpus for speaker verification based on singing and normal reading speech. Then, speaker discrimination was compared and analyzed between natural and singing speech in different feature spaces. Furthermore, the conventional Gaussian mixture model (GMM), dynamic time warping (DTW), and a state-of-the-art deep neural network (DNN) were investigated and used to build text-dependent ASV systems under different training-test conditions. Experimental results show that the voiceprint information in singing speech is more distinguishable than that in normal speech. A relative reduction in equal error rate of more than 20% was obtained on both the gender-dependent and gender-independent 1 s-1 s evaluation tasks.
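Since the abstract names the Gaussian mixture model among the systems investigated, the following is a minimal sketch of GMM-based verification scoring, assuming frame-level MFCC features as NumPy arrays and scikit-learn's `GaussianMixture`. The function names and the use of a background model are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal GMM-based speaker verification sketch (illustrative; not the
# paper's implementation). Features are MFCC matrices of shape
# (n_frames, n_coeffs).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmm(enroll_feats: np.ndarray,
                      n_components: int = 16) -> GaussianMixture:
    """Fit a diagonal-covariance GMM on a speaker's enrollment features."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=200, random_state=0)
    gmm.fit(enroll_feats)
    return gmm

def verification_score(speaker_gmm: GaussianMixture,
                       background_gmm: GaussianMixture,
                       test_feats: np.ndarray) -> float:
    """Average log-likelihood ratio of the claimed speaker's model
    against a background model; higher means 'accept'."""
    return speaker_gmm.score(test_feats) - background_gmm.score(test_feats)
```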

Highlights

  • Automatic speaker verification (ASV) is the verification of a speaker’s identity based on his/her speech signals [1]

  • We can observe that the overlap of orange dots and blue circles in the left subfigure is less than that in the right subfigure. This indicates that speaker discrimination in the Mel-frequency cepstral coefficient (MFCC) feature space of singing speech is greater than that in the reading-speech feature space (a small visualization sketch follows this list)

  • The performances are reported in terms of equal error rate (EER) [1], a verification error measure that gives the accuracy at the decision threshold for which the probabilities of false rejection and false acceptance are equal (a worked computation is sketched below)
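To make the MFCC-space comparison in the second highlight concrete, here is a sketch of how two speakers' frame-level MFCCs could be scattered against each other, assuming librosa for feature extraction and matplotlib for plotting; the file paths are hypothetical placeholders.

```python
# Visualizing speaker discrimination in MFCC space (illustrative sketch).
import librosa
import matplotlib.pyplot as plt

def mfcc_frames(path: str, n_mfcc: int = 13):
    """Load a wav file and return its frame-level MFCCs (frames x coeffs)."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Two speakers, same speaking style (e.g., both singing or both reading).
spk_a = mfcc_frames("speaker_a_singing.wav")  # hypothetical path
spk_b = mfcc_frames("speaker_b_singing.wav")  # hypothetical path

# Scatter the first two coefficients: less overlap between the two point
# clouds suggests better speaker discrimination in this feature space.
plt.scatter(spk_a[:, 0], spk_a[:, 1], marker="o", label="speaker A")
plt.scatter(spk_b[:, 0], spk_b[:, 1], marker="x", label="speaker B")
plt.xlabel("MFCC 1")
plt.ylabel("MFCC 2")
plt.legend()
plt.show()
```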
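The EER definition in the third highlight can be computed from trial scores as follows; this is the standard ROC-based construction, not code taken from the paper's repository.

```python
# EER: the error rate at the threshold where the false-acceptance rate
# (FPR) equals the false-rejection rate (1 - TPR).
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """labels: 1 for target (same-speaker) trials, 0 for impostor trials."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # point where the two curves cross
    return (fpr[idx] + fnr[idx]) / 2.0
```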


Summary

Introduction

Automatic speaker verification (ASV) is the verification of a speaker’s identity based on his/her speech signals [1]. We have not found any previous work that examines and compares speaker verification performance between natural Mandarin reading speech and singing speech. The key difference between this work and previous studies is that we focus on examining and comparing the effectiveness of using normal Mandarin reading speech and its corresponding singing speech for short-time text-dependent speaker verification. We designed a new corpus for short-time text-dependent ASV experiments, released it on the Zenodo website (https://zenodo.org/record/3241566), and put our implementation code in a GitHub repository (https://github.com/Moonmore/Speaker-Verification) for public research. Based on this corpus, we performed text-dependent (TD) ASV comparison experiments using either the natural speech or the singing speech, or both. Preliminary results show that the voiceprint information in singing speech is more distinguishable than that in natural reading speech for short-time gender-dependent as well as gender-independent ASV tasks.
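For text-dependent verification with fixed phrases, the DTW system mentioned in the abstract can be realized by aligning enrollment and test feature sequences and thresholding the alignment cost. The sketch below is one minimal way to do this, assuming MFCC matrices of shape (n_frames, n_coeffs); it is an illustrative assumption, not the authors' exact implementation.

```python
# DTW-based text-dependent verification sketch (illustrative).
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic time warping cost between two feature sequences,
    normalized by the combined sequence length."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m] / (n + m)

def accept(enroll: np.ndarray, test: np.ndarray, threshold: float) -> bool:
    """Accept the claimed identity if the alignment cost is below a
    threshold tuned on development data."""
    return dtw_distance(enroll, test) < threshold
```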

Corpus
Speaker Identity Discrimination in Different Feature Space
Pitch Discrimination
MFCC Discrimination
Speaker Verification Systems
Experimental Results
Conclusions