Abstract
Speaker recognition deals with recognizing speakers from their speech. Most speaker recognition systems are built in two stages: the first extracts low-dimensional embeddings from speech, and the second performs the classification task. The robustness of a speaker recognition system depends mainly on the embedding extraction process, which is typically pre-trained on a large-scale dataset. Because the embedding systems are pre-trained, the performance of speaker recognition models depends heavily on the domain adaptation policy, and may degrade if the model is adapted with inadequate data. This paper introduces a speaker recognition strategy for unlabeled data, which generates clusterable embedding vectors from small fixed-size speech frames. The unsupervised training strategy rests on the assumption that a small speech segment contains a single speaker. Based on this assumption, pairwise constraints are constructed with noise augmentation policies and used to train the AutoEmbedder architecture, which generates speaker embeddings. Without relying on a domain adaptation policy, the process produces clusterable speaker embeddings in an unsupervised manner, termed unsupervised vectors (u-vectors). The evaluation is conducted on two popular English speaker recognition datasets, TIMIT and LibriSpeech. A Bengali dataset is also included to illustrate the diversity of domain shifts for speaker recognition systems. Finally, we conclude that the proposed approach achieves satisfactory performance using pairwise architectures.
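The pairwise-constraint idea in the abstract can be sketched in a few lines: under the paper's assumption that a short speech segment contains a single speaker, two frames cut from the same utterance form a positive (must-link) pair, while frames drawn from different utterances are treated as a negative (cannot-link) pair. The function and variable names below are illustrative, not from the paper, and random noise arrays stand in for real waveforms.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pairwise_constraints(utterances, frame_len):
    """Build must-link / cannot-link frame pairs from unlabeled utterances.

    Assumption (the paper's premise): a small speech segment includes a
    single speaker, so two fixed-size frames from the same utterance are
    a must-link pair; frames from different utterances are assumed to be
    a cannot-link pair.
    """
    positives, negatives = [], []
    for i, utt in enumerate(utterances):
        # two random fixed-size frames from the same utterance -> must-link
        a = rng.integers(0, len(utt) - frame_len)
        b = rng.integers(0, len(utt) - frame_len)
        positives.append((utt[a:a + frame_len], utt[b:b + frame_len]))
        # a frame from another utterance -> assumed cannot-link
        other = utterances[(i + 1) % len(utterances)]
        c = rng.integers(0, len(other) - frame_len)
        negatives.append((utt[a:a + frame_len], other[c:c + frame_len]))
    return positives, negatives

# toy 1-D "waveforms" standing in for speech
utts = [rng.standard_normal(1600) for _ in range(4)]
pos, neg = make_pairwise_constraints(utts, frame_len=400)
print(len(pos), len(neg))  # 4 4
```

In the paper's setting these pairs (optionally noise-augmented) would supervise a Siamese-style embedding network such as AutoEmbedder; the sketch only shows how constraints can be mined without speaker labels.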
Highlights
Speech is the most engaging and acceptable form of communication among people
We extend our investigation towards finding such a deep learning (DL) training strategy
We provide a detailed evaluation of embedding accuracy based on various levels of cluster impurities
Summary
Speaker recognition has been a topic of interest over the past decades, and various systems have been proposed to solve the challenge. The deep vector (d-vector) [14] is an implementation of speech frame embeddings using deep neural networks (DNNs). In the upgraded version, the mechanism is split into two parts: a DNN that extracts embeddings and a separately trained classifier that classifies speakers. The limitation of these studies is that most of them require massive labelled data in the training procedure. As DNN architectures depend on the amount of training data, an improved strategy over the d-vector, named x-vector [15], has been proposed. The concept of unsupervision still depends on a large set of training data. Both d-vectors and x-vectors rely directly on the domain adaptation [30] policy of neural network architectures. The proposed method instead utilizes the automated feature extraction of neural networks.
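The two-stage pipeline described above (an embedding extractor followed by a separately trained classifier) can be illustrated with a deliberately simplified stand-in: mean-pooling plays the role of the DNN extractor, and a nearest-centroid rule plays the role of the classifier. All names and the toy data below are assumptions for illustration, not the d-vector or x-vector implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def extract_embedding(frames):
    # Stage one stand-in: pool frame-level features into one
    # fixed-size utterance embedding (a real system uses a DNN).
    return frames.mean(axis=0)

# toy data: 3 "speakers", each with frame features shifted by a speaker offset
train = {s: rng.standard_normal((20, 8)) + 3 * s for s in range(3)}
centroids = {s: extract_embedding(f) for s, f in train.items()}

def classify(frames):
    # Stage two stand-in: a separately trained classifier,
    # here a nearest-centroid decision in embedding space.
    emb = extract_embedding(frames)
    return min(centroids, key=lambda s: np.linalg.norm(emb - centroids[s]))

test_frames = rng.standard_normal((20, 8)) + 3 * 2  # frames from "speaker 2"
print(classify(test_frames))
```

The point of the split is that the extractor can be trained once on large data while the lightweight classifier is retrained per task; the paper's u-vectors target the case where the extractor itself must be trained without labels.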