Abstract

Automatic Speaker Verification (ASV) technology is increasingly used in end-user applications to secure access to personal data, smart services, and physical infrastructure. Like other biometric technologies, speaker verification is vulnerable to spoofing attacks, in which an attacker impersonates a specific target speaker using impersonation, replay, Text-to-Speech (TTS), or Voice Conversion (VC) techniques to gain unauthorized access to the system. This paper proposes a solution that combines Cochleagrams with Residual Networks (ResNets) to implement the front-end feature extraction phase of an Audio Spoof Detection (ASD) system. The proposed ASD system comprises three main phases: cochleagram generation, feature extraction with dimensionality reduction, and classification. In the first phase, the recorded audio is converted into Cochleagrams using Equivalent Rectangular Bandwidth (ERB)-based gammatone filters. In the next phase, three variants of Residual Networks, ResNet50, ResNet41, and ResNet27, are used in turn to extract dynamic features, yielding 2048, 1024, and 256 features per audio sample, respectively. The features extracted from ResNet50 and ResNet41 are passed to Linear Discriminant Analysis (LDA) for dimensionality reduction. Finally, in the classification phase, the LDA-reduced features are used to train four machine learning classifiers individually: Random Forest, Naïve Bayes, K-Nearest Neighbour (KNN), and eXtreme Gradient Boosting (XGBoost). The proposed work concentrates on synthetic, replay, and deepfake attacks. The state-of-the-art ASVspoof 2019 Logical Access (LA) and Physical Access (PA), Voice Spoofing Detection Corpus (VSDC), and DEepfake CROss-lingual (DECRO) datasets are utilised for training and testing the proposed ASD system. Additionally, we have assessed the performance of the proposed system under additive noise.
Airplane noise at different SNR levels (−5 dB, 0 dB, 5 dB, and 10 dB) was added to the training and testing audio for this purpose. From the obtained results, it can be concluded that the combination of Cochleagram and ResNet50 with the XGBoost classifier outperforms all other implemented systems for detecting fake audio in noisy environments. We also tested the proposed models in an unseen scenario, where they demonstrated reasonable performance.
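To make the cochleagram front end concrete, the sketch below shows a minimal ERB-spaced gammatone filterbank in NumPy. This is an illustrative reconstruction, not the authors' implementation: the filter order, bandwidth factor, channel count, and frame parameters are common textbook defaults (Glasberg and Moore's ERB formula; 4th-order gammatone with b = 1.019), and the actual system's settings may differ.

```python
import numpy as np

def erb(fc):
    # Equivalent Rectangular Bandwidth in Hz (Glasberg & Moore approximation)
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.05, order=4, b=1.019):
    # Impulse response of a gammatone filter centred at fc
    t = np.arange(int(duration * fs)) / fs
    return t ** (order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) * np.cos(2 * np.pi * fc * t)

def cochleagram(signal, fs, n_filters=64, fmin=50.0, frame_len=400, hop=160):
    # Centre frequencies spaced uniformly on the ERB-number scale
    erb_num = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    inv_erb_num = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    cfs = inv_erb_num(np.linspace(erb_num(fmin), erb_num(0.9 * fs / 2), n_filters))
    rows = []
    for fc in cfs:
        y = np.convolve(signal, gammatone_ir(fc, fs), mode="same")
        # frame-wise log energy of each filter's output
        n_frames = 1 + (len(y) - frame_len) // hop
        frames = np.stack([y[i * hop: i * hop + frame_len] for i in range(n_frames)])
        rows.append(np.log(np.mean(frames ** 2, axis=1) + 1e-10))
    return np.stack(rows)  # shape: (n_filters, n_frames)

# Example: one second of a 1 kHz tone at 16 kHz sampling rate
fs = 16000
t = np.arange(fs) / fs
C = cochleagram(np.cos(2 * np.pi * 1000.0 * t), fs)
```

The resulting time-frequency matrix (here 64 channels by 98 frames) is what would then be rendered as an image and fed to the ResNet feature extractor.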