Abstract

Voice-driven devices (VDDs) such as Google Home and Amazon Alexa are well-known connected devices in consumer IoT, with applications in domains such as home appliance automation, next-generation vehicles, and voice banking. However, VDDs that rely on automatic speaker verification (ASV) systems are vulnerable to voice-based logical access (LA) attacks such as text-to-speech (TTS) synthesis and voice conversion. Intruders can exploit these attacks to bypass the security of such systems and gain access to a victim’s bank account or home controls. There is therefore a need for an effective voice spoofing countermeasure that can reliably protect VDDs against such attacks. This work presents a novel audio feature descriptor, the extended local ternary pattern (ELTP), which captures the vocal tract's dynamically induced attributes in bonafide speech and the algorithmic artifacts in synthetic and converted speech. We fuse the ELTP features with linear frequency cepstral coefficients (LFCC) to further strengthen their ability to capture the traits of bonafide and spoofed signals. We use the proposed ELTP-LFCC features to train a deep bidirectional Long Short-Term Memory (DBiLSTM) network that classifies signals as bonafide or spoofed (i.e., TTS synthesis or converted speech). The performance of our countermeasure is measured on the large-scale and diverse ASVspoof 2019 logical access dataset. Experimental results demonstrate that the proposed audio spoofing countermeasure can reliably be used to detect LA spoofing attacks.

Highlights

  • We have witnessed a tremendous evolution in voice biometrics-based user authentication systems in the last few years

  • We present a novel audio feature descriptor, the extended local ternary pattern (ELTP), and fuse it with linear frequency cepstral coefficients (LFCC) to better capture the vocal tract dynamics of bonafide speech and the artifacts of cloning algorithms

  • Performance evaluation on the diverse ASVspoof 2019-LA dataset demonstrates the significance of our system for reliable detection of logical access spoofing attacks

Summary

INTRODUCTION

We have witnessed a tremendous evolution in voice biometrics-based user authentication systems in the last few years. Although these systems are now considered more feasible, they remain susceptible to malicious presentation (spoofing) attacks such as speech synthesis, voice conversion, and replay. The motivation behind this work is to develop an effective feature representation scheme that is robust to such attacks and can reliably detect logical access (LA) attacks in diverse scenarios. To this end, we develop a novel audio feature descriptor named the extended local ternary pattern (ELTP), for which we propose an automated threshold computation approach based on the standard deviation calculated locally for each audio frame.
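The idea of a per-frame, standard-deviation-based threshold can be illustrated with a rough sketch of a one-dimensional local ternary pattern. This is not the paper's ELTP formulation, which this summary does not specify: the function name `eltp_like_histogram`, the scaling factor `alpha`, the neighborhood `radius`, and the histogram encoding are all assumptions chosen for illustration.

```python
import numpy as np


def eltp_like_histogram(frame, alpha=0.5, radius=4):
    """Illustrative local ternary pattern histogram for one audio frame.

    Assumptions (not taken from the paper): the threshold is
    alpha * std(frame), each sample is compared with `radius` neighbors
    on each side, and the ternary code is split into an "upper" and a
    "lower" binary pattern that are histogrammed separately.
    """
    frame = np.asarray(frame, dtype=np.float64)
    width = 2 * radius + 1
    if frame.size < width:
        raise ValueError("frame is shorter than one neighborhood window")

    # Threshold computed locally from this frame's standard deviation.
    threshold = alpha * frame.std()

    # One row per window: [left neighbors, center, right neighbors].
    windows = np.lib.stride_tricks.sliding_window_view(frame, width)
    centers = windows[:, radius:radius + 1]
    neighbors = np.delete(windows, radius, axis=1)

    # Ternary comparison split into two binary patterns.
    upper = (neighbors >= centers + threshold).astype(np.int64)
    lower = (neighbors <= centers - threshold).astype(np.int64)

    # Encode each binary pattern as an integer code.
    weights = 1 << np.arange(2 * radius)
    upper_codes = upper @ weights
    lower_codes = lower @ weights

    # Concatenate the two normalized code histograms into one vector.
    n_bins = 1 << (2 * radius)
    hist = np.concatenate([
        np.bincount(upper_codes, minlength=n_bins),
        np.bincount(lower_codes, minlength=n_bins),
    ]).astype(np.float64)
    return hist / hist.sum()
```

With the default `radius=4`, each frame yields a 512-dimensional vector. One simple way to fuse it with LFCC features, again an assumption rather than the paper's stated procedure, would be to concatenate it with the LFCC coefficients of the same frame before passing the sequence to the classifier.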

RELATED WORK
GMM Supervector Linear
Evaluation
PERFORMANCE EVALUATION OF ELTP AND LFCC FEATURES
PERFORMANCE COMPARISON AGAINST EXISTING LA SPOOFING DETECTION METHODS
Proposed Method
Findings
CONCLUSION