Abstract

Statistical analysis of speech is an emerging area of machine learning. In this paper, we tackle the biometric challenge in Automatic Speaker Verification (ASV) of differentiating between samples generated by two distinct populations of utterances: those produced by an authentic human voice and those generated synthetically. From a statistical perspective, solving this problem requires defining a decision rule and a learning procedure to identify the optimal classifier. Classical state-of-the-art countermeasures rely on strong assumptions, such as stationarity or local stationarity of speech, that rarely hold in practice. We therefore explore a robust non-linear, non-stationary signal decomposition method known as the Empirical Mode Decomposition, combined in a novel fashion with Mel-Frequency Cepstral Coefficients and a refined classification technique, the multi-kernel Support Vector Machine. We undertake extensive real-data case studies covering multiple ASV systems and datasets, including the ASVspoof 2019 challenge database. The results demonstrate the advantage of our feature extraction and classification approach over existing conventional methods in reducing the threat of cyber-attacks perpetrated by synthetic voice replication seeking unauthorised access.
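The multi-kernel Support Vector Machine mentioned above rests on the fact that a convex combination of valid kernels is itself a valid kernel. A minimal NumPy sketch of building such a combined Gram matrix (the toy data, kernel choices, and fixed weights are illustrative, not the paper's learned configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))  # toy feature matrix (e.g. per-utterance feature vectors)

def linear_kernel(X):
    return X @ X.T

def rbf_kernel(X, gamma=0.5):
    # Pairwise squared Euclidean distances, then Gaussian kernel.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

# Convex combination of base kernels; the weights here are fixed for
# illustration, whereas multiple-kernel learning would optimise them.
weights = [0.6, 0.4]
K = weights[0] * linear_kernel(X) + weights[1] * rbf_kernel(X)

# A convex combination of positive semi-definite Gram matrices is itself
# PSD, so K is a valid kernel and can be fed to an SVM solver that
# accepts a precomputed kernel matrix.
eigvals = np.linalg.eigvalsh(K)
```

The combined matrix `K` can then be passed to any SVM implementation that supports precomputed kernels.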

Highlights

  • The prevalence of biometric authentication systems is increasing across data access points in smart devices and remote data-access settings

  • We present a novel machine learning solution for Automatic Speaker Verification (ASV) designed around feature extraction for speech signals, and we address the challenge of biometric cyber-attack mitigation by seeking to detect when data access is attempted through deep-fake artificial speech generation rather than by a human speaker

  • We avoid the problem of mode-mixing by first performing the Empirical Mode Decomposition (EMD) of the speech signal, and we then study the Mel-Frequency Cepstral Coefficients (MFCCs) of each Intrinsic Mode Function (IMF) basis

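The EMD step in the highlight above extracts IMFs by iterative sifting. A minimal sketch (not the authors' implementation) using NumPy and SciPy cubic-spline envelopes; the stopping tolerance, iteration cap, and number of IMFs are illustrative choices:

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_imf(x, max_iter=50, tol=0.05):
    """Extract one Intrinsic Mode Function (IMF) from x by sifting."""
    t = np.arange(len(x))
    h = x.copy()
    for _ in range(max_iter):
        maxima = argrelextrema(h, np.greater)[0]
        minima = argrelextrema(h, np.less)[0]
        if len(maxima) < 3 or len(minima) < 3:
            break  # too few extrema: h is a residual trend, not an IMF
        upper = CubicSpline(maxima, h[maxima])(t)  # upper envelope
        lower = CubicSpline(minima, h[minima])(t)  # lower envelope
        mean = (upper + lower) / 2.0
        h, prev = h - mean, h
        # stop once the local mean is negligible relative to the signal
        if np.sum(mean**2) / (np.sum(prev**2) + 1e-12) < tol:
            break
    return h

def emd(x, n_imfs=4):
    """Decompose x into n_imfs IMFs plus a final residual."""
    imfs, residual = [], x.astype(float).copy()
    for _ in range(n_imfs):
        imf = sift_imf(residual)
        imfs.append(imf)
        residual = residual - imf
    return imfs, residual

# Toy usage: a two-tone signal; the first IMF should track the fast tone.
fs = 1000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 5 * t) + np.sin(2 * np.pi * 50 * t)
imfs, res = emd(x, n_imfs=2)
```

Each returned IMF can then be treated as a signal in its own right for downstream cepstral feature extraction.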

Summary

INTRODUCTION

The prevalence of biometric authentication systems is increasing in many data access points in smart devices and remote data-access settings. The third contribution is a performance comparison between benchmark ASV features extracted from the raw data and from the EMD basis functions, highlighting that speech is highly non-stationary and that many situations generate adverse environments requiring an adaptive method driven by the given data. We demonstrate that such basis functions are more amenable to classical speech feature extraction methods in the transformed cepstral domain. This allows us to develop new approaches to EMD-Mel-cepstral speech signature characterisation, which we demonstrate are highly effective in capturing individual speakers' vocal-tract specificities that arise as a speaker's glottal airflow is shaped by the vocal-tract filter it passes through to produce speech. The EMD basis functions accommodate non-stationarity of any level and so produce more robust features.
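The cepstral-domain feature extraction referred to above can be sketched end to end with NumPy/SciPy alone: frame the signal, window each frame, take the power spectrum, pool it through a mel filterbank, and apply a DCT to the log energies. The frame sizes, 26 mel bands, and 13 retained coefficients below are common defaults, not the paper's settings:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

def mfcc(signal, fs, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    """Frame -> window -> power spectrum -> mel energies -> log -> DCT."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hamming(frame_len)
    fb = mel_filterbank(n_filters, n_fft, fs)
    feats = []
    for k in range(n_frames):
        frame = signal[k * hop:k * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
        energies = np.maximum(fb @ power, 1e-10)  # floor to avoid log(0)
        feats.append(dct(np.log(energies), type=2, norm='ortho')[:n_ceps])
    return np.array(feats)  # shape: (n_frames, n_ceps)

# Usage: MFCCs of a synthetic IMF-like chirp sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
imf = np.sin(2 * np.pi * (200 + 300 * t) * t)
C = mfcc(imf, fs)
```

Applying this routine to each IMF rather than to the raw waveform yields the per-basis cepstral features the comparison above contrasts with benchmark ASV features.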

NOTATION
EMPIRICAL MODE DECOMPOSITION FOR NON-STATIONARY FEATURES
OBTAINING INSTANTANEOUS FREQUENCY FROM IMF BASIS FUNCTIONS
INTERPRETING EMD BASIS DECOMPOSITION
CLASSES OF FEATURES TO REPRESENT SPEECH SIGNALS
CLASSIFICATION FRAMEWORK
INTERPRETATION THROUGH REGULARISED LOSS FUNCTION
INTERPRETATION THROUGH BAYESIAN DECISION
EXPERIMENTAL SET-UP
Comparable Dataset
EXPERIMENT ONE
EXPERIMENT TWO
EXPERIMENT THREE
Evaluation
Findings
DISCUSSION AND CONCLUSIONS