Abstract

This paper is concerned with the detection of presentation attacks against unsupervised remote biometric speaker verification, using a well-known challenge–response scheme. We propose a novel approach to convolutional phoneme classifier training, which ensures high phoneme recognition accuracy even for significantly simplified network architectures, thus enabling efficient utterance verification on resource-limited hardware, such as mobile phones or embedded devices. We consider Deep Convolutional Neural Networks operating on windows of speech Mel-spectrograms as a means for phoneme recognition, and we show that one can boost the performance of highly simplified neural architectures by modifying the principle underlying training set construction. Instead of generating training examples by slicing spectrograms with a sliding window, as is commonly done, we propose to maximize the consistency of the phoneme-related spectrogram structures to be learned, by choosing only spectrogram chunks from the central regions of phoneme articulation intervals. This approach enables better utilization of the limited capacity of the considered simplified networks, as it significantly reduces within-class data scatter. We show that neural architectures comprising as few as tens of thousands of parameters can solve the 39-phoneme recognition task with an accuracy of up to 76% (we use the English-language TIMIT database for experimental verification of the method). We also show that ensembling simple classifiers using a basic bagging method boosts recognition accuracy by another 2–3%, offering Phoneme Error Rates at the level of 23%, which approaches the accuracy of state-of-the-art deep neural architectures that are one to two orders of magnitude more complex than the proposed solution. This, in turn, enables reliable presentation attack detection, based on challenges just a few syllables long, on highly resource-limited computing hardware.
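
To make the central-window training-set construction concrete, below is a minimal sketch (not the authors' published code) of how such examples could be cut from aligned speech. It assumes a Mel-spectrogram stored as an (n_mels, n_frames) array and TIMIT-style phoneme alignments given as (start_frame, end_frame, label) tuples; the 15-frame window width is an illustrative assumption.

```python
import numpy as np

def central_window_examples(mel_spec, alignments, width=15):
    """Cut one fixed-width chunk from the centre of each phoneme interval.

    mel_spec:    (n_mels, n_frames) Mel-spectrogram array
    alignments:  iterable of (start_frame, end_frame, phoneme_label),
                 e.g. converted from TIMIT .PHN files (assumed format)
    """
    half = width // 2
    chunks, labels = [], []
    for start, end, label in alignments:
        center = (start + end) // 2            # midpoint of the articulation interval
        lo, hi = center - half, center - half + width
        if lo < 0 or hi > mel_spec.shape[1]:   # skip chunks spilling past the edges
            continue
        chunks.append(mel_spec[:, lo:hi])      # one consistent chunk per phoneme
        labels.append(label)
    return np.stack(chunks), np.array(labels)
```

In contrast to sliding-window slicing, each phoneme contributes a single, maximally consistent example, which is what reduces the within-class scatter mentioned above.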

Highlights

  • Remote biometric user verification is becoming the predominant access control technology, owing to the widespread use of mobile devices and the drive to develop convenient yet reliable ways of securing access to resources and services [1]

  • We show that the proposed presentation attack detection (PAD) algorithm, which uses compact Convolutional Neural Network (CNN) classifiers trained with the central-window scheme, achieves a low Phoneme Error Rate (PER) of 23% (a sketch of the bagging step used for this result follows this list)

  • Having derived a resource-friendly yet accurate phoneme recognition algorithm, we show that applying it to the verification of prompted texts enables presentation attack detection based on few-syllable utterances, with over 99% confidence
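
The "basic bagging method" referred to above can be illustrated with the following sketch; the classifier interface (fit/predict_proba, in the style of scikit-learn) and the ensemble size are assumptions, not the paper's exact configuration:

```python
import numpy as np

def train_bagged(make_model, X, y, n_models=5, seed=0):
    """make_model(): returns a fresh compact classifier exposing a
    hypothetical fit/predict_proba interface."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample with replacement
        model = make_model()
        model.fit(X[idx], y[idx])
        models.append(model)
    return models

def predict_bagged(models, X):
    # average the per-model class posteriors, then pick the arg-max class
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return probs.argmax(axis=1)
```

Because each member model stays small, the ensemble's total size can remain one to two orders of magnitude below that of state-of-the-art architectures while recovering part of the accuracy gap.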

Introduction

Remote biometric user verification is becoming the predominant access control technology, owing to the widespread use of mobile devices and the drive to develop convenient yet reliable ways of securing access to resources and services [1]. A multitude of biometric traits have been successfully used for identity resolution from data captured by mobile device cameras (face appearance, palm shape and papillary ridges, ear shape) and microphones (voice) [2]. Both sources of information can be used in a complementary, multi-modal recognition scheme, with the significance of the individual sources weighted by input data quality. A natural means of presentation attack detection (PAD) for the voice modality, i.e., in a speaker authentication context, is a Challenge–Response (CR) scheme that attempts to validate the uttering of system-prompted phrases [3]. Utterances to be generated should be short, to make liveness detection fast and unobtrusive, and hard to predict, to resist replay attacks; random syllable sequences therefore make good candidates for a challenge.
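
As an illustration of such a CR check (a hypothetical sketch, not the paper's exact protocol), the verifier can prompt a random syllable sequence, expand it into the expected phoneme string, and accept the response only if the phoneme sequence returned by the recognizer is close enough in edit distance. The toy syllable inventory and acceptance threshold below are assumptions:

```python
import random

# Assumed toy syllable inventory mapped to ARPAbet-style phonemes;
# a real system would use a much larger inventory.
SYLLABLES = {"ba": ["b", "aa"], "to": ["t", "ow"], "ki": ["k", "iy"], "ne": ["n", "eh"]}

def make_challenge(n_syllables=4, rng=random):
    # hard-to-predict challenge: a fresh random syllable sequence per session
    return rng.choices(list(SYLLABLES), k=n_syllables)

def edit_distance(a, b):
    # standard Levenshtein distance between two phoneme sequences
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[-1][-1]

def verify_response(challenge, recognized_phonemes, max_per=0.3):
    # accept the response as live only if the phoneme error rate against
    # the expected sequence stays below an assumed threshold
    expected = [p for syl in challenge for p in SYLLABLES[syl]]
    per = edit_distance(expected, recognized_phonemes) / len(expected)
    return per <= max_per
```

Because the challenge is drawn at random for every session, a pre-recorded response is overwhelmingly unlikely to match the expected phoneme sequence, which is what defeats replay attacks.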
