Abstract

In October 2020, Google researchers presented a promising Deep Learning architecture paradigm for Computer Vision that outperforms the then-standard Convolutional Neural Networks (CNNs) on multiple state-of-the-art image recognition benchmarks: Vision Transformers (ViTs). Based on the self-attention concept inherited from Natural Language Processing (NLP), this new architecture surpasses CNNs in image classification on ImageNet, CIFAR-100, and VTAB, among others, when it is fine-tuned (Transfer Learning) after pre-training on larger datasets. In this work, we confirm this finding and move one step beyond the CNN architectures applied to Vascular Biometric Recognition (VBR): to the best of our knowledge, we introduce for the first time multiple pure pre-trained and fine-tuned Vision Transformers in this evolving biometric modality, to address the challenge of the limited number of samples in VBR datasets. For this purpose, the ViTs were pre-trained on ImageNet-1k and ImageNet-21k to learn general image features and then fine-tuned for the four main existing VBR variants, i.e., finger, palm, hand dorsal, and wrist vein areas. Fourteen existing vascular datasets were used to perform the vein identification task in these four modalities, evaluated with the True-Positive Identification Rate (TPIR) on 75-25% train-test splits, obtaining the following results: HKPU (99.52%) and FV-USM (99.1%) for finger; Vera (99.39%) and CASIA (96.00%) for palm; Bosphorus (99.86%) for hand dorsal; PUT-wrist (99.67%) and UC3M-CV1+CV2 (99.67%) for wrist. Furthermore, we introduce UC3M-CV3: a hygienic, contactless wrist vein database collected with smartphones, consisting of 5,400 images from 100 different subjects. The promising results show the Vision Transformer's versatility in VBR under Transfer Learning and reinforce this new Neural Network architecture paradigm.
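
For readers who want a concrete picture of the transfer-learning setup described above, the sketch below fine-tunes a ViT pre-trained on ImageNet-21k for closed-set subject identification. It is a minimal illustration assuming the timm and torchvision libraries; the dataset path, folder layout, model variant, and hyperparameters are placeholder assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch: fine-tuning an ImageNet-21k pre-trained ViT for vein
# identification. Dataset path and hyperparameters are illustrative.
import torch
import timm
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Pre-processing: ViT-B/16 expects 224x224 inputs; in21k timm weights
# were trained with 0.5 mean/std normalization.
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])

# Hypothetical folder layout: one sub-directory of vein images per subject,
# so each subject becomes one class for the identification task.
train_set = datasets.ImageFolder("vein_dataset/train", transform=tfm)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

# Load a ViT pre-trained on ImageNet-21k (model name follows timm >= 0.9
# conventions) and replace the classification head with one output per
# enrolled subject.
model = timm.create_model(
    "vit_base_patch16_224.augreg_in21k",
    pretrained=True,
    num_classes=len(train_set.classes),
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

# Standard full fine-tuning loop (transfer learning): all pre-trained
# weights are updated on the small vascular dataset.
model.train()
for epoch in range(10):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

At test time, rank-1 accuracy over the held-out 25% split would correspond to the TPIR figure reported in the abstract.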
