In the field of face anti-spoofing (FAS), how to extract representative features that distinguish real from spoof faces, and how to train the corresponding deep networks, are two vital issues. In this paper, we propose a simple but effective end-to-end FAS model based on an innovative texture extractor and a depth auxiliary supervision mechanism. In the feature extraction stage, we first design residual gradient convolutions, built on redesigned gradient operators, to extract fine-grained texture features. Texture features are extracted at multiple scales by dividing the texture differences between live and spoof faces into three levels. We then construct a multiscale residual gradient attention (MRGA) module to aggregate representative texture features from these multilevel texture features. By combining the proposed MRGA feature extractor with an existing vision transformer (ViT), we obtain MRGA-ViT, which generates the related semantics and produces the final classification results. In the training stage, we also propose local depth auxiliary supervision based on a novel adjacent depth loss which, unlike the traditional depth loss, fully exploits the correlation between adjacent pixels. The proposed MRGA-ViT model achieves competitive generalization and stability: the ACER (%) values for intra-dataset testing on the OULU-NPU database are 1.8, 2.6, 1.6 ± 1.2, and 1.9 ± 2.7 on the four protocols, respectively; the AUC (%) for cross-type testing reaches 99.45 ± 0.57; and the ACER (%) values for cross-dataset testing are 28.1 and 36.7, respectively. Experimental results show that the proposed model is competitive with state-of-the-art methods in generalization and stability.
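To make the residual gradient convolution concrete, the following is a minimal PyTorch sketch, assuming a design in which fixed Sobel-style gradient kernels are applied depthwise and their responses are added residually to a learnable convolution. The abstract does not specify the redesigned gradient operators, so the kernel choice, the alpha weighting, and all identifiers here are illustrative assumptions rather than the paper's exact design.

```python
# Illustrative sketch of a residual gradient convolution (RGC).
# ASSUMPTION: the paper's exact operator design is not given in the abstract;
# here fixed Sobel-style gradient kernels are applied per channel (depthwise)
# and their responses are added residually to a learnable 3x3 convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualGradientConv(nn.Module):
    def __init__(self, in_ch, out_ch, alpha=0.5):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.alpha = alpha  # weight of the gradient branch (assumed hyperparameter)
        # Fixed horizontal/vertical Sobel kernels, applied depthwise per channel.
        sobel_x = torch.tensor([[-1., 0., 1.],
                                [-2., 0., 2.],
                                [-1., 0., 1.]])
        sobel_y = sobel_x.t()
        kernel = torch.stack([sobel_x, sobel_y])          # (2, 3, 3)
        kernel = kernel.repeat(in_ch, 1, 1).unsqueeze(1)  # (2*in_ch, 1, 3, 3)
        self.register_buffer("grad_kernel", kernel)
        self.in_ch = in_ch
        # Project the per-channel gradient maps to the output width.
        self.proj = nn.Conv2d(2 * in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        # Learnable convolution branch.
        out = self.conv(x)
        # Depthwise gradient branch: fine-grained texture (edge) responses.
        grad = F.conv2d(x, self.grad_kernel, padding=1, groups=self.in_ch)
        # Residual combination of learned and gradient features.
        return out + self.alpha * self.proj(grad)

# Example: a 3-channel face crop at three scales (coarse/medium/fine textures).
if __name__ == "__main__":
    rgc = ResidualGradientConv(3, 16)
    for size in (256, 128, 64):  # assumed multiscale inputs
        y = rgc(torch.randn(1, 3, size, size))
        print(size, tuple(y.shape))
```

Running the operator at several input resolutions, as in the example, mirrors the multiscale extraction described above, where texture differences are divided into three levels.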
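Similarly, the adjacent depth loss is characterized in the abstract only as exploiting the correlation between adjacent pixels. The sketch below shows one plausible formulation that augments the traditional per-pixel depth loss with a term matching neighboring-pixel differences between the predicted and ground-truth depth maps; the beta weight and function names are assumptions, not the paper's definition.

```python
# Illustrative sketch of an "adjacent depth loss".
# ASSUMPTION: the abstract only states that the loss exploits correlations
# between adjacent pixels; here we penalize the mismatch between
# neighboring-pixel differences of the predicted and ground-truth depth maps,
# in addition to the usual per-pixel MSE. This is one plausible formulation,
# not the paper's exact definition.
import torch
import torch.nn.functional as F

def adjacent_depth_loss(pred, gt, beta=1.0):
    """pred, gt: (B, 1, H, W) depth maps in [0, 1]."""
    # Traditional per-pixel depth loss.
    pixel_loss = F.mse_loss(pred, gt)
    # Differences between horizontally and vertically adjacent pixels.
    dpx = pred[..., :, 1:] - pred[..., :, :-1]   # horizontal neighbors
    dpy = pred[..., 1:, :] - pred[..., :-1, :]   # vertical neighbors
    dgx = gt[..., :, 1:] - gt[..., :, :-1]
    dgy = gt[..., 1:, :] - gt[..., :-1, :]
    # Adjacent term: predicted local depth structure should match the target.
    adj_loss = F.mse_loss(dpx, dgx) + F.mse_loss(dpy, dgy)
    return pixel_loss + beta * adj_loss  # beta is an assumed weighting

# Example usage with random depth maps.
if __name__ == "__main__":
    pred = torch.rand(4, 1, 32, 32)
    gt = torch.rand(4, 1, 32, 32)
    print(adjacent_depth_loss(pred, gt).item())
```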