With the growing availability of databases for face presentation attack detection, researchers are increasingly turning to video-based face anti-spoofing methods that require hundreds to thousands of frames to train their models. However, there is currently no clear consensus on how many frames per video are needed for effective spoofing detection. Inspired by visual saliency theory, we present a video summarization method for face anti-spoofing that aims to improve the performance and efficiency of deep learning models. In particular, saliency information is extracted from the differences between the Laplacian and Wiener filter outputs of the source images, enabling the identification of the most visually salient regions within each frame. The source images are then decomposed into base and detail images, enhancing the representation of the most important information. Weighting maps are computed from the saliency information, indicating the importance of each pixel in the image. By linearly combining the base and detail images using these weighting maps, the method fuses the source images into a single representative image that summarizes the entire video. The key contribution of the proposed method lies in demonstrating how visual saliency can serve as a data-centric approach to improving both the performance and the efficiency of face presentation attack detection: by focusing on the most salient images, or the most salient regions within them, a more representative and diverse training set can be created, potentially leading to more effective models. To validate the method's effectiveness, a simple CNN–RNN deep learning architecture was used, and the experimental results demonstrate state-of-the-art performance on four challenging face anti-spoofing datasets.
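To make the fusion pipeline concrete, the following is a minimal sketch of the saliency-weighted fusion described above, assuming grayscale frames normalized to [0, 1]. The filter window sizes, the Gaussian low-pass used for the base/detail decomposition, and the per-pixel weight normalization are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy.ndimage import laplace, gaussian_filter
from scipy.signal import wiener

def saliency_map(frame):
    """Per-frame saliency: absolute difference between the Laplacian
    and Wiener filter responses of the frame (assumed grayscale float)."""
    return np.abs(laplace(frame) - wiener(frame, mysize=5))

def fuse_video(frames, sigma=5.0):
    """Fuse a list of grayscale frames into one representative image.

    Sketch of the described scheme: each frame is split into a
    low-frequency base image and a high-frequency detail image; the
    saliency maps, normalized per pixel across frames, serve as the
    weighting maps for a linear combination of both components.
    """
    frames = [f.astype(np.float64) for f in frames]
    bases = [gaussian_filter(f, sigma) for f in frames]      # base images (low-pass)
    details = [f - b for f, b in zip(frames, bases)]         # detail images (residual)
    sal = np.stack([saliency_map(f) for f in frames])        # (N, H, W) saliency stack
    weights = sal / (sal.sum(axis=0, keepdims=True) + 1e-12) # per-pixel weighting maps
    base_fused = sum(w * b for w, b in zip(weights, bases))
    detail_fused = sum(w * d for w, d in zip(weights, details))
    return base_fused + detail_fused                         # single summary image
```

In this sketch a Gaussian low-pass produces the base image; an averaging (box) filter, as used in many two-scale fusion schemes, would serve equally well. The resulting summary image can then be fed to the downstream classifier in place of the full frame sequence.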