Finger vein recognition has been widely studied due to its advantages, such as high security, convenience, and living body recognition. At present, the performance of the most advanced finger vein recognition methods largely depends on the quality of finger vein images. However, when collecting finger vein images, due to the possible deviation of finger position, ambient lighting and other factors, the quality of the captured images is often relatively low, which directly affects the performance of finger vein recognition. In this study, we proposed a new model for finger vein recognition that combined the vision transformer architecture with the capsule network (ViT-Cap). The model can explore finger vein image information based on global and local attention and selectively focus on the important finger vein feature information. First, we split-finger vein images into patches and then linearly embedded each of the patches. Second, the resulting vector sequence was fed into a transformer encoder to extract the finger vein features. Third, the feature vectors generated by the vision transformer module were fed into the capsule module for further training. We tested the proposed method on four publicly available finger vein databases. Experimental results showed that the average recognition accuracy of the algorithm based on the proposed model was above 96%, which was better than the original vision transformer, capsule network, and other advanced finger vein recognition algorithms. Moreover, the equal error rate (EER) of our model achieved state-of-the-art performance, especially reaching less than 0.3% under the test of FV-USM datasets which proved the effectiveness and reliability of the proposed model in finger vein recognition.