Sheep face recognition is an efficient, non-contact identification technology; however, it has not been widely adopted because traditional methods for collecting sheep face images are time-consuming and labor-intensive. To address this, we developed a multi-view sheep face image acquisition device and used it to construct a multi-view sheep face dataset from 50 experimental sheep. We also developed a high-precision sheep face recognition model, T2T-ViT-SFR, which combines several optimization strategies to improve performance. Specifically, we introduced the Squeeze-and-Excitation (SE) attention mechanism into the backbone to enhance the model's ability to learn useful information while suppressing non-essential information. The LayerScale module was embedded in the transformer of the backbone to prevent model collapse as the network depth increased. In addition, the Additive Angular Margin Loss (ArcFace) was integrated at the head to address the challenge posed by the high similarity among sheep faces. Experiments on the multi-view sheep face dataset showed that T2T-ViT-SFR achieved the best recognition performance after pre-training, with an accuracy of 95.9% and an F1-score of 95.5%, significantly outperforming the sheep face recognition models proposed in previous studies. Compared to T2T-ViT-24, the accuracy and F1-score of T2T-ViT-SFR increased by 2.6% and 2.1%, respectively, demonstrating the effectiveness of the proposed optimization strategies. This study provides a feasible new strategy for practical sheep identification, covering the complete process from image acquisition to recognition result output.
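To illustrate how an additive angular margin makes highly similar identities harder to confuse, the following is a minimal sketch of ArcFace-style logits in plain Python. It is not the T2T-ViT-SFR implementation: the function names, the toy two-dimensional embeddings, and the values `margin=0.5` and `scale=64.0` (commonly used ArcFace defaults) are assumptions for illustration only. The sketch also omits the special handling that practical implementations apply when the angle plus the margin exceeds π.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosines."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def arcface_logits(embedding, class_weights, target, margin=0.5, scale=64.0):
    """Return scaled cosine logits with an angular margin on the target class.

    margin and scale are illustrative assumptions, not values from the paper.
    """
    e = l2_normalize(embedding)
    logits = []
    for class_index, weight in enumerate(class_weights):
        w = l2_normalize(weight)
        # Cosine of the angle between the embedding and the class centre.
        cos_theta = sum(a * b for a, b in zip(e, w))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
        if class_index == target:
            # Penalise the true class: cos(theta + m) is smaller than cos(theta)
            # as long as theta + m stays below pi.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    return logits


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single sample."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


# Toy example: one embedding sitting exactly between two class centres.
weights = [[1.0, 0.0], [0.0, 1.0]]
embedding = [1.0, 1.0]
loss_plain = cross_entropy(arcface_logits(embedding, weights, 0, margin=0.0), 0)
loss_margin = cross_entropy(arcface_logits(embedding, weights, 0, margin=0.5), 0)
```

In this toy case the embedding is equally close to both class centres, so the margin-free loss is about ln 2, while the margin version yields a much larger loss. During training this pushes embeddings of the same animal closer to their own class centre than plain softmax would require, which is the property that helps when faces of different individuals look alike.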