This paper uses a literature review to systematically survey the basic principles of three Transformer-based models and their application to image classification. The survey has theoretical and practical value and can serve as a reference for the future development of Transformer models. The Vision Transformer (ViT) adapts the Transformer model to computer vision. It first splits an image into fixed-size local patches (for example, 16x16 pixels) and maps each patch to a feature vector. A learnable classification token is added to the sequence, and position embeddings are added to the vectors so that location information is retained. The sequence is then passed to a Transformer encoder, and the final prediction is based on the output for the classification token. Swin Transformer (Swin-T) is a Transformer architecture proposed by Microsoft Research to improve performance on computer vision tasks. It adopts a windowed feature extraction strategy, in which self-attention is computed within local windows that shift between layers, and this maintains high accuracy while significantly reducing computation and memory consumption. It has achieved leading performance on multiple computer vision tasks, becoming one of the most advanced visual Transformer models. In image classification, image information is highly redundant: when a patch is missing, the model is rarely confused, because the missing content can be inferred from the surrounding pixels. The masked autoencoder (MAE) therefore masks a high proportion of image patches to create a difficult learning task. The method is simple but extremely effective.
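The patch splitting and high-ratio masking described above can be illustrated with a minimal sketch in plain Python. It is not taken from any of the surveyed implementations: the helper names `patchify` and `random_mask`, the toy 64x64 single-channel image, and the mask ratio of 0.75 are assumptions chosen for illustration, and the linear projection, position embeddings, and encoder are omitted.

```python
import random


def patchify(image, patch):
    """Split an H x W image (a list of rows) into non-overlapping
    patch x patch tiles, each flattened into a list of pixel values."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[top + r][left + c]
                    for r in range(patch)
                    for c in range(patch)]
            patches.append(tile)
    return patches


def random_mask(num_patches, mask_ratio, rng):
    """Randomly choose which patches to hide.
    Returns (visible_indices, masked_indices), both sorted."""
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Toy 64x64 grayscale image (illustrative data, not a real picture).
image = [[(r * 64 + c) % 256 for c in range(64)] for r in range(64)]

patches = patchify(image, 16)           # 16 patches of 256 values each
rng = random.Random(0)                  # fixed seed for repeatability
visible, masked = random_mask(len(patches), 0.75, rng)

# Only the visible patches would be passed on to an encoder.
encoder_input = [patches[i] for i in visible]
print(len(patches), len(visible), len(masked))  # prints "16 4 12"
```

With a 16x16 patch size, the 64x64 toy image yields 16 patches, of which only 4 remain visible at this mask ratio. This is what makes the reconstruction task difficult, and it also means an encoder that sees only the visible patches processes a much shorter sequence.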