Shack–Hartmann wavefront sensors (SHWFSs) are crucial for detecting distortions in adaptive optics systems, but the accuracy of wavefront reconstruction is often hampered by low guide star brightness or strong atmospheric turbulence. This study introduces a new method that uses a Vision Transformer model to process image information from SHWFSs. Unlike traditional methods, the model assigns a weight to each subaperture by jointly considering the subaperture's position and image information, and it directly produces wavefront reconstruction results. Comparative evaluations on simulated SHWFS light intensity images and the corresponding deformable mirror command vectors demonstrate the robustness and accuracy of the Vision Transformer across guide star magnitudes and atmospheric conditions, relative to convolutional neural networks (CNNs), represented in this study by a residual network (ResNet) of the kind widely used in prior work. Notably, normalization preprocessing significantly improves CNN performance (raising the Strehl ratio by up to 0.2 under low turbulence), whereas its effect on the Vision Transformer is mixed: it improves performance under low turbulence intensity and high guide star brightness (Strehl ratio up to 0.8) but degrades it under high turbulence intensity and low brightness (Strehl ratio reduced to about 0.05). Overall, the Vision Transformer consistently outperforms the CNN models across all tested conditions, improving the Strehl ratio by an average of 0.2 over the CNNs.
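To make the architecture described above concrete, the following is a minimal sketch, not the authors' implementation, of a Vision-Transformer-style regressor in PyTorch that treats each SHWFS subaperture image as one token, adds a learnable positional embedding for the subaperture's location in the lenslet grid, and regresses a deformable mirror command vector. The class name `SubapertureViT`, the grid size, the patch size, the actuator count, and all layer dimensions are illustrative assumptions rather than values from the study.

```python
# Minimal sketch (assumptions: PyTorch, a 16x16 subaperture grid of 32x32-pixel
# spot images, a 97-actuator DM; all sizes are illustrative, not from the paper).
import torch
import torch.nn as nn

class SubapertureViT(nn.Module):
    """Vision-Transformer-style regressor: one token per SHWFS subaperture."""
    def __init__(self, n_subaps=256, subap_pixels=32 * 32, d_model=256,
                 n_heads=8, n_layers=6, n_actuators=97):
        super().__init__()
        # Linear embedding of each flattened subaperture image (one "patch" = one subaperture).
        self.embed = nn.Linear(subap_pixels, d_model)
        # Learnable positional embedding encodes where each subaperture sits in the lenslet grid.
        self.pos_embed = nn.Parameter(torch.zeros(1, n_subaps, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Regression head: mean-pooled tokens -> DM command vector.
        self.head = nn.Linear(d_model, n_actuators)

    def forward(self, subap_images):
        # subap_images: (batch, n_subaps, subap_pixels) flattened intensity patches
        tokens = self.embed(subap_images) + self.pos_embed
        tokens = self.encoder(tokens)         # self-attention weighs subapertures against one another
        return self.head(tokens.mean(dim=1))  # predicted DM commands

# Usage example (hypothetical shapes):
# x = torch.randn(4, 256, 1024)          # 4 frames, 256 subapertures, 32x32 pixels each
# cmds = SubapertureViT()(x)             # -> (4, 97) DM command vectors
```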